K8s Monitoring: An Analysis of the Metrics-Server Metric Collection Pipeline

Metrics-server collects metric data based on cAdvisor, fetches and formats it, and exposes it through the apiserver as the Metrics API. Its core role is to supply decision-making metrics to components such as kubectl top and HPA.

This article walks through the metric collection pipeline around metrics-server, with the goal of quickly resolving common problems in container metric monitoring, such as:

Why is the resource usage shown by kubectl top node much higher than what top shows?

Why does kubectl top node show a resource percentage above 100%?

Why does HPA scale abnormally based on its metrics?

Metrics-Server Metric Collection Pipeline

The following is the pipeline through which metrics-server collects the basic metrics (CPU/Memory): cgroup as the data source, cAdvisor responsible for data collection, kubelet responsible for computation and aggregation, and finally the apiserver exposing the data as an API for clients (HPA/kubectl top) to consume.

Data request flow in the figure above (walking the pipeline in reverse):

Step 1. kubectl top sends a request to the Metrics API on the APIServer:
#kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx
#kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx

Step 2. The aggregation layer forwards the request to the backend svc metrics-server, according to the APIService definition for "metrics.k8s.io":
#kubectl get apiservices  v1beta1.metrics.k8s.io

Step 3. The metrics-server pod returns the most recently scraped metric data.
Note: the container monitoring console of Alibaba Cloud CloudMonitor gets its data from the sink provider port 8093 configured in metrics-server.
# kubectl get svc -n kube-system heapster -oyaml   Check whether the legacy heapster svc also points to the backend metrics-server pod.
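A quick way to confirm where that legacy svc points is to compare its endpoints with those of metrics-server (this assumes the heapster Service still exists in kube-system; per section 4.2 below, both should resolve to the same metrics-server pod IP):

# Both endpoints should list the same pod IP if the legacy heapster svc is repointed to metrics-server
kubectl -n kube-system get ep heapster metrics-server -o wide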

Step 4. Metrics-server periodically collects data from the endpoint exposed by kubelet, converts it into the Kubernetes API format, and exposes it through the Metrics API.
Metrics-server does no data collection of its own and no permanent storage; it essentially transforms kubelet's data.
kubelet computes and aggregates on top of the data collected by cAdvisor, providing container + pod + node level cgroup data:
older version: # curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true
v0.6.0+:       # curl 127.0.0.1:10255/metrics/resource

Step 5. cAdvisor periodically collects data from cgroup at the container cgroup level. cAdvisor's endpoint is /metrics/cadvisor; it provides only container and machine data and does not compute pod/node level metrics.
#curl http://127.0.0.1:10255/metrics/cadvisor

Step 6. Inspect the data in the metric files under the cgroup directory corresponding to the node/pod.
#cd  /sys/fs/cgroup/xxxx

Viewing the Metrics in Different Ways

This section briefly summarizes the different ways to view the metrics; each step is analyzed in detail later.

1. Determine the node's cgroup root directory

mount |grep cgroup

The node's cgroup root directory: /sys/fs/cgroup
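The calculations later in this article assume cgroup v1 paths. A quick way to tell which cgroup version a node runs (this matters because, as the cAdvisor code later shows, the inactive-file key name differs between v1 and v2):

# tmpfs     => cgroup v1 hierarchies mounted under /sys/fs/cgroup
# cgroup2fs => cgroup v2 (unified hierarchy)
stat -fc %T /sys/fs/cgroup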

2. Three ways to locate the cgroup metric file directory of a pod/container

  • Locate the metric files via the pod UID:
Get the pod UID: kubectl get pod -n xxxx  xxxxx  -oyaml |grep Uid -i -B10
Get the container ID: kubectl describe pod -n xxxx  xxxxx  |grep id -i
With the UID obtained above you can, for example, enter the cgroup directory of a container in the pod:

/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod-<pod-uid>.slice/cri-containerd-<container-id>.scope

  • Locate the metric source files via the container PID (a consolidated sketch of this approach follows after the list). Get the cgroup file directory for the pod's PID:
crictl pods |grep pod-name        # gives the pod-id
crictl ps |grep container-name    # or: crictl ps |grep pod-id, gives the container-id
crictl inspect <container-id> |grep -i pid
cat /proc/${CPID}/cgroup   # or: cat /proc/${CPID}/mountinfo |grep cgroup

In the cgroup file, for example, you can see the pod's specific cgroup sub-directories:

"cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope",
"memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podxxxx.slice/cri-containerd-xxxx.scope",
  • The pod's state.json file also shows the pod's cgroup information
# crictl pods |grep pod-name
This gives the <pod-id>; note that the pod-id is not the pod UID.
# cat /run/containerd/runc/k8s.io/<pod-id>/state.json  |jq .
"cgroup_paths": {
  "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  "cpuset": "/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  ...
  "memory": "/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope",
  ...
},
"namespace_paths": {
  "NEWCGROUP": "/proc/8443/ns/cgroup",
  "NEWNET": "/proc/8443/ns/net",
  "NEWNS": "/proc/8443/ns/mnt",
  "NEWPID": "/proc/8443/ns/pid",
  "NEWUTS": "/proc/8443/ns/uts"
  ...
},
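Putting the PID-based approach together, a consolidated sketch (assumptions: containerd as the runtime, jq installed, <pod-name> is a placeholder, and the exact field layout of crictl inspect can differ between runtime versions):

# pod name -> pod-id -> container-id -> host PID -> cgroup sub-paths
POD_ID=$(crictl pods --name <pod-name> -q)
CONTAINER_ID=$(crictl ps --pod "$POD_ID" -q | head -n1)   # first container of the pod
CPID=$(crictl inspect "$CONTAINER_ID" | jq '.info.pid')   # host PID of the container process
grep -E 'memory|cpuacct' /proc/"$CPID"/cgroup             # prints the cgroup sub-paths shown above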

3. Metric calculation

Enter the cgroup directory corresponding to the node/pod/container and inspect the files for each metric; the meaning of every metric file is not covered here.

//pod cgroup for cpu
# cd /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
...
cpuacct.block_latency        cpuacct.stat             cpuacct.usage_percpu_sys   cpuacct.wait_latency  cpu.cfs_period_us       cpu.stat
cpuacct.cgroup_wait_latency  cpuacct.usage            cpuacct.usage_percpu_user  cpu.bvt_warp_ns       cpu.cfs_quota_us        notify_on_release

//Note: the cpu and cpuacct cgroup directories are in fact symlinks to cpu,cpuacct
lrwxrwxrwx  1 root root  11 Mar  5 05:10 cpu -> cpu,cpuacct
lrwxrwxrwx  1 root root  11 Mar  5 05:10 cpuacct -> cpu,cpuacct
dr-xr-xr-x  7 root root   0 Apr 28 15:41 cpu,cpuacct
dr-xr-xr-x  3 root root   0 Apr  3 17:33 cpuset

//pod cgroup for memory
# cd  /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f8616f847da6685139113f6df14b010.slice/cri-containerd-c2eade28d94676563342077bab6c95bf48add7b872d66f246846b83d0eec5c78.scope
# ls
memory.direct_swapout_global_latency  memory.kmem.tcp.failcnt             memory.min                       memory.thp_reclaim          tasks
memory.direct_swapout_memcg_latency   memory.kmem.tcp.limit_in_bytes      memory.move_charge_at_immigrate  memory.thp_reclaim_ctrl
memory.events                         memory.kmem.tcp.max_usage_in_bytes  memory.numa_stat                 memory.thp_reclaim_stat
memory.events.local                   memory.kmem.tcp.usage_in_bytes      memory.oom_control               memory.usage_in_bytes
...
  • Memory calculation. Formula:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

For example:

usage_in_bytes=$(cat memory.usage_in_bytes)
total_inactive_file=$(cat memory.stat | grep total_inactive_file | awk '{print $2}')
workingset=$((usage_in_bytes - total_inactive_file))
# Convert to MiB:
echo $((workingset / 1024 / 1024))
  • CPU calculation:

Formula:

cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()

//The cgroup file cpuacct.usage holds the cumulative value usageCoreNanoSeconds; the formula above must be applied to obtain the CPU usage.
//For an interval from startTime to endTime, the instantaneous CPU core usage is:
(usageCoreNanoSeconds at endTime - usageCoreNanoSeconds at startTime) / (endTime - startTime)

For example, to compute the average usage over the last ten seconds (not a percentage, but the number of CPU cores used):

tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;

The manual CPU calculation script above is quoted from: 《记一次pod oom的异常shmem输出》

4. Viewing the cAdvisor metrics

  • Fetching cAdvisor metrics through the raw API: kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor
  • Fetching data through cAdvisor's local endpoint /metrics/cadvisor: curl http://127.0.0.1:10255/metrics/cadvisor

5. Fetching metrics through the kubelet endpoints

curl 127.0.0.1:10255/metrics/resource

curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true |jq '.node'

6. Fetching data through the metrics-server API (the way kubectl top and HPA call it)

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/xxxx |jq .

kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/xxxx/pods/xxxx |jq .

The Metric Pipeline Explained in Five Parts

If a client fails to obtain metric data, you can query each hop of the pipeline (cgroup -> cAdvisor -> kubelet -> metrics-server pod -> apiserver/Metrics API) through the endpoint it exposes, and pinpoint where the problem occurs.

1. Data source: the Linux cgroup hierarchy

The outermost layer is the node cgroup => QoS-level cgroups => pod-level cgroups => container-level cgroups.

The node cgroup consists of the kubepods, user, and system parts.

The figure below shows the cgroup containment hierarchy:

You can also inspect the hierarchy with systemd-cgls; the following is run from the node-level root directory /sys/fs/cgroup.
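For example, on a cgroup v1 node with the systemd cgroup driver, the QoS layer is directly visible under kubepods.slice (paths are illustrative):

# Burstable and BestEffort pods get a QoS-level slice; Guaranteed pods sit directly under kubepods.slice
ls -d /sys/fs/cgroup/memory/kubepods.slice/*/
# kubepods-besteffort.slice/  kubepods-burstable.slice/  kubepods-podxxxx.slice/ ...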

2. cAdvisor collects container-level metric data from cgroup

Note: cAdvisor is also only a metric collector; its data comes from the cgroup files.

2.1 How cAdvisor computes the working_set memory value

	inactiveFileKeyName := "total_inactive_file"
	if cgroups.IsCgroup2UnifiedMode() {
		inactiveFileKeyName = "inactive_file"
	}

	workingSet := ret.Memory.Usage
	if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
		if workingSet < v {
			workingSet = 0
		} else {
			workingSet -= v
		}
	}
	ret.Memory.WorkingSet = workingSet
}

2.2 Data analysis of cAdvisor's /metrics/cadvisor endpoint

kubectl get --raw=/api/v1/nodes/nodename/proxy/metrics/cadvisor

or: curl http://127.0.0.1:10255/metrics/cadvisor

Note: /metrics/cadvisor is also one of Prometheus's data sources.

https://github.com/google/cadvisor/blob/master/info/v1/container.go#L320

This endpoint returns the following metrics:

# curl http://127.0.0.1:10255/metrics/cadvisor   |awk -F "\{" '{print $1}' |sort |uniq -c |grep "#" |grep -v TYPE 
1 # HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
1 # HELP container_blkio_device_usage_total Blkio Device bytes usage
1 # HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
1 # HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
1 # HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled.
1 # HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
1 # HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
1 # HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
1 # HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
1 # HELP container_file_descriptors Number of open file descriptors for the container.
1 # HELP container_fs_inodes_free Number of available Inodes
1 # HELP container_fs_inodes_total Number of Inodes
1 # HELP container_fs_io_current Number of I/Os currently in progress
1 # HELP container_fs_io_time_seconds_total Cumulative count of seconds spent doing I/Os
1 # HELP container_fs_io_time_weighted_seconds_total Cumulative weighted I/O time in seconds
1 # HELP container_fs_limit_bytes Number of bytes that can be consumed by the container on this filesystem.
1 # HELP container_fs_reads_bytes_total Cumulative count of bytes read
1 # HELP container_fs_read_seconds_total Cumulative count of seconds spent reading
1 # HELP container_fs_reads_merged_total Cumulative count of reads merged
1 # HELP container_fs_reads_total Cumulative count of reads completed
1 # HELP container_fs_sector_reads_total Cumulative count of sector reads completed
1 # HELP container_fs_sector_writes_total Cumulative count of sector writes completed
1 # HELP container_fs_usage_bytes Number of bytes that are consumed by the container on this filesystem.
1 # HELP container_fs_writes_bytes_total Cumulative count of bytes written
1 # HELP container_fs_write_seconds_total Cumulative count of seconds spent writing
1 # HELP container_fs_writes_merged_total Cumulative count of writes merged
1 # HELP container_fs_writes_total Cumulative count of writes completed
1 # HELP container_last_seen Last time a container was seen by the exporter
1 # HELP container_memory_cache Number of bytes of page cache memory.
1 # HELP container_memory_failcnt Number of memory usage hits limits
1 # HELP container_memory_failures_total Cumulative count of memory allocation failures.
1 # HELP container_memory_mapped_file Size of memory mapped files in bytes.
1 # HELP container_memory_max_usage_bytes Maximum memory usage recorded in bytes
1 # HELP container_memory_rss Size of RSS in bytes.
1 # HELP container_memory_swap Container swap usage in bytes.
1 # HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
1 # HELP container_memory_working_set_bytes Current working set in bytes.
1 # HELP container_network_receive_bytes_total Cumulative count of bytes received
1 # HELP container_network_receive_errors_total Cumulative count of errors encountered while receiving
1 # HELP container_network_receive_packets_dropped_total Cumulative count of packets dropped while receiving
1 # HELP container_network_receive_packets_total Cumulative count of packets received
1 # HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
1 # HELP container_network_transmit_errors_total Cumulative count of errors encountered while transmitting
1 # HELP container_network_transmit_packets_dropped_total Cumulative count of packets dropped while transmitting
1 # HELP container_network_transmit_packets_total Cumulative count of packets transmitted
1 # HELP container_processes Number of processes running inside the container.
1 # HELP container_scrape_error 1 if there was an error while getting container metrics, 0 otherwise
1 # HELP container_sockets Number of open sockets for the container.
1 # HELP container_spec_cpu_period CPU period of the container.
1 # HELP container_spec_cpu_quota CPU quota of the container.
1 # HELP container_spec_cpu_shares CPU share of the container.
1 # HELP container_spec_memory_limit_bytes Memory limit for the container.
1 # HELP container_spec_memory_reservation_limit_bytes Memory reservation limit for the container.
1 # HELP container_spec_memory_swap_limit_bytes Memory swap limit for the container.
1 # HELP container_start_time_seconds Start time of the container since unix epoch in seconds.
1 # HELP container_tasks_state Number of tasks in given state
1 # HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero
1 # HELP container_threads Number of threads running inside the container
1 # HELP container_ulimits_soft Soft ulimit values for the container root process. Unlimited if -1, except priority and nice
1 # HELP machine_cpu_cores Number of logical CPU cores.
1 # HELP machine_cpu_physical_cores Number of physical CPU cores.
1 # HELP machine_cpu_sockets Number of CPU sockets.
1 # HELP machine_memory_bytes Amount of memory installed on the machine.
1 # HELP machine_nvm_avg_power_budget_watts NVM power budget.
1 # HELP machine_nvm_capacity NVM capacity value labeled by NVM mode (memory mode or app direct mod

2.3 Question: does kubectl top pod include the pause container's metrics?

Experiments show that the cAdvisor endpoint does return data for the pause container, which differs from the conclusion of many online posts; this remains to be confirmed.

# curl http://127.0.0.1:10255/metrics/cadvisor   |grep csi-plugin |grep container_cpu_usage_seconds_total
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice",image="",name="",namespace="kube-system",pod="csi-plugin-tfrc6"} 675.44788393 1665548285204
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/pause:3.5",name="9f878a414fa02a1009e84dc7cea417084e9f856e8ae811112d841b9b1b86713f",namespace="kube-system",pod="csi-plugin-tfrc6"} 0.019981257 1665548276989
container_cpu_usage_seconds_total{container="csi-plugin",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.22.11-abbb810e-aliyun",name="eab2bc5aba47420c5dcfee0cdc63e9327f6154e95a64da2a10eb5c4b9bc9b8d0",namespace="kube-system",pod="csi-plugin-tfrc6"} 452.018439495 1665548286584
container_cpu_usage_seconds_total{container="csi-provisioner",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod3ef3dfea_a052_4ed5_8b21_647e0ac42817.slice/cri-containerd-9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="9b2a563af4409e0f19260af23f46c3f10102f5089ad97c696dbb77be20d95a82",namespace="kube-system",pod="csi-provisioner-66d47b7f64-88lzs"} 533.027518847 1665548276678
container_cpu_usage_seconds_total{container="disk-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="b565d92528539b319d929935e20316c3cd121ab816f00ec5e09f2b1d7ec57eec",namespace="kube-system",pod="csi-plugin-tfrc6"} 79.695711837 1665548287771
container_cpu_usage_seconds_total{container="nas-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="addd318c7faf17a566d11b9916452df298eb1c4f96214d8ca572920d53a05e06",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.727112206 1665548288055
container_cpu_usage_seconds_total{container="oss-driver-registrar",cpu="total",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9bfea2a2_75a9_4c41_8225_77f9053ee153.slice/cri-containerd-7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.0-038aeb6-aliyun",name="7742122f2aca67336939e9c4f24696df5dc2d400e3674c4008c5e8202bf5ed7a",namespace="kube-system",pod="csi-plugin-tfrc6"} 71.986891847 1665548274774
# curl http://127.0.0.1:10255/metrics/cadvisor   |grep csi-plugin-cvjwm |grep container_memory_working_set_bytes
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice",image="",name="",namespace="kube-system",pod="csi-plugin-cvjwm"} 5.6066048e+07 1651810984383
container_memory_working_set_bytes{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e.scope",image="sha256:ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459",name="2b08fab496e500709102969ed459f5150e7db008194fb3e71f1f7b0ad48f7e8e",namespace="kube-system",pod="csi-plugin-cvjwm"} 40960 1651810984964
container_memory_working_set_bytes{container="csi-plugin",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-plugin:v1.20.7-aafce42-aliyun",name="5c79494498d7905db8ffda91f0037dc7208fcf75b67a2c85932065e95640bd77",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.316288e+07 1651810988657
container_memory_working_set_bytes{container="disk-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="c9b1bac3ca7539b317047188f233c3396be3df0796bc90b408caf98a4f51a70b",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1063296e+07 1651810991117
container_memory_working_set_bytes{container="nas-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="6e4c525abb2337ae7eb6f884dc9fc7a2605699581e450d4586173f8b7a4187cd",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651810981728
container_memory_working_set_bytes{container="oss-driver-registrar",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podefa8311a_ec9b_4a18_a3ce_4bf39149a314.slice/cri-containerd-34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79.scope",image="registry-vpc.cn-beijing.aliyuncs.com/acs/csi-node-driver-registrar:v1.3.0-6e9fff3-aliyun",name="34f5248e3a3f812480a42b16e3c8514b41f8da8e5ae1c84684dd2525f583da79",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651810991024
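A quick arithmetic check on the working-set output above (values copied from the grep result; this only illustrates the relationship between the series):

# Sum of the four application containers of csi-plugin-cvjwm:
echo $(( 23162880 + 11063296 + 10940416 + 10846208 ))   # 56012800
# The pod-level series (container="") reports 56066048 bytes; the 53248-byte gap includes
# the pause container's 40960 bytes plus other pod-level cgroup memory.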

3. The kubelet API endpoints provide node/pod/container-level metrics

kubelet exposes two endpoints whose values it computes itself, because cAdvisor collects only container-level raw data and does not include the aggregated values for pods and nodes.

The endpoints requested by different metrics-server versions are:

older version: /stats/summary?only_cpu_and_memory=true

v0.6.0+: /metrics/resource

Metrics-server v0.6.0 fetches data from kubelet's /metrics/resource:

// GetMetrics implements client.KubeletMetricsGetter
func (kc *kubeletClient) GetMetrics(ctx context.Context, node *corev1.Node) (*storage.MetricsBatch, error) {
    port := kc.defaultPort
    nodeStatusPort := int(node.Status.DaemonEndpoints.KubeletEndpoint.Port)
    if kc.useNodeStatusPort && nodeStatusPort != 0 {
        port = nodeStatusPort
    }
    addr, err := kc.addrResolver.NodeAddress(node)
    if err != nil {
        return nil, err
    }
    url := url.URL{
        Scheme: kc.scheme,
        Host:   net.JoinHostPort(addr, strconv.Itoa(port)),
        Path:   "/metrics/resource",
    }
    return kc.getMetrics(ctx, url.String(), node.Name)
}
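Since metrics-server simply scrapes this kubelet endpoint, the same data can be fetched by hand; if the read-only port 10255 is not exposed, the apiserver node proxy path works as well (<node-name> is a placeholder):

kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/resource | head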

3.1 Metric data provided by the kubelet endpoints

The /metrics/resource endpoint:

This endpoint returns the following metrics:

$ curl 127.0.0.1:10255/metrics/resource |awk -F "{" '{print $1}'  | grep  "#" |grep HELP
# HELP container_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the container in core-seconds
# HELP container_memory_working_set_bytes [ALPHA] Current working set of the container in bytes
# HELP container_start_time_seconds [ALPHA] Start time of the container since unix epoch in seconds
# HELP node_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the node in core-seconds
# HELP node_memory_working_set_bytes [ALPHA] Current working set of the node in bytes
# HELP pod_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the pod in core-seconds
# HELP pod_memory_working_set_bytes [ALPHA] Current working set of the pod in bytes
# HELP scrape_error [ALPHA] 1 if there was an error while getting container metrics, 0 otherwise

For example, querying the metrics of a specific pod also includes its container metrics:

#curl 127.0.0.1:10255/metrics/resource |grep csi-plugin-cvjwm

container_cpu_usage_seconds_total{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 787.211455926 1651651825597
container_cpu_usage_seconds_total{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.58508124 1651651825600
container_cpu_usage_seconds_total{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.820329754 1651651825602
container_cpu_usage_seconds_total{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 16.016630434 1651651825605
container_memory_working_set_bytes{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 2.312192e+07 1651651825597
container_memory_working_set_bytes{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.1071488e+07 1651651825600
container_memory_working_set_bytes{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0940416e+07 1651651825602
container_memory_working_set_bytes{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.0846208e+07 1651651825605
container_start_time_seconds{container="csi-plugin",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.64639996363012e+09 1646399963630
container_start_time_seconds{container="disk-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399924462264e+09 1646399924462
container_start_time_seconds{container="nas-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.646399937591126e+09 1646399937591
container_start_time_seconds{container="oss-driver-registrar",namespace="kube-system",pod="csi-plugin-cvjwm"} 1.6463999541537158e+09 1646399954153
pod_cpu_usage_seconds_total{namespace="kube-system",pod="csi-plugin-cvjwm"} 836.633497354 1651651825616
pod_memory_working_set_bytes{namespace="kube-system",pod="csi-plugin-cvjwm"} 5.5980032e+07 1651651825643

The /stats/summary endpoint

The endpoint returns the following metrics, covering node/pod/container:

# curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true |jq '.node'

   {
  "nodeName": "ack.dedicated00009yibei",
  "systemContainers": [
    {
      "name": "pods",
      "startTime": "2022-03-04T13:17:01Z",
      "cpu": {
        ...
      },
      "memory": {
        ...
      }
    },
    {
      "name": "kubelet",
      "startTime": "2022-03-04T13:17:22Z",
      "cpu": {
        ...
      },
      "memory": {
        ...
      }
    }
  ],
  "startTime": "2022-03-04T13:11:06Z",
  "cpu": {
    "time": "2022-05-05T13:54:02Z",
    "usageNanoCores": 182358783,     =======> the value consumed by HPA/kubectl top; computed from the cumulative value below.
    "usageCoreNanoSeconds": 979924257151862   ====> cumulative value; the time window is used for the subsequent calculation.
  },
  "memory": {
    "time": "2022-05-05T13:54:02Z",
    "availableBytes": 1296015360,
    "usageBytes": 3079581696,
    "workingSetBytes": 2592722944,     =======> roughly matches what kubectl top node shows
    "rssBytes": 1459187712,
    "pageFaults": 9776943,
    "majorPageFaults": 1782
  }
}
###########################

 #curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true |jq '.pods[1]'
  {
    "podRef": {
      "name": "kube-flannel-ds-s6mrk",
      "namespace": "kube-system",
      "uid": "5b328994-c4a1-421d-9ab0-68992ca79807"
    },
    "startTime": "2022-03-04T13:18:41Z",
    "containers": [
      {
        ...
      }
    ],
    "cpu": {
      "time": "2022-05-05T13:53:03Z",
      "usageNanoCores": 2817176,
      "usageCoreNanoSeconds": 11613876607138
    },
    "memory": {
      "time": "2022-05-05T13:53:03Z",
      "availableBytes": 237830144,
      "usageBytes": 30605312,
      "workingSetBytes": 29876224,
      "rssBytes": 26116096,
      "pageFaults": 501002073,
      "majorPageFaults": 1716
    }
  }
###########################  
 # curl 127.0.0.1:10255/stats/summary?only_cpu_and_memory=true |jq '.pods[1].containers[0]'
 {
  "name": "kube-scheduler",
  "startTime": "2022-05-05T08:27:55Z",
  "cpu": {
    "time": "2022-05-05T13:56:16Z",
    "usageNanoCores": 1169892,
    "usageCoreNanoSeconds": 29353035680
  },
  "memory": {
    "time": "2022-05-05T13:56:16Z",
    "availableBytes": 9223372036817981000,
    "usageBytes": 36790272,
    "workingSetBytes": 32735232,
    "rssBytes": 24481792,
    "pageFaults": 5511,
    "majorPageFaults": 165
  }
}

3.2 How kubelet calculates CPU usage

Key CPU metrics explained:

usageCoreNanoSeconds: // cumulative value, in units of nano core * seconds; CPU usage is derived from it using the time window and the total number of cores.
usageNanoCores: // computed value, derived from two cumulative usageCoreNanoSeconds samples over a default time window.
CPU usage: usageNanoCores = (usageCoreNanoSeconds at endTime - usageCoreNanoSeconds at startTime) / (endTime - startTime)
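A worked example of the formula, reusing the usageCoreNanoSeconds sample from the /stats/summary output above (the second sample is illustrative):

prev=979924257151862      # usageCoreNanoSeconds at the previous scrape (from the summary above)
last=979929727915352      # illustrative value taken 30 seconds later
window=30
echo $(( (last - prev) / window ))   # 182358783 nanocores ≈ 0.18 cores, matching usageNanoCores above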

Calculation:

The window returned by the Metrics API is the time window used to compute CPU usage; for details on this calculation, see:

https://github.com/google/cadvisor/issues/2126

https://github.com/google/cadvisor/issues/2032

https://github.com/google/cadvisor/issues/2198#issuecomment-584230223

For example, to compute the average usage over the last ten seconds (not a percentage, but the number of CPU cores used):

tstart=$(date +%s%N);cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);sleep 10;tstop=$(date +%s%N);cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage);result=`awk 'BEGIN{printf "%.2f\n",'$(($cstop - $cstart))'/'$(($tstop - $tstart))'}'`;echo $result;

// MetricsPoint represents a set of specific metrics at some point in time.
type MetricsPoint struct {
    // StartTime is the start time of container/node. Cumulative CPU usage at that moment should be equal zero.
    StartTime time.Time
    // Timestamp is the time when metric point was measured. If CPU and Memory was measured at different time it should equal CPU time to allow accurate CPU calculation.
    Timestamp time.Time
    // CumulativeCpuUsed is the cumulative cpu used at Timestamp from the StartTime of container/node. Unit: nano core * seconds.
    CumulativeCpuUsed uint64
    // MemoryUsage is the working set size. Unit: bytes.
    MemoryUsage uint64
}

func resourceUsage(last, prev MetricsPoint) (corev1.ResourceList, api.TimeInfo, error) {
    if last.CumulativeCpuUsed < prev.CumulativeCpuUsed {
        return corev1.ResourceList{}, api.TimeInfo{}, fmt.Errorf("unexpected decrease in cumulative CPU usage value")
    }
    window := last.Timestamp.Sub(prev.Timestamp)
    cpuUsage := float64(last.CumulativeCpuUsed-prev.CumulativeCpuUsed) / window.Seconds()
    return corev1.ResourceList{
        corev1.ResourceCPU:    uint64Quantity(uint64(cpuUsage), resource.DecimalSI, -9),
        corev1.ResourceMemory: uint64Quantity(last.MemoryUsage, resource.BinarySI, 0),
    }, api.TimeInfo{
        Timestamp: last.Timestamp,
        Window:    window,
    }, nil
}

3.3 A script reproducing kubelet's node memory calculation (the logic used when evaluating eviction):

#!/bin/bash

# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"

4. Metrics-Server explained

4.1 About the APIService metrics.k8s.io

Deploying the APIService registers the Metrics API with the apiserver's aggregation layer, so that when the apiserver receives a request for the metrics API group (/metrics.k8s.io/), it forwards it to the backend metrics-server defined there for processing.

The backend svc referenced by kubectl get apiservices v1beta1.metrics.k8s.io is the metrics-server service.
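When troubleshooting, the first check is whether the APIService reports Available (output is illustrative):

kubectl get apiservice v1beta1.metrics.k8s.io
# NAME                     SERVICE                      AVAILABLE   AGE
# v1beta1.metrics.k8s.io   kube-system/metrics-server   True        300d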

If the APIService points to the wrong place and metric retrieval fails, you can clean up the resources and reinstall metrics-server:

1. Delete the v1beta1.metrics.k8s.io APIService:
kubectl delete apiservice v1beta1.metrics.k8s.io

2. Uninstall metrics-server:
kubectl delete deployment metrics-server -nkube-system

3. Reinstall the metrics-server component.

4.2 About the heapster svc

Heapster collected metrics in older clusters; the basic CPU/Memory metrics are now handled by metrics-server, and custom metrics can be handled by Prometheus. In ACK clusters, the heapster and metrics-server services both point to the metrics-server pod.

4.3 About the metrics-server pod startup parameters

The startup parameters of the official metrics-server component in an Alibaba Cloud ACK cluster are as follows:

containers:
- command:
  - /metrics-server
  - --source=kubernetes.hybrid:''
  - --sink=socket:tcp://monitor.csk.cn-beijing.aliyuncs.com:8093?clusterId=xxx&public=true
  image: registry-vpc.cn-beijing.aliyuncs.com/acs/metrics-server:v0.3.8.5-307cf45-aliyun

Note that port 8093, defined in the sink, is used by the metrics-server pod to provide metric data to Alibaba Cloud CloudMonitor.

Parameters not specified on the command line fall back to their defaults, for example:

--metric-resolution duration

The resolution at which metrics-server will retain metrics. (default 30s)
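To see which flags the running pod actually uses, and therefore which defaults apply, the container command can be read back from the pod spec (the k8s-app=metrics-server label is an assumption; adjust the selector to match your deployment):

kubectl -n kube-system get pod -l k8s-app=metrics-server \
  -o jsonpath='{.items[0].spec.containers[0].command}'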

Definition of the metric cache sink:

https://github.com/kubernetes-sigs/metrics-server/blob/v0.3.5/pkg/provider/sink/sinkprov.go#L134

// sinkMetricsProvider is a provider.MetricsProvider that also acts as a sink.MetricSink
type sinkMetricsProvider struct {
	mu    sync.RWMutex
	nodes map[string]sources.NodeMetricsPoint
	pods  map[apitypes.NamespacedName]sources.PodMetricsPoint
}

4.4 Metric data provided by the metrics-server API

Metrics-server essentially performs one more conversion on the data from the kubelet endpoint:

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei |jq .
{
  "kind": "NodeMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "ack.dedicated00009yibei",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/ack.dedicated00009yibei",
    "creationTimestamp": "2022-05-05T14:28:27Z"
  },
  "timestamp": "2022-05-05T14:27:32Z",
  "window": "30s",  ===> the time window used to diff the cumulative CPU values
  "usage": {
    "cpu": "157916713n",        ====> nanocores, computed from the cumulative CPU values
    "memory": "2536012Ki"       ====> the cgroup working_set
  }
}
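The cpu value is reported in nanocores (the trailing n); kubectl top node simply renders it in millicores:

# 1 millicore = 1,000,000 nanocores
awk 'BEGIN{printf "%.1fm\n", 157916713/1000000}'   # ≈ 157.9m, which kubectl top node shows as roughly 158m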

# kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj |jq .
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "centos7-74cd758d98-wcwnj",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/centos7-74cd758d98-wcwnj",
    "creationTimestamp": "2022-05-05T14:32:39Z"
  },
  "timestamp": "2022-05-05T14:32:04Z",
  "window": "30s",
  "containers": [
    {
      "name": "centos7",
      "usage": {
        "cpu": "0",
        "memory": "224Ki"
      }
    }
  ]
}

5. Other components calling the Metrics API

5.1 kubectl top

// Use -v=6 to see the API requests made by kubectl top

5.2 HPA

It requests the CPU/Memory values from the Metrics API, then performs its calculation and display:

https://kubernetes.io/zh/docs/tasks/run-application/horizontal-pod-autoscale/#kubectl-%E5%AF%B9-horizontal-pod-autoscaler-%E7%9A%84%E6%94%AF%E6%8C%81

Calculation logic:

DesiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
CurrentUtilization = int32((metricsTotal*100) / requestsTotal)
When computing requestsTotal, the code sums pod.Spec.Containers in a loop; if any container has no request it returns an error. Init containers are not included.
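A worked example of the two formulas above (illustrative numbers): three pods each requesting 500m CPU and currently using 300m on average, with a target utilization of 50%:

currentReplicas=3; metricsTotal=900; requestsTotal=1500; targetUtilization=50
currentUtilization=$(( metricsTotal * 100 / requestsTotal ))                      # 60
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
echo $(( (currentReplicas * currentUtilization + targetUtilization - 1) / targetUtilization ))   # 4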

*** A related takeaway ***

A Deployment targeted by HPA must define a request for every container; init containers are not required to have one.

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go#L433

func calculatePodRequests(pods []*v1.Pod, container string, resource v1.ResourceName) (map[string]int64, error) {
	requests := make(map[string]int64, len(pods))
	for _, pod := range pods {
		podSum := int64(0)
		for _, c := range pod.Spec.Containers {
			if container == "" || container == c.Name {
				if containerRequest, ok := c.Resources.Requests[resource]; ok {
					podSum += containerRequest.MilliValue()
				} else {
					return nil, fmt.Errorf("missing request for %s", resource)
				}
			}
		}
		requests[pod.Name] = podSum
	}
	return requests, nil
}
// GetResourceUtilizationRatio takes in a set of metrics, a set of matching requests,
// and a target utilization percentage, and calculates the ratio of
// desired to actual utilization (returning that, the actual utilization, and the raw average value)
func GetResourceUtilizationRatio(metrics PodMetricsInfo, requests map[string]int64, targetUtilization int32) (utilizationRatio float64, currentUtilization int32, rawAverageValue int64, err error) {
	metricsTotal := int64(0)
	requestsTotal := int64(0)
	numEntries := 0

	for podName, metric := range metrics {
		request, hasRequest := requests[podName]
		if !hasRequest {
			// we check for missing requests elsewhere, so assuming missing requests == extraneous metrics
			continue
		}

		metricsTotal += metric.Value
		requestsTotal += request
		numEntries++
	}

	// if the set of requests is completely disjoint from the set of metrics,
	// then we could have an issue where the requests total is zero
	if requestsTotal == 0 {
		return 0, 0, 0, fmt.Errorf("no metrics returned matched known pods")
	}
	currentUtilization = int32((metricsTotal * 100) / requestsTotal)

	return float64(currentUtilization) / float64(targetUtilization), currentUtilization, metricsTotal / int64(numEntries), nil
}

References:

《记一次pod oom的异常shmem输出》 https://developer.aliyun.com/article/1040230?spm=a2c6h.13262185.profile.46.994e2382PrPuO5
https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/cri-container-stats.md
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
https://github.com/kubernetes/kubernetes/blob/65178fec72df6275ed0aa3ede12c785ac79ab97a/pkg/controller/podautoscaler/replica_calculator.go#L424
https://github.com/kubernetes-sigs/metrics-server
https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-cpu-usage-is-calculated
cAdvisor source code analysis: https://cloud.tencent.com/developer/article/1096375
https://github.com/google/cadvisor/blob/d6b0ddb07477b17b2f3ef62b032d815b1cb6884e/machine/machine.go
https://github.com/google/cadvisor/tree/3beb265804ea4b00dc8ed9125f1f71d3328a7a94/container/libcontainer
https://www.jianshu.com/p/7c18075aa735
https://www.cnblogs.com/gaorong/p/11716907.html
https://www.cnblogs.com/vinsent/p/15830271.html