Compartir a través de

Azure Monitor 中的默认 Prometheus 指标配置

本文列出了将 Prometheus 指标配置为从任何 Azure Kubernetes 服务 (AKS) 群集中抓取时的默认目标、仪表板和记录规则。

最小引入配置文件

Minimal ingestion profile 是一种有助于减少指标引入量的设置,因为只有默认仪表板使用的指标,才会收集默认记录规则和默认警报。 对于基于加载项的集合,Minimal ingestion profile 设置默认处于启用状态。 可以修改集合以启用收集更多指标,如下所示。

抓取频率

所有默认目标和抓取的默认抓取频率为 30 秒。

默认抓取的目标

以下目标默认处于“已启用/开启”状态,这意味着你不必提供任何抓取作业配置来抓取这些目标,因为指标加载项默认会自动抓取这些目标。

  • cadvisor (job=cadvisor)
  • nodeexporter (job=node)
  • kubelet (job=kubelet)
  • kube-state-metrics (job=kube-state-metrics)
  • networkobservabilityRetina (job=networkobservabilityRetina)

启用控制平面指标(预览版)功能时,以下目标将启用/打开。 可以使用控制平面指标最大程度地提高整体可观测性,并保持 AKS 群集的卓越运营。

  • controlplane-apiserver (job=controlplane-apiserver)
  • controlplane-etcd (job=controlplane-etcd)

启用容器网络可观测性(这是高级容器网络服务套件的一项功能,它与所有 Linux 工作负载兼容,并与 Hubble 无缝集成,对于基于 Cilium 和非 Cilium 的数据平面均适用)时,以下目标将启用/打开。 这样可以灵活地满足容器网络需求。

  • networkobservabilityHubble (job=networkobservabilityHubble)
  • networkobservabilityCilium (job=networkobservabilityCilium)

从默认目标收集的指标

默认情况下,将从每个默认目标收集以下指标。 所有其他指标都将通过重新标记规则删除。

cadvisor (job=cadvisor)

  • container_spec_cpu_period
  • container_spec_cpu_quota
  • container_cpu_usage_seconds_total
  • container_memory_rss
  • container_network_receive_bytes_total
  • container_network_transmit_bytes_total
  • container_network_receive_packets_total
  • container_network_transmit_packets_total
  • container_network_receive_packets_dropped_total
  • container_network_transmit_packets_dropped_total
  • container_fs_reads_total
  • container_fs_writes_total
  • container_fs_reads_bytes_total
  • container_fs_writes_bytes_total
  • container_memory_working_set_bytes
  • container_memory_cache
  • container_memory_swap
  • container_cpu_cfs_throttled_periods_total
  • container_cpu_cfs_periods_total
  • container_memory_usage_bytes
  • kubernetes_build_info"

kubelet (job=kubelet)

  • kubelet_volume_stats_used_bytes
  • kubelet_node_name
  • kubelet_running_pods
  • kubelet_running_pod_count
  • kubelet_running_containers
  • kubelet_running_container_count
  • volume_manager_total_volumes
  • kubelet_node_config_error
  • kubelet_runtime_operations_total
  • kubelet_runtime_operations_errors_total
  • kubelet_runtime_operations_duration_seconds kubelet_runtime_operations_duration_seconds_bucket kubelet_runtime_operations_duration_seconds_sum kubelet_runtime_operations_duration_seconds_count
  • kubelet_pod_start_duration_seconds kubelet_pod_start_duration_seconds_bucket kubelet_pod_start_duration_seconds_sum kubelet_pod_start_duration_seconds_count
  • kubelet_pod_worker_duration_seconds kubelet_pod_worker_duration_seconds_bucket kubelet_pod_worker_duration_seconds_sum kubelet_pod_worker_duration_seconds_count
  • storage_operation_duration_seconds storage_operation_duration_seconds_bucket storage_operation_duration_seconds_sum storage_operation_duration_seconds_count
  • storage_operation_errors_total
  • kubelet_cgroup_manager_duration_seconds kubelet_cgroup_manager_duration_seconds_bucket kubelet_cgroup_manager_duration_seconds_sum kubelet_cgroup_manager_duration_seconds_count
  • kubelet_pleg_relist_duration_seconds kubelet_pleg_relist_duration_seconds_bucket kubelet_pleg_relist_duration_sum kubelet_pleg_relist_duration_seconds_count
  • kubelet_pleg_relist_interval_seconds kubelet_pleg_relist_interval_seconds_bucket kubelet_pleg_relist_interval_seconds_sum kubelet_pleg_relist_interval_seconds_count
  • rest_client_requests_total
  • rest_client_request_duration_seconds rest_client_request_duration_seconds_bucket rest_client_request_duration_seconds_sum rest_client_request_duration_seconds_count
  • process_resident_memory_bytes
  • process_cpu_seconds_total
  • go_goroutines
  • kubelet_volume_stats_capacity_bytes
  • kubelet_volume_stats_available_bytes
  • kubelet_volume_stats_inodes_used
  • kubelet_volume_stats_inodes
  • kubernetes_build_info"

nodexporter (job=node)

  • node_cpu_seconds_total
  • node_memory_MemAvailable_bytes
  • node_memory_Buffers_bytes
  • node_memory_Cached_bytes
  • node_memory_MemFree_bytes
  • node_memory_Slab_bytes
  • node_memory_MemTotal_bytes
  • node_netstat_Tcp_RetransSegs
  • node_netstat_Tcp_OutSegs
  • node_netstat_TcpExt_TCPSynRetrans
  • node_load1``node_load5
  • node_load15
  • node_disk_read_bytes_total
  • node_disk_written_bytes_total
  • node_disk_io_time_seconds_total
  • node_filesystem_size_bytes
  • node_filesystem_avail_bytes
  • node_filesystem_readonly
  • node_network_receive_bytes_total
  • node_network_transmit_bytes_total
  • node_vmstat_pgmajfault
  • node_network_receive_drop_total
  • node_network_transmit_drop_total
  • node_disk_io_time_weighted_seconds_total
  • node_exporter_build_info
  • node_time_seconds
  • node_uname_info"

kube-state-metrics (job=kube-state-metrics)

  • kube_job_status_succeeded
  • kube_job_spec_completions
  • kube_daemonset_status_desired_number_scheduled
  • kube_daemonset_status_number_ready
  • kube_deployment_status_replicas_ready
  • kube_pod_container_status_last_terminated_reason
  • kube_pod_container_status_waiting_reason
  • kube_pod_container_status_restarts_total
  • kube_node_status_allocatable
  • kube_pod_owner
  • kube_pod_container_resource_requests
  • kube_pod_status_phase
  • kube_pod_container_resource_limits
  • kube_replicaset_owner
  • kube_resourcequota
  • kube_namespace_status_phase
  • kube_node_status_capacity
  • kube_node_info
  • kube_pod_info
  • kube_deployment_spec_replicas
  • kube_deployment_status_replicas_available
  • kube_deployment_status_replicas_updated
  • kube_statefulset_status_replicas_ready
  • kube_statefulset_status_replicas
  • kube_statefulset_status_replicas_updated
  • kube_job_status_start_time
  • kube_job_status_active
  • kube_job_failed
  • kube_horizontalpodautoscaler_status_desired_replicas
  • kube_horizontalpodautoscaler_status_current_replicas
  • kube_horizontalpodautoscaler_spec_min_replicas
  • kube_horizontalpodautoscaler_spec_max_replicas
  • kubernetes_build_info
  • kube_node_status_condition
  • kube_node_spec_taint
  • kube_pod_container_info
  • kube_resource_labels (ex - kube_pod_labels, kube_deployment_labels)
  • kube_resource_annotations (ex - kube_pod_annotations, kube_deployment_annotations)

controlplane-apiserver (job=controlplane-apiserver)

  • apiserver_request_total
  • apiserver_cache_list_fetched_objects_total
  • apiserver_cache_list_returned_objects_total
  • apiserver_flowcontrol_demand_seats_average
  • apiserver_flowcontrol_current_limit_seats
  • apiserver_request_sli_duration_seconds_bucket
  • apiserver_request_sli_duration_seconds_count
  • apiserver_request_sli_duration_seconds_sum
  • process_start_time_seconds
  • apiserver_request_duration_seconds_bucket
  • apiserver_request_duration_seconds_count
  • apiserver_request_duration_seconds_sum
  • apiserver_storage_list_fetched_objects_total
  • apiserver_storage_list_returned_objects_total
  • apiserver_current_inflight_requests

controlplane-etcd (job=controlplane-etcd)

  • etcd_server_has_leader
  • rest_client_requests_total
  • etcd_mvcc_db_total_size_in_bytes
  • etcd_mvcc_db_total_size_in_use_in_bytes
  • etcd_server_slow_read_indexes_total
  • etcd_server_slow_apply_total
  • etcd_network_client_grpc_sent_bytes_total
  • etcd_server_heartbeat_send_failures_total

networkobservabilityHubble (job=networkobservabilityHubble)networkobservabilityCilium (job=networkobservabilityCilium)

默认抓取的 Windows 目标

以下 Windows 目标配置为抓取,但默认情况下抓取未启用(已禁用/关闭),这意味着你不必提供任何抓取作业配置来抓取这些目标,但它们默认处于禁用/关闭状态,你需要使用 default-scrape-settings-enabled 部分下的 ama-metrics-settings-configmap 为这些目标开启/启用抓取

可以针对 Windows 运行两个默认作业,这些作业将抓取特定于 Windows 的仪表板所需的指标。

  • windows-exporter (job=windows-exporter)
  • kube-proxy-windows (job=kube-proxy-windows)

注意

这需要应用或更新 ama-metrics-settings-configmap configmap 并在所有 Windows 节点上安装 windows-exporter。 有关详细信息,请参阅启用文档

为 Windows 抓取的指标

当启用 windows-exporter 和 kube-proxy-windows 时,将收集以下指标。

windows-exporter (job=windows-exporter)

  • windows_system_system_up_time
  • windows_cpu_time_total
  • windows_memory_available_bytes
  • windows_os_visible_memory_bytes
  • windows_memory_cache_bytes
  • windows_memory_modified_page_list_bytes
  • windows_memory_standby_cache_core_bytes
  • windows_memory_standby_cache_normal_priority_bytes
  • windows_memory_standby_cache_reserve_bytes
  • windows_memory_swap_page_operations_total
  • windows_logical_disk_read_seconds_total
  • windows_logical_disk_write_seconds_total
  • windows_logical_disk_size_bytes
  • windows_logical_disk_free_bytes
  • windows_net_bytes_total
  • windows_net_packets_received_discarded_total
  • windows_net_packets_outbound_discarded_total
  • windows_container_available
  • windows_container_cpu_usage_seconds_total
  • windows_container_memory_usage_commit_bytes
  • windows_container_memory_usage_private_working_set_bytes
  • windows_container_network_receive_bytes_total
  • windows_container_network_transmit_bytes_total

kube-proxy-windows (job=kube-proxy-windows)

  • kubeproxy_sync_proxy_rules_duration_seconds
  • kubeproxy_sync_proxy_rules_duration_seconds_bucket
  • kubeproxy_sync_proxy_rules_duration_seconds_sum
  • kubeproxy_sync_proxy_rules_duration_seconds_count
  • rest_client_requests_total
  • rest_client_request_duration_seconds
  • rest_client_request_duration_seconds_bucket
  • rest_client_request_duration_seconds_sum
  • rest_client_request_duration_seconds_count
  • process_resident_memory_bytes
  • process_cpu_seconds_total
  • go_goroutines

仪表板

当你将 Azure Monitor 工作区链接到 Azure 托管 Grafana 实例时,适用于 Prometheus 的 Azure Monitor 托管服务将自动预配和配置以下默认仪表板。 这些仪表板的源代码可在此 GitHub 存储库中找到。 以下仪表板将在 Grafana 中 Managed Prometheus 文件夹下的指定 Azure Grafana 实例中预配。 这些是标准开放源代码社区仪表板,可通过 Prometheus 和 Grafana 监视 Kubernetes 群集。

  • Kubernetes / Compute Resources / Cluster
  • Kubernetes / Compute Resources / Namespace (Pods)
  • Kubernetes / Compute Resources / Node (Pods)
  • Kubernetes / Compute Resources / Pod
  • Kubernetes / Compute Resources / Namespace (Workloads)
  • Kubernetes / Compute Resources / Workload
  • Kubernetes / Kubelet
  • Node Exporter / USE Method / Node
  • Node Exporter / Nodes
  • Kubernetes / Compute Resources / Cluster (Windows)
  • Kubernetes / Compute Resources / Namespace (Windows)
  • Kubernetes / Compute Resources / Pod (Windows)
  • Kubernetes / USE Method / Cluster (Windows)
  • Kubernetes / USE Method / Node (Windows)

记录规则

当你配置要从 Azure Kubernetes 服务 (AKS) 群集中抓取的 Prometheus 指标时,适用于 Prometheus 的 Azure Monitor 托管服务会自动配置以下默认记录规则。 这些记录规则的源代码可在此 GitHub 存储库中找到。 这些是上述仪表板中使用的标准开放源代码记录规则。

  • cluster:node_cpu:ratio_rate5m
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • :node_memory_MemAvailable_bytes:sum
  • namespace_memory:kube_pod_container_resource_requests:sum
  • namespace_memory:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • instance:node_cpu_utilisation:rate5m
  • instance:node_load1_per_cpu:ratio
  • instance:node_memory_utilisation:ratio
  • instance:node_vmstat_pgmajfault:rate5m
  • instance:node_network_receive_bytes_excluding_lo:rate5m
  • instance:node_network_transmit_bytes_excluding_lo:rate5m
  • instance:node_network_receive_drop_excluding_lo:rate5m
  • instance:node_network_transmit_drop_excluding_lo:rate5m
  • instance_device:node_disk_io_time_seconds:rate5m
  • instance_device:node_disk_io_time_weighted_seconds:rate5m
  • instance:node_num_cpu:sum
  • node:windows_node:sum
  • node:windows_node_num_cpu:sum
  • :windows_node_cpu_utilisation:avg5m
  • node:windows_node_cpu_utilisation:avg5m
  • :windows_node_memory_utilisation:
  • :windows_node_memory_MemFreeCached_bytes:sum
  • node:windows_node_memory_totalCached_bytes:sum
  • :windows_node_memory_MemTotal_bytes:sum
  • node:windows_node_memory_bytes_available:sum
  • node:windows_node_memory_bytes_total:sum
  • node:windows_node_memory_utilisation:ratio
  • node:windows_node_memory_utilisation:
  • node:windows_node_memory_swap_io_pages:irate
  • :windows_node_disk_utilisation:avg_irate
  • node:windows_node_disk_utilisation:avg_irate
  • node:windows_node_filesystem_usage:
  • node:windows_node_filesystem_avail:
  • :windows_node_net_utilisation:sum_irate
  • node:windows_node_net_utilisation:sum_irate
  • :windows_node_net_saturation:sum_irate
  • node:windows_node_net_saturation:sum_irate
  • windows_pod_container_available
  • windows_container_total_runtime
  • windows_container_memory_usage
  • windows_container_private_working_set_usage
  • windows_container_network_received_bytes_total
  • windows_container_network_transmitted_bytes_total
  • kube_pod_windows_container_resource_memory_request
  • kube_pod_windows_container_resource_memory_limit
  • kube_pod_windows_container_resource_cpu_cores_request
  • kube_pod_windows_container_resource_cpu_cores_limit
  • namespace_pod_container:windows_container_cpu_usage_seconds_total:sum_rate

Prometheus 可视化效果记录规则

当使用基于 Prometheus 的容器见解时,将部署更多记录规则以支持 Prometheus 可视化效果。

  • ux:cluster_pod_phase_count:sum
  • ux:node_cpu_usage:sum_irate
  • ux:node_memory_usage:sum
  • ux:controller_pod_phase_count:sum
  • ux:controller_container_count:sum
  • ux:controller_workingset_memory:sum
  • ux:controller_cpu_usage:sum_irate
  • ux:controller_rss_memory:sum
  • ux:controller_resource_limit:sum
  • ux:controller_container_restarts:max
  • ux:pod_container_count:sum
  • ux:pod_cpu_usage:sum_irate
  • ux:pod_workingset_memory:sum
  • ux:pod_rss_memory:sum
  • ux:pod_resource_limit:sum
  • ux:pod_container_restarts:max
  • ux:node_network_receive_drop_total:sum_irate
  • ux:node_network_transmit_drop_total:sum_irate

Windows 支持需要以下记录规则。 这些规则与上述规则一起部署,但默认情况下并未启用。

  • ux:node_cpu_usage_windows:sum_irate
  • ux:node_memory_usage_windows:sum
  • ux:controller_cpu_usage_windows:sum_irate
  • ux:controller_workingset_memory_windows:sum
  • ux:pod_cpu_usage_windows:sum_irate
  • ux:pod_workingset_memory_windows:sum

后续步骤

自定义 Prometheus 指标抓取