Azure Kubernetes Service monitoring data reference

This article contains all the monitoring reference information for this service.

See Monitor Azure Kubernetes Service (AKS) for details on the data you can collect for AKS and how to use it.

Metrics

This section lists all the automatically collected platform metrics for this service.

For information on metric retention, see Azure Monitor Metrics overview.

Minimal ingestion for default ON targets

The following metrics are allow-listed with minimalingestionprofile=true for default ON targets. Because these targets are scraped by default, these metrics are collected by default.

controlplane-apiserver:

  • apiserver_request_total
  • apiserver_cache_list_fetched_objects_total
  • apiserver_cache_list_returned_objects_total
  • apiserver_flowcontrol_demand_seats_average
  • apiserver_flowcontrol_current_limit_seats
  • apiserver_request_sli_duration_seconds_bucket
  • apiserver_request_sli_duration_seconds_sum
  • apiserver_request_sli_duration_seconds_count
  • process_start_time_seconds
  • apiserver_request_duration_seconds_bucket
  • apiserver_request_duration_seconds_sum
  • apiserver_request_duration_seconds_count
  • apiserver_storage_list_fetched_objects_total
  • apiserver_storage_list_returned_objects_total
  • apiserver_current_inflight_requests

Note

As of a recent release, apiserver_request_sli_duration_seconds_bucket and apiserver_request_duration_seconds_bucket are no longer collected by default. These are high-cardinality metrics that can increase the number of stored metrics, depending on the number of custom resources in the cluster. To collect these bucket metrics, add them to the keep list. We strongly recommend that you don't turn off the minimal ingestion profile for the control plane components.
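The keep-list step in the note above can be sketched in Python. This is a minimal illustration, not the service's implementation. It assumes that the keep list for a target is a pipe-separated regular expression and that it is written as a `<target-name> = "<regex>"` line in the settings configmap. The helper names build_keep_list and is_kept are hypothetical.

```python
import re

# The two bucket metrics named in the note above.
BUCKET_METRICS = [
    "apiserver_request_sli_duration_seconds_bucket",
    "apiserver_request_duration_seconds_bucket",
]


def build_keep_list(metric_names):
    """Join metric names into one pipe-separated alternation.

    Assumption: the keep list is interpreted as a regular expression.
    """
    return "|".join(metric_names)


def is_kept(metric_name, keep_list):
    """Return True if the whole metric name matches the keep-list regex."""
    return re.fullmatch(keep_list, metric_name) is not None


keep_list = build_keep_list(BUCKET_METRICS)

# Assumed line format for the configmap entry (hypothetical rendering).
print(f'controlplane-apiserver = "{keep_list}"')
```

Checking names with is_kept before applying the setting is a quick way to confirm that the expression matches only the intended bucket metrics and not, for example, apiserver_request_total.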

controlplane-etcd:

  • etcd_server_has_leader
  • rest_client_requests_total
  • etcd_mvcc_db_total_size_in_bytes
  • etcd_mvcc_db_total_size_in_use_in_bytes
  • etcd_server_slow_read_indexes_total
  • etcd_server_slow_apply_total
  • etcd_network_client_grpc_sent_bytes_total
  • etcd_server_heartbeat_send_failures_total

Minimal ingestion for default OFF targets

The following metrics are allow-listed with minimalingestionprofile=true for default OFF targets. These metrics aren't collected by default. To turn on scraping for a target, set default-scrape-settings-enabled.<target-name>=true in the default-scrape-settings-enabled section of the ama-metrics-settings-configmap.

controlplane-kube-controller-manager:

  • workqueue_depth
  • rest_client_requests_total
  • rest_client_request_duration_seconds

controlplane-kube-scheduler:

  • scheduler_pending_pods
  • scheduler_unschedulable_pods
  • scheduler_queue_incoming_pods_total
  • scheduler_schedule_attempts_total
  • scheduler_preemption_attempts_total

controlplane-cluster-autoscaler:

  • rest_client_requests_total
  • cluster_autoscaler_last_activity
  • cluster_autoscaler_cluster_safe_to_autoscale
  • cluster_autoscaler_failed_scale_ups_total
  • cluster_autoscaler_scale_down_in_cooldown
  • cluster_autoscaler_scaled_up_nodes_total
  • cluster_autoscaler_unneeded_nodes_count
  • cluster_autoscaler_unschedulable_pods_count
  • cluster_autoscaler_nodes_count
  • cloudprovider_azure_api_request_errors
  • cloudprovider_azure_api_request_duration_seconds_bucket
  • cloudprovider_azure_api_request_duration_seconds_count
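The default-scrape-settings-enabled setting described at the start of this section can be sketched in Python. This is a minimal illustration that renders the per-target entries for the three default OFF targets listed above. The `<target-name> = true|false` line format and the helper name render_scrape_settings are assumptions for illustration; check the configmap template for the exact syntax.

```python
# Default OFF control-plane targets listed in this section.
DEFAULT_OFF_TARGETS = [
    "controlplane-kube-controller-manager",
    "controlplane-kube-scheduler",
    "controlplane-cluster-autoscaler",
]


def render_scrape_settings(enabled_targets):
    """Render one '<target-name> = true|false' line per default OFF target.

    Raises ValueError for a target name that isn't in DEFAULT_OFF_TARGETS,
    so a typo doesn't silently leave a target disabled.
    """
    unknown = set(enabled_targets) - set(DEFAULT_OFF_TARGETS)
    if unknown:
        raise ValueError(f"Unknown targets: {sorted(unknown)}")
    return "\n".join(
        f"{target} = {'true' if target in enabled_targets else 'false'}"
        for target in DEFAULT_OFF_TARGETS
    )


# Example: turn on scraping for the scheduler only.
settings = render_scrape_settings({"controlplane-kube-scheduler"})
print(settings)
```

Rendering every target explicitly, including the ones left off, keeps the resulting section easy to review.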

Note

The CPU and memory usage metrics for all control-plane targets aren't exposed, regardless of the profile.

Metric dimensions

For information about what metric dimensions are, see Multi-dimensional metrics.

This service has the following dimensions associated with its metrics.

Dimension Name  Description
requestKind     Used by metrics such as Inflight Requests to split by type of request.
condition       Used by metrics such as Statuses for various node conditions and Number of pods in Ready state to split by condition type.
status          Used by metrics such as Statuses for various node conditions to split by status of the condition.
status2         Used by metrics such as Statuses for various node conditions to split by status of the condition.
node            Used by metrics such as CPU Usage Millicores to split by the name of the node.
phase           Used by metrics such as Number of pods by phase to split by the phase of the pod.
namespace       Used by metrics such as Number of pods by phase to split by the namespace of the pod.
pod             Used by metrics such as Number of pods by phase to split by the name of the pod.
nodepool        Used by metrics such as Disk Used Bytes to split by the name of the nodepool.
device          Used by metrics such as Disk Used Bytes to split by the name of the device.
3gppGen         Used by metrics such as Number of Active PDU Sessions.
Cause           Used by metrics such as User plane packet drop rate.
Direction       Used by metrics such as User plane bandwidth.
Dnn             Used by metrics such as PDU session establishment attempts rate.
Interface       Used by metrics such as User plane bandwidth.
LUN             Used by metrics such as Percentage of data disk bandwidth consumed.
PccpId          Used by metrics such as Number of Active PDU Sessions.
Result          Used by metrics such as Authentication failure rate.
SiteId          Used by metrics such as Number of Active PDU Sessions.
Tai             Used by metrics such as Service request failure rate.
VMName          Used by metrics such as Amount of physical memory.
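Splitting a metric by a dimension, as the table describes, amounts to grouping samples by that dimension's value and aggregating each group. The following Python sketch shows the idea for a metric like Number of pods by phase. The sample values and the helper name split_by are hypothetical and used only for illustration; they aren't real cluster data.

```python
from collections import defaultdict

# Hypothetical samples of a pod-count metric with two dimensions.
samples = [
    {"phase": "Running", "namespace": "default", "value": 3},
    {"phase": "Running", "namespace": "kube-system", "value": 5},
    {"phase": "Pending", "namespace": "default", "value": 1},
    {"phase": "Failed", "namespace": "kube-system", "value": 2},
]


def split_by(metric_samples, dimension):
    """Sum sample values per distinct value of the given dimension."""
    totals = defaultdict(int)
    for sample in metric_samples:
        totals[sample[dimension]] += sample["value"]
    return dict(totals)


by_phase = split_by(samples, "phase")
by_namespace = split_by(samples, "namespace")
print(by_phase)
print(by_namespace)
```

The same samples produce different breakdowns depending on the dimension chosen, which is why a metric that carries several dimensions can answer several questions.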

Resource logs

This section lists the types of resource logs you can collect for this service. The section pulls from the list of all resource logs category types supported in Azure Monitor.

Activity log

The following table lists a few example operations related to AKS that might be created in the Activity log. Use the Activity log to track information such as when a cluster is created or when its configuration changes. You can view this information in the portal or by using other methods. You can also create an Activity log alert to be notified proactively when an event occurs.

Operation Description
Microsoft.ContainerService/managedClusters/write Create or update managed cluster
Microsoft.ContainerService/managedClusters/delete Delete Managed Cluster
Microsoft.ContainerService/managedClusters/listClusterMonitoringUserCredential/action List clusterMonitoringUser credential
Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action List clusterAdmin credential
Microsoft.ContainerService/managedClusters/agentpools/write Create or Update Agent Pool
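
As one way to use the operations in the table, the following Python sketch filters a list of Activity log events for the two credential-listing operations. It assumes that the events have already been exported as dictionaries with operationName and caller fields. The sample events, the caller values, and the helper name credential_access_events are hypothetical. The comparison ignores case as a defensive assumption, in case exported records use different casing.

```python
# Credential-listing operations taken from the table above.
CREDENTIAL_OPERATIONS = {
    "Microsoft.ContainerService/managedClusters/listClusterMonitoringUserCredential/action",
    "Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action",
}


def credential_access_events(events):
    """Return the events whose operation lists cluster credentials."""
    wanted = {operation.lower() for operation in CREDENTIAL_OPERATIONS}
    return [
        event for event in events
        if event["operationName"].lower() in wanted
    ]


# Hypothetical exported events, for illustration only.
events = [
    {"operationName": "Microsoft.ContainerService/managedClusters/write",
     "caller": "user-a@example.com"},
    {"operationName": "MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/LISTCLUSTERADMINCREDENTIAL/ACTION",
     "caller": "user-b@example.com"},
    {"operationName": "Microsoft.ContainerService/managedClusters/agentpools/write",
     "caller": "user-a@example.com"},
]

matches = credential_access_events(events)
for event in matches:
    print(event["caller"], event["operationName"])
```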