What is Azure Kubernetes Service (AKS) Network Observability?

Kubernetes is a powerful tool for managing containerized applications. As containerized environments grow in complexity, it can be difficult to identify and troubleshoot networking issues in a Kubernetes cluster.

Network observability is an important part of maintaining a healthy and performant Kubernetes cluster. By collecting and analyzing data about network traffic, you can gain insights into how your cluster is operating and identify potential problems before they cause outages or performance degradation.

Diagram of Network Observability components.

Overview of Network Observability add-on in AKS

Networking Observability add-on operates seamlessly on Non-Cilium and Cilium data-planes. It empowers customers with enterprise-grade capabilities for DevOps and SecOps. This solution offers a centralized way to monitor network issues in your cluster for cluster network administrators, cluster security administrators, and DevOps engineers.

When the Network Observability add-on is enabled, it allows for the collection and conversion of useful metrics into Prometheus format, which can then be visualized in Grafana.

  • Azure managed Prometheus and Grafana: A managed service provided by Azure, taking care of the infrastructure and maintenance of Prometheus and Grafana, allowing you to focus on configuring and visualizing your metrics.

  • Multi CNI Support: Network Observability add-on supports both Azure CNI and Kubenet network plugins.

Metrics

Network Observability add-on currently only supports node level metrics. Cilium and Non-Cilium dataplanes have different metrics, yet the Grafana dashboard works for seamlessly for both.

All metrics have the following labels:

  • cluster
  • instance (Node name)

On Non-Cilium dataplane, the Network Observability add-on provides metrics in both Linux and Windows platforms. The below table outlines the different metrics generated.

Metric Name Description Extra Labels Linux Windows
networkobservability_forward_count Total forwarded packet count direction
networkobservability_forward_bytes Total forwarded byte count direction
networkobservability_drop_count Total dropped packet count direction, reason
networkobservability_drop_bytes Total dropped byte count direction, reason
networkobservability_tcp_state TCP currently active socket count by TCP state. state
networkobservability_tcp_connection_remote TCP currently active socket count by remote IP/port. address (IP), port
networkobservability_tcp_connection_stats TCP connection statistics. (ex: Delayed ACKs, TCPKeepAlive, TCPSackFailures) statistic
networkobservability_tcp_flag_counters TCP packets count by flag. flag
networkobservability_ip_connection_stats IP connection statistics. statistic
networkobservability_udp_connection_stats UDP connection statistics. statistic
networkobservability_udp_active_sockets UDP currently active socket count
networkobservability_interface_stats Interface statistics. InterfaceName, statistic

Limitations

  • Pod level metrics aren't supported.

Scale

Certain scale limitations apply when you use Azure managed Prometheus and Grafana.

Next steps