Monitor and visualize AI inference metrics on Azure Kubernetes Service (AKS) with the AI toolchain operator (Preview)

Monitoring and observability play a key role in maintaining high performance and low cost of your AI workload deployments on AKS. Visibility into system and performance metrics can indicate the limits of your underlying infrastructure and motivate real-time adjustments and optimizations to reduce workload interruptions. Monitoring also provides valuable insights into resource utilization, enabling cost-effective management of computational resources and avoiding over-provisioning or under-provisioning.

The AI toolchain operator (KAITO) is a managed add-on for AKS that simplifies the deployment and operations for AI models on your AKS cluster. Starting with KAITO version 0.4.4, the vLLM inference runtime is enabled by default in the AKS managed add-on. vLLM surfaces key system performance, resource usage, and request processing Prometheus metrics that can be used to evaluate your KAITO inference deployments.

Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:

In this article, you'll learn how to monitor and visualize vLLM inference metrics using the AI toolchain operator add-on (preview) with Azure Managed Prometheus and Azure Managed Grafana on your AKS cluster.

Before you begin

  • This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
  • Azure CLI version 2.47.0 or later installed and configured. Run az --version to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
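
If you need to create a cluster, a minimal Azure CLI sketch might look like the following. The resource group, cluster, and region names are placeholders, and the add-on flag may require the aks-preview CLI extension.

    # Placeholder names; adjust the names and region for your environment.
    export RESOURCE_GROUP="myResourceGroup"
    export CLUSTER_NAME="myAKSCluster"

    az group create --name $RESOURCE_GROUP --location eastus

    # --enable-ai-toolchain-operator enables the KAITO managed add-on (preview).
    # --enable-azure-monitor-metrics onboards the cluster to Azure Managed Prometheus.
    az aks create \
        --resource-group $RESOURCE_GROUP \
        --name $CLUSTER_NAME \
        --enable-oidc-issuer \
        --enable-ai-toolchain-operator \
        --enable-azure-monitor-metrics \
        --generate-ssh-keys

    az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME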

Prerequisites

  • The AI toolchain operator (KAITO) add-on enabled on your AKS cluster, with KAITO version 0.4.4 or later so that the vLLM inference runtime is used.
  • Azure Managed Prometheus (an Azure Monitor workspace) and Azure Managed Grafana enabled for your cluster.

Deploy a KAITO inference service

  1. In this example, you collect metrics for the Qwen-2.5-coder-7B-instruct language model. Start by applying the following KAITO workspace custom resource on your cluster:

    kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_qwen_2.5_coder_7b-instruct.yaml
    
  2. Track the live resource changes in your KAITO workspace using the kubectl get command.

    kubectl get workspace workspace-qwen-2-5-coder-7b-instruct -w
    

    Note

    As you track the KAITO inference service deployment, note that machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes depending on the size of your model.
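
    If the workspace appears to be stuck, you can inspect its status conditions and recent events with kubectl describe (the exact condition names vary by KAITO version):

    kubectl describe workspace workspace-qwen-2-5-coder-7b-instruct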

  3. Confirm that your inference service is running and get the service IP address using the kubectl get svc command.

    export SERVICE_IP=$(kubectl get svc workspace-qwen-2-5-coder-7b-instruct -o jsonpath='{.spec.clusterIP}')
    
    echo $SERVICE_IP
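
    Optionally, send a test request to confirm that the endpoint responds before you wire up monitoring. The following is a sketch that assumes the vLLM runtime's OpenAI-compatible API is served on port 80 of the workspace service and that the model is registered as qwen2.5-coder-7b-instruct (the name used later in this article); it runs a temporary curl pod because the cluster IP is only reachable from inside the cluster:

    # Send a sample chat completion request to the KAITO inference service.
    kubectl run -it --rm --restart=Never curl-test --image=curlimages/curl --command -- \
      curl -s http://$SERVICE_IP/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen2.5-coder-7b-instruct", "messages": [{"role": "user", "content": "Write a hello world function in Python."}], "max_tokens": 64}'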
    

Surface KAITO inference metrics to Azure Managed Prometheus

Prometheus metrics are collected by default at the KAITO /metrics endpoint.
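
To see the raw metrics before connecting Prometheus, you can port-forward the workspace service and query the endpoint directly. This is a quick sketch; it assumes the metrics are served on the same port as the inference API, and the exact metric names (for example, vllm:num_requests_running or vllm:gpu_cache_usage_perc) vary by vLLM version.

    # Forward a local port to the workspace service (service port 80).
    kubectl port-forward svc/workspace-qwen-2-5-coder-7b-instruct 8080:80 &
    sleep 5  # give the port-forward a moment to establish

    # List the vLLM metric samples exposed at the /metrics endpoint.
    curl -s http://localhost:8080/metrics | grep "^vllm"

    # Stop the background port-forward when finished: kill %1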

  1. Add the following label to your KAITO inference service so that it can be selected by a Kubernetes ServiceMonitor resource:

    kubectl label svc workspace-qwen-2-5-coder-7b-instruct App=qwen-2-5-coder 
    
  2. Create a ServiceMonitor resource to define the inference service endpoints and configurations needed to scrape the vLLM Prometheus metrics. Export these metrics to Azure Managed Prometheus by deploying the following ServiceMonitor YAML manifest in the kube-system namespace:

    cat <<EOF | kubectl apply -n kube-system -f -
    apiVersion: azmonitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: prometheus-kaito-monitor
    spec:
      selector:
        matchLabels:
          App: qwen-2-5-coder
      endpoints:
      - port: http
        interval: 30s
        path: /metrics
        scheme: http
    EOF
    

    You should see the following output once the ServiceMonitor is created:

    servicemonitor.azmonitoring.coreos.com/prometheus-kaito-monitor created
    
  3. Confirm that your ServiceMonitor deployment is running successfully using the kubectl get command.

    kubectl get servicemonitor prometheus-kaito-monitor -n kube-system
    
  4. Confirm that vLLM metrics are successfully collected in Azure Managed Prometheus. In the Azure portal, navigate to the Prometheus explorer page under Managed Prometheus in your Azure Monitor workspace.

  5. Select the Grid tab and confirm that there's a metrics item associated with the job named workspace-qwen-2-5-coder-7b-instruct.

    Note

    The up value of this item should be 1, indicating that Prometheus metrics are successfully being scraped from your AI inference service endpoint.
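
If the job doesn't appear or its up value stays 0, you can also check the managed Prometheus agent's scrape targets from inside the cluster. In the Prometheus explorer, the query up{job="workspace-qwen-2-5-coder-7b-instruct"} should return 1 once scraping succeeds. The following troubleshooting sketch assumes the Azure Monitor metrics agent pods are named ama-metrics-* and expose their Prometheus UI on port 9090, which can vary by add-on version:

    # Find the Azure Monitor metrics agent replica pods in kube-system.
    kubectl get pods -n kube-system | grep ama-metrics

    # Port-forward one of the ama-metrics pods (substitute the pod name), then
    # browse http://localhost:9090/targets to verify the KAITO scrape target is healthy.
    kubectl port-forward -n kube-system <ama-metrics-pod-name> 9090:9090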

Visualize KAITO inference metrics in Azure Managed Grafana

  1. The vLLM project provides a Grafana dashboard configuration named grafana.json for inference workload monitoring. Copy the entire contents of the grafana.json file from the vLLM project's Grafana dashboard example.

    Screenshot of vLLM Grafana dashboard configuration.

  2. Follow these steps to import the Grafana configurations into a new dashboard in Azure Managed Grafana.
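
    Alternatively, you can import the copied dashboard definition with the Azure CLI. This sketch assumes the Azure Managed Grafana (amg) CLI extension is installed and that you saved the copied contents locally as grafana.json; the instance and resource group names are placeholders:

    # Import the vLLM dashboard definition into your Azure Managed Grafana instance.
    az grafana dashboard import --name <your-grafana-instance> --resource-group <your-resource-group> --definition @grafana.json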

  3. Navigate to your Managed Grafana endpoint, view the available dashboards, and select the new dashboard named vLLM.

    Screenshot of available dashboards in Azure Managed Grafana.

  4. To begin collecting data for your selected model deployment, confirm that the data source shown at the top left of the Grafana dashboard is the Azure Managed Prometheus instance you created for this example.

  5. Copy the inference preset name defined in your KAITO workspace into the model_name field in the Grafana dashboard. For this example, the model name is qwen2.5-coder-7b-instruct.
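
    If you're unsure of the exact name the server registered, you can query the vLLM runtime's OpenAI-compatible /v1/models endpoint. This sketch reuses the SERVICE_IP variable and the temporary curl pod approach from earlier in this article:

    # List the model names served by the KAITO inference endpoint.
    kubectl run -it --rm --restart=Never curl-models --image=curlimages/curl --command -- \
      curl -s http://$SERVICE_IP/v1/models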

  6. In a few moments, the metrics for your KAITO inference service will populate in the vLLM Grafana dashboard.

    Screenshot of vLLM Grafana dashboard for example inference service deployment.

    Note

    The value of these inference metrics remains 0 until requests are submitted to the model inference server. To start populating the dashboard, you can submit a few requests, for example by repeating the test request shown earlier in this article.

Next steps