Monitoring and observability play a key role in maintaining high performance and low cost of your AI workload deployments on AKS. Visibility into system and performance metrics can indicate the limits of your underlying infrastructure and motivate real-time adjustments and optimizations to reduce workload interruptions. Monitoring also provides valuable insights into resource utilization, enabling cost-effective management of computational resources and avoiding over-provisioning or under-provisioning.
The AI toolchain operator (KAITO) is a managed add-on for AKS that simplifies the deployment and operation of AI models on your AKS cluster. Starting with KAITO version 0.4.4, the vLLM inference runtime is enabled by default in the AKS managed add-on. vLLM surfaces key system performance, resource usage, and request processing Prometheus metrics that you can use to evaluate your KAITO inference deployments.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
In this article, you'll learn how to monitor and visualize vLLM inference metrics using the AI toolchain operator add-on (preview) with Azure Managed Prometheus and Azure Managed Grafana on your AKS cluster.
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- Azure CLI version 2.47.0 or later installed and configured. Run az --version to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
Prerequisites
- The Kubernetes command-line client, kubectl, installed and configured. For more information, see Install kubectl.
- Enable the AI toolchain operator add-on on your AKS cluster.
- If you already have the AI toolchain operator add-on enabled, update your AKS cluster to the latest version to run KAITO v0.4.4 or above.
- Enable Azure Managed Prometheus and Azure Managed Grafana on your AKS cluster. Example CLI commands for this step and the add-on enablement are sketched after this list.
- Sufficient permission to create and/or update Grafana instances in your Azure subscription.
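The following commands are a minimal sketch of how these prerequisites are typically enabled with the Azure CLI. The flag names are assumptions that can vary across CLI and extension versions, so treat the linked enablement guides as the source of truth.

# Enable the AI toolchain operator add-on (preview) on an existing cluster; the add-on requires the OIDC issuer
az aks update --name <cluster-name> --resource-group <resource-group> --enable-ai-toolchain-operator --enable-oidc-issuer

# Enable Azure Managed Prometheus and link an Azure Managed Grafana instance
az aks update --name <cluster-name> --resource-group <resource-group> --enable-azure-monitor-metrics --azure-monitor-workspace-resource-id <azure-monitor-workspace-resource-id> --grafana-resource-id <grafana-resource-id>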
Deploy a KAITO inference service
In this example, you collect metrics for the Qwen-2.5-coder-7B-instruct language model. Start by applying the following KAITO workspace custom resource to your cluster:
kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_qwen_2.5_coder_7b-instruct.yaml
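For orientation, the manifest in that file defines a KAITO Workspace resource along the lines of the following sketch. The API version, GPU instance type, and label selector shown here are illustrative assumptions; the file in the KAITO repository is the source of truth.

apiVersion: kaito.sh/v1beta1   # API version is an assumption; check the upstream file
kind: Workspace
metadata:
  name: workspace-qwen-2-5-coder-7b-instruct
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # illustrative GPU SKU
  labelSelector:
    matchLabels:
      apps: qwen-2-5-coder
inference:
  preset:
    name: qwen2.5-coder-7b-instruct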
Track the live resource changes in your KAITO workspace using the kubectl get command:
kubectl get workspace workspace-qwen-2-5-coder-7b-instruct -w
Note
As you track the KAITO inference service deployment, note that machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes depending on the size of your model.
Confirm that your inference service is running and get its cluster IP address using the kubectl get svc command:
export SERVICE_IP=$(kubectl get svc workspace-qwen-2-5-coder-7b-instruct -o jsonpath='{.spec.clusterIP}')
echo $SERVICE_IP
Surface KAITO inference metrics to Azure Managed Prometheus
Prometheus metrics are collected by default at the KAITO /metrics endpoint.
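Optionally, you can spot-check the endpoint before configuring scraping. This sketch assumes the workspace service serves HTTP on port 80 and reuses the SERVICE_IP variable exported earlier; the curlimages/curl image is just one convenient way to run curl inside the cluster:

# Run a temporary in-cluster pod and print the first lines of the raw vLLM metrics
kubectl run metrics-check --rm -i --restart=Never --image=curlimages/curl -- curl -s http://$SERVICE_IP/metrics | head -n 20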
Add the following label to your KAITO inference service so that it can be detected by a Kubernetes ServiceMonitor deployment:
kubectl label svc workspace-qwen-2-5-coder-7b-instruct App=qwen-2-5-coder
Create a ServiceMonitor resource to define the inference service endpoints and configurations needed to scrape the vLLM Prometheus metrics. Export these metrics to Azure Managed Prometheus by deploying the following ServiceMonitor YAML manifest in the kube-system namespace:
cat <<EOF | kubectl apply -n kube-system -f -
apiVersion: azmonitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-kaito-monitor
spec:
  selector:
    matchLabels:
      App: qwen-2-5-coder
  endpoints:
  - port: http
    interval: 30s
    path: /metrics
    scheme: http
EOF
You should see the following output once the ServiceMonitor is created:
servicemonitor.azmonitoring.coreos.com/prometheus-kaito-monitor created
Confirm that your ServiceMonitor deployment is running successfully using the kubectl get command:
kubectl get servicemonitor prometheus-kaito-monitor -n kube-system
Confirm that vLLM metrics are successfully collected in Azure Managed Prometheus by navigating to the Prometheus explorer page under Managed Prometheus in your Azure Monitor workspace in the Azure portal.
Select the Grid tab and confirm that there's a metrics item associated with the job named workspace-qwen-2-5-coder-7b-instruct.
Note
The up value of this item should be 1, indicating that Prometheus metrics are successfully being scraped from your AI inference service endpoint.
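Beyond the up series, you can query individual vLLM metrics directly in the Prometheus explorer. The queries below are examples only; the metric names follow vLLM's naming convention and can differ between vLLM versions, so verify them against the output of the /metrics endpoint.

# Requests currently being processed by the inference server
vllm:num_requests_running{job="workspace-qwen-2-5-coder-7b-instruct"}

# Prompt tokens processed per second, averaged over the last 5 minutes
rate(vllm:prompt_tokens_total{job="workspace-qwen-2-5-coder-7b-instruct"}[5m])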
Visualize KAITO inference metrics in Azure Managed Grafana
The vLLM project provides a Grafana dashboard configuration named grafana.json for inference workload monitoring. Navigate to the bottom of this page and copy the entire contents of the grafana.json file.
Follow these steps to import the Grafana configurations into a new dashboard in Azure Managed Grafana. A CLI-based alternative is sketched after these steps.
Navigate to your Managed Grafana endpoint, view the available dashboards, and select the new dashboard named vLLM.
To begin collecting data for your selected model deployment, confirm that the datasource shown at the top left of the Grafana dashboard is the Azure Managed Prometheus instance created for this example.
Copy the inference preset name defined in your KAITO workspace into the model_name field in the Grafana dashboard. For this example, the model name is qwen2.5-coder-7b-instruct.
In a few moments, the metrics for your KAITO inference service populate in the vLLM Grafana dashboard.
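As an alternative to the portal import flow, the Azure Managed Grafana CLI extension can import a dashboard definition from a file. The following is a sketch that assumes the amg extension and that you saved the copied dashboard JSON locally as grafana.json:

# Install the Azure Managed Grafana CLI extension
az extension add --name amg

# Import the saved vLLM dashboard definition into your Managed Grafana instance
az grafana dashboard import --name <grafana-instance-name> --resource-group <resource-group> --definition ./grafana.json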
Note
The value of these inference metrics remains 0 until requests are submitted to the model inference server.
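To generate traffic, you can send a test request to the inference service. This minimal sketch assumes the KAITO vLLM runtime exposes an OpenAI-compatible chat completions endpoint on port 80 of the workspace service; the exact path and payload can vary by KAITO and vLLM version:

# Send a single chat completion request from a temporary in-cluster pod
kubectl run chat-test --rm -i --restart=Never --image=curlimages/curl -- curl -s http://$SERVICE_IP/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen2.5-coder-7b-instruct", "messages": [{"role": "user", "content": "Write a bubble sort function in Python."}]}'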
Next steps
- Monitor and visualize your AKS deployments at scale.
- Learn about AKS GPU workload deployment options on Linux.