Troubleshoot platform issues for Azure Arc-enabled Kubernetes clusters
This document provides troubleshooting guides for issues with Azure Arc-enabled Kubernetes connectivity, permissions, and agents. It also provides troubleshooting guides for Azure GitOps, which can be used in either Azure Arc-enabled Kubernetes or Azure Kubernetes Service (AKS) clusters.
Azure CLI
Before using az connectedk8s
or az k8s-configuration
CLI commands, ensure that Azure CLI is set to work against the correct Azure subscription.
az account set --subscription 'subscriptionId'
az account show
Azure Arc agents
All agents for Azure Arc-enabled Kubernetes are deployed as pods in the azure-arc
namespace. All pods should be running and passing their health checks.
First, verify the Azure Arc Helm Chart release:
$ helm --namespace default status azure-arc
NAME: azure-arc
LAST DEPLOYED: Fri Apr 3 11:13:10 2020
NAMESPACE: default
STATUS: deployed
REVISION: 5
TEST SUITE: None
If the Helm Chart release isn't found or missing, try connecting the cluster to Azure Arc again.
If the Helm Chart release is present with STATUS: deployed
, check the status of the agents using kubectl
:
$ kubectl -n azure-arc get deployments,pods
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cluster-metadata-operator 1/1 1 1 3d19h
deployment.apps/clusterconnect-agent 1/1 1 1 3d19h
deployment.apps/clusteridentityoperator 1/1 1 1 3d19h
deployment.apps/config-agent 1/1 1 1 3d19h
deployment.apps/controller-manager 1/1 1 1 3d19h
deployment.apps/extension-events-collector 1/1 1 1 3d19h
deployment.apps/extension-manager 1/1 1 1 3d19h
deployment.apps/flux-logs-agent 1/1 1 1 3d19h
deployment.apps/kube-aad-proxy 1/1 1 1 3d19h
deployment.apps/metrics-agent 1/1 1 1 3d19h
deployment.apps/resource-sync-agent 1/1 1 1 3d19h
NAME READY STATUS RESTARTS AGE
pod/cluster-metadata-operator-74747b975-9phtz 2/2 Running 0 3d19h
pod/clusterconnect-agent-cf4c7849c-88fmf 3/3 Running 0 3d19h
pod/clusteridentityoperator-79bdfd945f-pt2rv 2/2 Running 0 3d19h
pod/config-agent-67bcb94b7c-d67t8 1/2 Running 0 3d19h
pod/controller-manager-559dd48b64-v6rmk 2/2 Running 0 3d19h
pod/extension-events-collector-85f4fbff69-55zmt 2/2 Running 0 3d19h
pod/extension-manager-7c7668446b-69gps 3/3 Running 0 3d19h
pod/flux-logs-agent-fc7c6c959-vgqvm 1/1 Running 0 3d19h
pod/kube-aad-proxy-84d668c44b-j457m 2/2 Running 0 3d19h
pod/metrics-agent-58fb8554df-5ll67 2/2 Running 0 3d19h
pod/resource-sync-agent-dbf5db848-c9lg8 2/2 Running 0 3d19h
All pods should show STATUS
as Running
with either 3/3
or 2/2
under the READY
column. Fetch logs and describe the pods returning an Error
or CrashLoopBackOff
. If any pods are stuck in Pending
state, there might be insufficient resources on cluster nodes. Scaling up your cluster can get these pods to transition to Running
state.
Resource provisioning failed/Service timeout error
If you see these errors, check Azure status to see if there are any active events impacting the status of the Azure Arc-enabled Kubernetes service. If so, wait until the service event has been resolved, then try onboarding again after deleting the existing connected cluster resource. If there are no service events, and you continue to face issues while onboarding, open a support ticket so we can investigate the problem.
Overage claims error
If you receive an overage claim, make sure that your service principal isn't part of more than 200 Microsoft Entra groups. If this is the case, you must create and use another service principal that isn't a member of more than 200 groups, or remove the original service principal from some of its groups and try again.
An overage claim may also occur if you have configured an outbound proxy environment without allowing the endpoint https://<region>.obo.arc.azure.cn:8084/
for outbound traffic.
If neither of these apply, open a support request so we can look into the issue.
Issues when connecting Kubernetes clusters to Azure Arc
Connecting clusters to Azure Arc requires access to an Azure subscription and cluster-admin
access to a target cluster. If you can't reach the cluster, or if you have insufficient permissions, connecting the cluster to Azure Arc will fail. Make sure you've met all of the prerequisites to connect a cluster.
Tip
For a visual guide to troubleshooting connection issues, see Diagnose connection issues for Arc-enabled Kubernetes clusters.
DNS resolution issues
Visit Debugging DNS Resolution for help resolving issues with DNS resolution on your cluster.
Outbound network connectivity issues
Issues with outbound network connectivity from the cluster may arise for different reasons. First make sure all of the network requirements have been met.
If you encounter connectivity issues, and your cluster is behind an outbound proxy server, make sure you've passed proxy parameters during the onboarding of your cluster and that the proxy is configured correctly. For more information, see Connect using an outbound proxy server.
You may see an error similar to the following:
An exception has occurred while trying to execute the cluster diagnostic checks in the cluster. Exception: Unable to pull cluster-diagnostic-checks helm chart from the registry 'mcr.microsoft.com/azurearck8s/helmchart/stable/clusterdiagnosticchecks:0.1.2': Error: failed to do request: Head "https://mcr.microsoft.com/v2/azurearck8s/helmchart/stable/clusterdiagnosticchecks/manifests/0.1.2": dial tcp xx.xx.xx.219:443: i/o timeout
This error occurs when the https://k8connecthelm.azureedge.net
endpoint is blocked. Be sure that your network allows connectivity to this endpoint and meets all of the other networking requirements.
Unable to retrieve MSI certificate
Problems retrieving the MSI certificate are usually due to network issues. Check to make sure all of the network requirements have been met, then try again.
Insufficient cluster permissions
If the provided kubeconfig file doesn't have sufficient permissions to install the Azure Arc agents, the Azure CLI command returns an error: Error: list: failed to list: secrets is forbidden: User "myuser" cannot list resource "secrets" in API group "" at the cluster scope
To resolve this issue, ensure that the user connecting the cluster to Azure Arc has the cluster-admin
role assigned.
Unable to connect OpenShift cluster to Azure Arc
If az connectedk8s connect
is timing out and failing when connecting an OpenShift cluster to Azure Arc:
Ensure that the OpenShift cluster meets the version prerequisites: 4.5.41+ or 4.6.35+ or 4.7.18+.
Before you run
az connectedk8s connnect
, run this command on the cluster:oc adm policy add-scc-to-user privileged system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
Installation timeouts
Connecting a Kubernetes cluster to Azure Arc-enabled Kubernetes requires installation of Azure Arc agents on the cluster. If the cluster is running over a slow internet connection, the container image pull for agents may take longer than the Azure CLI timeouts.
Helm timeout error
You may see the error Unable to install helm release: Error: UPGRADE Failed: time out waiting for the condition
. To resolve this issue, try the following steps:
Run the following command:
kubectl get pods -n azure-arc
Check if the
clusterconnect-agent
or theconfig-agent
pods are showingcrashloopbackoff
, or if not all containers are running:NAME READY STATUS RESTARTS AGE cluster-metadata-operator-664bc5f4d-chgkl 2/2 Running 0 4m14s clusterconnect-agent-7cb8b565c7-wklsh 2/3 CrashLoopBackOff 0 1m15s clusteridentityoperator-76d645d8bf-5qx5c 2/2 Running 0 4m15s config-agent-65d5df564f-lffqm 1/2 CrashLoopBackOff 0 1m14s
If the
azure-identity-certificate
isn't present, the system assigned managed identity hasn't been installed.kubectl get secret -n azure-arc -o yaml | grep name:
name: azure-identity-certificate
To resolve this issue, try deleting the Arc deployment by running the
az connectedk8s delete
command and reinstalling it. If the issue continues to happen, it could be an issue with your proxy settings. In that case, try connecting your cluster to Azure Arc via a proxy to connect your cluster to Arc via a proxy. Also verify that all of the network prerequisites have been met.If the
clusterconnect-agent
and theconfig-agent
pods are running, but thekube-aad-proxy
pod is missing, check your pod security policies. This pod uses theazure-arc-kube-aad-proxy-sa
service account, which doesn't have admin permissions but requires the permission to mount host path.If the
kube-aad-proxy
pod is stuck inContainerCreating
state, check whether the kube-aad-proxy certificate has been downloaded onto the cluster.kubectl get secret -n azure-arc -o yaml | grep name:
name: kube-aad-proxy-certificate
If the certificate is missing, delete the deployment and try onboarding again, using a different name for the cluster. If the problem continues, open a support request.
CryptoHash module error
When attempting to onboard Kubernetes clusters to the Azure Arc platform, the local environment (for example, your client console) may return the following error message:
Cannot load native module 'Crypto.Hash._MD5'
Sometimes, dependent modules fail to download successfully when adding the extensions connectedk8s
and k8s-configuration
through Azure CLI or Azure PowerShell. To fix this problem, manually remove and then add the extensions in the local environment.
To remove the extensions, use:
az extension remove --name connectedk8s
az extension remove --name k8s-configuration
To add the extensions, use:
az extension add --name connectedk8s
az extension add --name k8s-configuration
Cluster connect issues
If your cluster is behind an outbound proxy or firewall, verify that websocket connections are enabled for *.servicebus.chinacloudapi.cn
, which is required specifically for the Cluster Connect feature. Additionally, make sure you're using the latest version of the connectedk8s
Azure CLI extension if you're experiencing problems using cluster connect.
If the clusterconnect-agent
and kube-aad-proxy
pods are missing, then the cluster connect feature is likely disabled on the cluster. If so, az connectedk8s proxy
will fail to establish a session with the cluster, and you may see an error reading Cannot connect to the hybrid connection because no agent is connected in the target arc resource.
To resolve this error, enable the cluster connect feature on your cluster:
az connectedk8s enable-features --features cluster-connect -n $CLUSTER_NAME -g $RESOURCE_GROUP
For more information, see Use cluster connect to securely connect to Azure Arc-enabled Kubernetes clusters.
Enable custom locations using service principal
When connecting your cluster to Azure Arc or enabling custom locations on an existing cluster, you may see the following warning:
Unable to fetch oid of 'custom-locations' app. Proceeding without enabling the feature. Insufficient privileges to complete the operation.
This warning occurs when you use a service principal to log into Azure, and the service principal doesn't have the necessary permissions. To avoid this error, follow these steps:
Sign in into Azure CLI using your user account. Retrieve the Object ID of the Microsoft Entra application used by Azure Arc service:
az ad sp show --id bc313c14-388c-4e7d-a58e-70017303ee3b --query objectId -o tsv
Sign in into Azure CLI using the service principal. Use the
<objectId>
value from the previous step to enable custom locations on the cluster:- To enable custom locations when connecting the cluster to Arc, run
az connectedk8s connect -n <cluster-name> -g <resource-group-name> --custom-locations-oid <objectId>
- To enable custom locations on an existing Azure Arc-enabled Kubernetes cluster, run
az connectedk8s enable-features -n <cluster-name> -g <resource-group-name> --custom-locations-oid <objectId> --features cluster-connect custom-locations
- To enable custom locations when connecting the cluster to Arc, run
Next steps
- Get a visual walkthrough of how to diagnose connection issues.