Troubleshooting Azure Monitor for containers

When you configure monitoring of your Azure Kubernetes Service (AKS) cluster with Azure Monitor for containers, you may encounter an issue that prevents data collection or status reporting. This article describes some common issues and the steps to troubleshoot them.

Authorization error during onboarding or update operation

While enabling Azure Monitor for containers or updating a cluster to support collecting metrics, you may receive an error resembling the following: The client '<user’s Identity>' with object id '<user’s objectId>' does not have authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope

During the onboarding or update process, the service attempts to grant the Monitoring Metrics Publisher role assignment on the cluster resource. The user who initiates the process to enable Azure Monitor for containers, or the update to support the collection of metrics, must have the Microsoft.Authorization/roleAssignments/write permission on the AKS cluster resource scope. Only members of the Owner and User Access Administrator built-in roles are granted this permission. If your security policies require granular permissions, we recommend that you review custom roles and assign one to the users who require it.

You can also manually grant this role from the Azure portal by performing the following steps:

  1. Sign in to the Azure portal.
  2. In the Azure portal, click All services in the upper left-hand corner. In the list of resources, type Kubernetes. As you begin typing, the list filters based on your input. Select Azure Kubernetes.
  3. In the list of Kubernetes clusters, select a cluster.
  4. From the left-hand menu, click Access control (IAM).
  5. Select + Add to add a role assignment, and then select the Monitoring Metrics Publisher role. Under the Select box, type AKS to filter the results to the cluster service principals defined in the subscription. Select the one from the list that is specific to that cluster.
  6. Select Save to finish assigning the role.
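
The same role assignment can also be scripted with the Azure CLI. The following is a minimal sketch, not the documented procedure: the function name grant_metrics_publisher is hypothetical, and it assumes the cluster uses a service principal (so that servicePrincipalProfile.clientId returns a real client ID) and that the signed-in user holds the roleAssignments/write permission described above.

```shell
# Hypothetical helper: grant the Monitoring Metrics Publisher role to the
# cluster's service principal, scoped to the AKS cluster resource.
# Arguments: <resource-group> <cluster-name>
grant_metrics_publisher() (
  rg="$1"
  cluster="$2"

  # Client ID of the service principal used by the cluster (assumption:
  # the cluster is service-principal based).
  client_id=$(az aks show --resource-group "$rg" --name "$cluster" \
    --query servicePrincipalProfile.clientId --output tsv) || exit 1

  # Full resource ID of the cluster, used as the role assignment scope.
  cluster_id=$(az aks show --resource-group "$rg" --name "$cluster" \
    --query id --output tsv) || exit 1

  az role assignment create --assignee "$client_id" \
    --scope "$cluster_id" --role "Monitoring Metrics Publisher"
)
```

For example, grant_metrics_publisher myResourceGroup myAKSCluster, where both names are placeholders for your own resource group and cluster.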

Azure Monitor for containers is enabled but not reporting any information

If Azure Monitor for containers is successfully enabled and configured, but you cannot view status information or a log query returns no results, diagnose the problem by following these steps:

  1. Check the status of the agent by running the command:

    kubectl get ds omsagent --namespace=kube-system

    The output should resemble the following example, which indicates that it was deployed properly:

    User@aksuser:~$ kubectl get ds omsagent --namespace=kube-system
    NAME       DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
    omsagent   2         2         2         2            2           beta.kubernetes.io/os=linux   1d
    
  2. If you have Windows Server nodes, check the status of the agent by running the command:

    kubectl get ds omsagent-win --namespace=kube-system

    The output should resemble the following example, which indicates that it was deployed properly:

    User@aksuser:~$ kubectl get ds omsagent-win --namespace=kube-system
    NAME                   DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
    omsagent-win           2         2         2         2            2           beta.kubernetes.io/os=windows   1d
    
  3. Check the deployment status with agent version 06072018 or later using the command:

    kubectl get deployment omsagent-rs -n=kube-system

    The output should resemble the following example, which indicates that it was deployed properly:

    User@aksuser:~$ kubectl get deployment omsagent-rs -n=kube-system
    NAME          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE    AGE
    omsagent-rs   1         1         1            1            3h
    
  4. Check the status of the pod to verify that it is running using the command:

    kubectl get pods --namespace=kube-system

    The output should resemble the following example with a status of Running for the omsagent:

    User@aksuser:~$ kubectl get pods --namespace=kube-system
    NAME                                READY     STATUS    RESTARTS   AGE
    aks-ssh-139866255-5n7k5             1/1       Running   0          8d
    azure-vote-back-4149398501-7skz0    1/1       Running   0          22d
    azure-vote-front-3826909965-30n62   1/1       Running   0          22d
    omsagent-484hw                      1/1       Running   0          1d
    omsagent-fkq7g                      1/1       Running   0          1d
    omsagent-win-6drwq                  1/1       Running   0          1d
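
The DESIRED and READY columns checked in the steps above can also be compared in a script. The sketch below is illustrative only: the function name ds_is_ready is hypothetical, and it assumes the agent DaemonSets live in the kube-system namespace as shown in the commands above. It reads the desiredNumberScheduled and numberReady fields of the DaemonSet status and succeeds only when they match.

```shell
# Hypothetical helper: succeed (exit status 0) when every desired agent pod
# of the given DaemonSet in kube-system reports ready.
# Argument: <daemonset-name>, for example omsagent or omsagent-win
ds_is_ready() (
  # Print "<desired> <ready>" from the DaemonSet status.
  counts=$(kubectl get ds "$1" --namespace=kube-system \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}') || exit 1

  desired=${counts% *}   # text before the space
  ready=${counts#* }     # text after the space

  [ -n "$desired" ] && [ "$desired" = "$ready" ]
)
```

A possible use is ds_is_ready omsagent || echo "omsagent pods are not all ready", followed by kubectl get pods --namespace=kube-system to see which pod is failing.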
    

Error messages

The table below summarizes known errors you may encounter while using Azure Monitor for containers.

| Error message | Action |
|---|---|
| Error message: No data for selected filters | It may take some time to establish monitoring data flow for newly created clusters. Allow at least 10 to 15 minutes for data to appear for your cluster. |
| Error message: Error retrieving data | While an Azure Kubernetes Service cluster is being set up for health and performance monitoring, a connection is established between the cluster and an Azure Log Analytics workspace. The Log Analytics workspace stores all monitoring data for your cluster. This error may occur when your Log Analytics workspace has been deleted. Check whether the workspace was deleted. If it was, re-enable monitoring of your cluster with Azure Monitor for containers and specify an existing workspace or create a new one. To re-enable, disable monitoring for the cluster and then enable Azure Monitor for containers again. |
| Error retrieving data after adding Azure Monitor for containers through the az aks CLI | When you enable monitoring using the az aks CLI, Azure Monitor for containers may not be properly deployed. Check whether the solution is deployed. To verify, go to your Log Analytics workspace and select Solutions from the pane on the left-hand side to see whether the solution is available. To resolve this issue, redeploy the solution by following the instructions on how to deploy Azure Monitor for containers. |
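
For the Error retrieving data case, the disable-then-enable sequence can be sketched with the Azure CLI as follows. This is an illustration under assumptions rather than the documented procedure: the function name reenable_monitoring is hypothetical, and the third argument is the full resource ID of an existing Log Analytics workspace that you supply.

```shell
# Hypothetical helper: disable the monitoring add-on, then enable it again
# against a chosen Log Analytics workspace.
# Arguments: <resource-group> <cluster-name> <workspace-resource-id>
reenable_monitoring() (
  rg="$1"
  cluster="$2"
  workspace_id="$3"

  # Stop here if the add-on cannot be disabled.
  az aks disable-addons --addons monitoring \
    --resource-group "$rg" --name "$cluster" || exit 1

  az aks enable-addons --addons monitoring \
    --resource-group "$rg" --name "$cluster" \
    --workspace-resource-id "$workspace_id"
)
```

After the add-on is enabled again, allow the 10 to 15 minutes noted in the table before expecting data to appear.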

A troubleshooting script is also available to help diagnose the problem.

Azure Monitor for containers agent ReplicaSet Pods are not scheduled on a non-Azure Kubernetes cluster

Azure Monitor for containers agent ReplicaSet Pods depend on the following node selectors on the worker (or agent) nodes for scheduling:

nodeSelector:
  beta.kubernetes.io/os: linux
  kubernetes.io/role: agent

If your worker nodes don’t have these node labels attached, the agent ReplicaSet Pods will not be scheduled. Refer to Kubernetes assign label selectors for instructions on how to attach the label.
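
As an illustration of attaching the missing label, the sketch below wraps kubectl label. The function name label_agent_node is hypothetical, and the sketch assumes that only the kubernetes.io/role label is missing, since the operating system label is normally set by the kubelet on each node.

```shell
# Hypothetical helper: attach the role label that the agent ReplicaSet
# node selector expects.
# Argument: <node-name>, as listed by "kubectl get nodes"
label_agent_node() {
  kubectl label nodes "$1" kubernetes.io/role=agent
}
```

To see which labels a node already carries before changing anything, run kubectl get nodes --show-labels.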

Performance charts don't show CPU or memory of nodes and containers on a non-Azure cluster

Azure Monitor for containers agent Pods use the cAdvisor endpoint on the node agent to gather performance metrics. Verify that the containerized agent on the node is configured to allow cAdvisor port 10255 to be opened on all nodes in the cluster so that performance metrics can be collected.
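
One way to check the port from a machine that can reach the node network is a simple HTTP probe. This is a rough sketch under assumptions: the function name cadvisor_reachable is hypothetical, and the /metrics/cadvisor path assumes a kubelet version that serves cAdvisor metrics at that path on the read-only port, so the path may differ on your cluster.

```shell
# Hypothetical helper: succeed when the cAdvisor metrics endpoint answers on
# port 10255 of the given node.
# Argument: <node-ip-or-hostname>
cadvisor_reachable() {
  # --fail turns HTTP errors into a non-zero exit status;
  # --max-time bounds the wait when the port is blocked.
  curl --silent --fail --max-time 5 --output /dev/null \
    "http://$1:10255/metrics/cadvisor"
}
```

Node addresses for the probe can be listed with kubectl get nodes -o wide.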

Non-Azure Kubernetes clusters are not showing in Azure Monitor for containers

To view a non-Azure Kubernetes cluster in Azure Monitor for containers, Read access is required on the Log Analytics workspace supporting this Insight and on the Container Insights solution resource ContainerInsights (workspace).

Next steps

With monitoring enabled to capture health metrics for both the AKS cluster nodes and pods, these health metrics are available in the Azure portal. To learn how to use Azure Monitor for containers, see View Azure Kubernetes Service health.