Configure GPU monitoring with Azure Monitor for containers

Starting with agent version ciprod03022019, the Azure Monitor for containers integrated agent supports monitoring GPU (graphics processing unit) usage on GPU-aware Kubernetes cluster nodes, and monitoring pods and containers that request and use GPU resources.

Supported GPU vendors

Azure Monitor for containers supports monitoring GPU clusters from the following GPU vendors:

Azure Monitor for containers automatically starts monitoring GPU usage on nodes, as well as pods and workloads that request GPUs, by collecting the following metrics at 60-second intervals and storing them in the InsightsMetrics table:

| Metric name | Metric dimensions (tags) | Description |
|---|---|---|
| containerGpuDutyCycle | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Percentage of time over the past sample period (60 seconds) during which the GPU was busy or actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpuRequests | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory, in bytes, available to a specific container. |
| containerGpumemoryUsedBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU memory, in bytes, used by a specific container. |
| nodeGpuAllocatable | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Number of GPUs in a node that can be used by Kubernetes. |
| nodeGpuCapacity | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Total number of GPUs in a node. |
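
As an illustration of how the memory metrics in the table can be combined, the following Python sketch computes per-container GPU memory utilization from rows exported from the InsightsMetrics table. It is not part of Azure Monitor. The row layout (a `Name` field, a numeric `Val` field, and a `Tags` field holding the dimensions as a JSON string), the helper name `gpu_memory_utilization`, and the sample values are assumptions made for this example.

```python
import json


def gpu_memory_utilization(rows):
    """Return {(containerName, gpuId): used_bytes / total_bytes}.

    Each row is assumed to be a dict with 'Name', 'Val', and 'Tags' keys,
    where 'Tags' is a JSON string holding the metric dimensions.
    """
    used = {}
    total = {}
    for row in rows:
        tags = json.loads(row["Tags"])
        key = (tags.get("containerName"), tags.get("gpuId"))
        if row["Name"] == "containerGpumemoryUsedBytes":
            used[key] = row["Val"]
        elif row["Name"] == "containerGpumemoryTotalBytes":
            total[key] = row["Val"]
    # Skip keys that have no total, or a total of zero, to avoid dividing by zero.
    return {key: used[key] / total[key] for key in used if total.get(key)}


# Hypothetical sample rows; the container name and values are made up.
sample_tags = json.dumps({"containerName": "trainer", "gpuId": "0"})
sample_rows = [
    {"Name": "containerGpumemoryUsedBytes", "Val": 4 * 1024**3, "Tags": sample_tags},
    {"Name": "containerGpumemoryTotalBytes", "Val": 16 * 1024**3, "Tags": sample_tags},
]

utilization = gpu_memory_utilization(sample_rows)
```

With the sample rows above, the container uses 4 GiB of the 16 GiB available to it, so the computed utilization for that container and GPU is one quarter.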

GPU performance charts

Azure Monitor for containers includes preconfigured charts for the metrics listed in the table above, provided as a GPU workbook for every cluster. You can find the Node GPU workbook directly from an AKS cluster by selecting Workbooks in the left pane, or from the View Workbooks drop-down list in Insights.
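
If you want to chart one of these metrics outside the workbook, a log query over the InsightsMetrics table is one option. The following Python sketch only builds the query text. The helper name `build_gpu_query` is hypothetical, and the column names `TimeGenerated`, `Name`, `Val`, and `Tags` are assumed from the standard InsightsMetrics schema, so verify them against your workspace before relying on the result.

```python
def build_gpu_query(metric_name, bin_minutes=5):
    """Build a Kusto query that averages one GPU metric per container over time.

    Hypothetical helper: it returns the query as a string and does not
    connect to Azure or run anything.
    """
    # Metric names in the table are plain alphanumeric identifiers, so reject
    # anything else rather than splicing arbitrary text into the query.
    if not metric_name.isalnum():
        raise ValueError("metric_name must be alphanumeric")
    if bin_minutes < 1:
        raise ValueError("bin_minutes must be at least 1")
    return (
        "InsightsMetrics\n"
        f'| where Name == "{metric_name}"\n'
        "| extend tags = parse_json(Tags)\n"
        f"| summarize avg(Val) by bin(TimeGenerated, {bin_minutes}m), "
        "containerName = tostring(tags.containerName)"
    )


duty_cycle_query = build_gpu_query("containerGpuDutyCycle")
print(duty_cycle_query)
```

The printed text can be pasted into the Logs query editor for the workspace that the cluster reports to.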

Next steps