Monitor Azure Machine Learning

When critical applications and business processes rely on Azure resources, you want to monitor those resources for availability, performance, and operation. This article describes the monitoring data generated by Azure Machine Learning and how to analyze and alert on this data with Azure Monitor.

Tip

The information in this document is primarily for administrators, because it describes monitoring for the Azure Machine Learning service and associated Azure services. If you are a data scientist or developer and want to monitor information specific to your model training runs, see the following documents:

If you want to monitor information generated by models deployed as web services or IoT Edge modules, see Collect model data and Monitor with Application Insights.

What is Azure Monitor?

Azure Machine Learning creates monitoring data using Azure Monitor, which is a full-stack monitoring service in Azure. Azure Monitor provides a complete set of features to monitor your Azure resources. It can also monitor resources in other clouds and on-premises.

Start with the article Monitoring Azure resources with Azure Monitor, which describes the following concepts:

  • What is Azure Monitor?
  • Costs associated with monitoring
  • Monitoring data collected in Azure
  • Configuring data collection
  • Standard tools in Azure for analyzing and alerting on monitoring data

The following sections build on that article by describing the specific data gathered for Azure Machine Learning. They also provide examples for configuring data collection and analyzing this data with Azure tools.

Tip

To understand costs associated with Azure Monitor, see Usage and estimated costs. To understand how long it takes for your data to appear in Azure Monitor, see Log data ingestion time.

Monitoring data from Azure Machine Learning

Azure Machine Learning collects the same kinds of monitoring data as other Azure resources, which are described in Monitoring data from Azure resources.

See Azure Machine Learning monitoring data reference for a detailed reference of the logs and metrics created by Azure Machine Learning.

Collection and routing

Platform metrics and the Activity log are collected and stored automatically, but you can route them to other locations by using a diagnostic setting.

Resource logs are not collected and stored until you create a diagnostic setting and route them to one or more locations.

See Create diagnostic setting to collect platform logs and metrics in Azure for the detailed process for creating a diagnostic setting using the Azure portal, CLI, or PowerShell. When you create a diagnostic setting, you specify which categories of logs to collect. The categories for Azure Machine Learning are listed in Azure Machine Learning monitoring data reference.

Important

Enabling these settings requires additional Azure services (storage account, event hub, or Log Analytics), which may increase your cost. To calculate an estimated cost, visit the Azure pricing calculator.

You can configure the following logs for Azure Machine Learning:

Category                      Description
AmlComputeClusterEvent        Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent    Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent            Events from jobs running on Azure Machine Learning compute.
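
As a rough illustration of what a diagnostic setting for these categories might contain, the sketch below builds a request body that enables the three log categories and routes them to a Log Analytics workspace. The property names (`workspaceId`, `logs`, `category`, `enabled`) follow the general shape of the Azure Monitor diagnostic settings resource as we understand it. The function name and the workspace resource ID are hypothetical placeholders, so verify the payload against the diagnostic settings reference before relying on it.

```python
import json

# Log categories that can be configured for Azure Machine Learning.
AML_LOG_CATEGORIES = [
    "AmlComputeClusterEvent",
    "AmlComputeClusterNodeEvent",
    "AmlComputeJobEvent",
]


def build_diagnostic_setting(log_analytics_workspace_id, categories=AML_LOG_CATEGORIES):
    """Return a dict shaped like a diagnostic setting body (assumed schema).

    Hypothetical helper for illustration; it only builds the payload and
    does not call any Azure API.
    """
    return {
        "properties": {
            "workspaceId": log_analytics_workspace_id,
            "logs": [{"category": name, "enabled": True} for name in categories],
        }
    }


if __name__ == "__main__":
    # Placeholder value; replace with a real Log Analytics workspace resource ID.
    placeholder_id = "<log-analytics-workspace-resource-id>"
    print(json.dumps(build_diagnostic_setting(placeholder_id), indent=2))
```

Keeping the category list in one place makes it easy to enable only a subset, for example just `AmlComputeJobEvent`, by passing a shorter list as the second argument.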

Note

When you enable metrics in a diagnostic setting, dimension information is not currently included in the information sent to a storage account, event hub, or Log Analytics.

The metrics and logs you can collect are discussed in the following sections.

Analyzing metrics

You can analyze metrics for Azure Machine Learning, along with metrics from other Azure services, by opening Metrics from the Azure Monitor menu. See Getting started with Azure Metrics Explorer for details on using this tool.

For a list of the platform metrics collected, see Monitoring Azure Machine Learning data reference metrics.

All metrics for Azure Machine Learning are in the namespace Machine Learning Service Workspace.

Figure: Metrics Explorer with Machine Learning Service Workspace selected

For reference, you can see a list of all resource metrics supported in Azure Monitor.

Tip

Azure Monitor metrics data is available for 90 days. However, a single chart can visualize only 30 days. For example, to visualize a 90-day period, you must break it into three charts of 30 days each.
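
The arithmetic behind this tip can be sketched as follows. This is a minimal illustration, not part of any Azure tooling: the function name `split_into_windows` and the example dates are assumptions chosen for the sketch. It divides a time range into consecutive windows of at most 30 days, one per chart.

```python
from datetime import datetime, timedelta


def split_into_windows(start, end, max_days=30):
    """Split [start, end) into consecutive windows of at most max_days days."""
    if end <= start:
        raise ValueError("end must be after start")
    step = timedelta(days=max_days)
    windows = []
    window_start = start
    while window_start < end:
        # The last window is shortened so it never runs past the end.
        window_end = min(window_start + step, end)
        windows.append((window_start, window_end))
        window_start = window_end
    return windows


if __name__ == "__main__":
    # Arbitrary example dates covering a 90-day period.
    end = datetime(2021, 4, 1)
    start = end - timedelta(days=90)
    for window_start, window_end in split_into_windows(start, end):
        print(window_start.date(), "->", window_end.date())
```

For a 90-day range the function returns three windows, and each pair can be used as the time range of one chart.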

Filtering and splitting

For metrics that support dimensions, you can apply filters using a dimension value. For example, you can filter Active Cores for a Cluster Name of cpu-cluster.

You can also split a metric by dimension to visualize how different segments of the metric compare with each other. For example, you can split out the Pipeline Step Type to see a count of the types of steps used in the pipeline.

For more information on filtering and splitting, see Advanced features of Azure Monitor.
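
To make the difference between the two operations concrete, here is a small conceptual sketch. It does not call Azure Monitor: the sample records, the field names, and the choice of a sum aggregation are all hypothetical and exist only to show that filtering keeps one dimension value while splitting produces one series per dimension value.

```python
from collections import defaultdict

# Hypothetical metric samples; real values come from Azure Monitor.
SAMPLES = [
    {"metric": "Active Cores", "cluster_name": "cpu-cluster", "value": 4},
    {"metric": "Active Cores", "cluster_name": "gpu-cluster", "value": 6},
    {"metric": "Active Cores", "cluster_name": "cpu-cluster", "value": 2},
]


def filter_by_dimension(records, dimension, value):
    """Keep only the records whose dimension equals the given value."""
    return [record for record in records if record.get(dimension) == value]


def split_by_dimension(records, dimension):
    """Sum the metric values separately for each value of the dimension."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dimension]] += record["value"]
    return dict(totals)


if __name__ == "__main__":
    print(filter_by_dimension(SAMPLES, "cluster_name", "cpu-cluster"))
    print(split_by_dimension(SAMPLES, "cluster_name"))
```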

Analyzing logs

Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and enable Send information to Log Analytics. For more information, see the Collection and routing section.

Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique properties. Azure Machine Learning stores data in the following tables:

Table                         Description
AmlComputeClusterEvent        Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent    Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent            Events from jobs running on Azure Machine Learning compute.

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics opens with the query scope set to the current workspace. This means that log queries only include data from that resource. If you want to run a query that includes data from other workspaces or from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring data reference.

Sample Kusto queries

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics opens with the query scope set to the current Azure Machine Learning workspace. This means that log queries only include data from that resource. If you want to run a query that includes data from other workspaces or from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

The following queries can help you monitor your Azure Machine Learning resources:

  • Get failed jobs in the last five days:

    AmlComputeJobEvent
    | where TimeGenerated > ago(5d) and EventType == "JobFailed"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get records for a specific job name:

    AmlComputeJobEvent
    | where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get cluster events in the last five days for clusters where the VM size is Standard_D1_V2:

    AmlComputeClusterEvent
    | where TimeGenerated > ago(5d) and VmSize == "STANDARD_D1_V2"
    | project ClusterName, InitialNodeCount, MaximumNodeCount, QuotaAllocated, QuotaUtilized
    
  • Get nodes allocated in the last eight days:

    AmlComputeClusterNodeEvent
    | where TimeGenerated > ago(8d) and NodeAllocationTime > ago(8d)
    | distinct NodeId
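
If you run queries like these from a script, it can help to generate the query text with the lookback period as a parameter so the description and the `ago()` value cannot drift apart. The sketch below only builds the query string for the failed-jobs example; the function name `failed_jobs_query` is hypothetical, and sending the query to Log Analytics is left out.

```python
def failed_jobs_query(days=5):
    """Build the Kusto query for failed jobs in the last `days` days.

    Hypothetical helper for illustration; it returns query text only.
    """
    # Reject bool explicitly because bool is a subclass of int in Python.
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return "\n".join(
        [
            "AmlComputeJobEvent",
            f'| where TimeGenerated > ago({days}d) and EventType == "JobFailed"',
            "| project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType",
        ]
    )


if __name__ == "__main__":
    print(failed_jobs_query())
    print()
    print(failed_jobs_query(days=14))
```

Validating `days` as a positive integer keeps arbitrary text from being interpolated into the query.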
    

Alerts

You can access alerts for Azure Machine Learning by opening Alerts from the Azure Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details on creating alerts.

The following table lists common and recommended metric alert rules for Azure Machine Learning:

Alert type                      Condition                                                               Description
Model Deploy Failed             Aggregation type: Total, Operator: Greater than, Threshold value: 0     When one or more model deployments have failed
Quota Utilization Percentage    Aggregation type: Average, Operator: Greater than, Threshold value: 90  When the quota utilization percentage is greater than 90%
Unusable Nodes                  Aggregation type: Total, Operator: Greater than, Threshold value: 0     When there are one or more unusable nodes
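
To show how the three parts of each condition fit together, the following sketch evaluates the rules above against a list of metric samples: it aggregates the samples, then compares the result with the threshold. This is a conceptual model only, not how Azure Monitor implements alerting. The names `RECOMMENDED_RULES` and `rule_fires` and the sample values are assumptions made for the illustration.

```python
from statistics import mean

# Aggregation types and operators used by the rules in the table above.
AGGREGATIONS = {"Total": sum, "Average": mean}
OPERATORS = {"Greater than": lambda value, threshold: value > threshold}

# (aggregation type, operator, threshold value) for each alert type.
RECOMMENDED_RULES = {
    "Model Deploy Failed": ("Total", "Greater than", 0),
    "Quota Utilization Percentage": ("Average", "Greater than", 90),
    "Unusable Nodes": ("Total", "Greater than", 0),
}


def rule_fires(alert_type, samples):
    """Return True if the aggregated samples satisfy the rule's condition.

    `samples` must be a non-empty list of numbers for one evaluation window.
    """
    aggregation, operator, threshold = RECOMMENDED_RULES[alert_type]
    aggregated = AGGREGATIONS[aggregation](samples)
    return OPERATORS[operator](aggregated, threshold)


if __name__ == "__main__":
    # Hypothetical samples for one evaluation window.
    print(rule_fires("Model Deploy Failed", [0, 0, 1]))
    print(rule_fires("Quota Utilization Percentage", [85, 92]))
    print(rule_fires("Unusable Nodes", [0, 0]))
```

Note that the Average rule can stay quiet even when a single sample exceeds 90, because the comparison is made on the aggregated value rather than on each sample.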

Next steps