Monitoring Azure Machine Learning

This article describes the monitoring data generated by Azure Machine Learning. It also describes how you can use Azure Monitor to analyze your data and define alerts.

Tip

The information in this document is primarily for administrators, as it describes monitoring for Azure Machine Learning. If you are a data scientist or developer and want to monitor information specific to your model training runs, see the following documents:

Azure Monitor

Azure Machine Learning logs monitoring data using Azure Monitor, which is a full-stack monitoring service in Azure. Azure Monitor provides a complete set of features to monitor your Azure resources. It can also monitor resources in other clouds and on-premises.

Start with the Azure Monitor overview article, which summarizes the monitoring capabilities. The following sections build on this information by providing specifics of using Azure Monitor with Azure Machine Learning.

To understand costs associated with Azure Monitor, see Usage and estimated costs. To understand the time it takes for your data to appear in Azure Monitor, see Log data ingestion time.

Monitoring data from Azure Machine Learning

Azure Machine Learning collects the same kinds of monitoring data as other Azure resources, which are described in Monitoring data from Azure resources. See Azure Machine Learning monitoring data reference for a detailed reference of the logs and metrics created by Azure Machine Learning.

Analyzing metric data

You can analyze metrics for Azure Machine Learning by opening Metrics from the Azure Monitor menu. See Getting started with Azure Metrics Explorer for details on using this tool.

All metrics for Azure Machine Learning are in the namespace Machine Learning Service Workspace.

Figure: Metrics Explorer with Machine Learning Service Workspace selected

Filtering and splitting

For metrics that support dimensions, you can apply filters using a dimension value. For example, you can filter Active Cores for a Cluster Name of cpu-cluster.

You can also split a metric by dimension to visualize how different segments of the metric compare with each other. For example, you can split out the Pipeline Step Type to see a count of the types of steps used in the pipeline.
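
The difference between filtering and splitting can be sketched in a few lines of Python. This is only an illustration over made-up in-memory samples, not the Azure Monitor API: the `MetricPoint` type, the helper functions, and all values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    """One hypothetical metric sample with its dimension values."""
    metric: str
    value: float
    dimensions: tuple  # (dimension name, dimension value) pairs

    def dim(self, name):
        """Return the value of the named dimension, or None if absent."""
        return dict(self.dimensions).get(name)


def filter_by_dimension(points, name, wanted):
    """Filtering: keep only the points whose dimension equals `wanted`."""
    return [p for p in points if p.dim(name) == wanted]


def split_by_dimension(points, name):
    """Splitting: total the values separately for each dimension value."""
    totals = defaultdict(float)
    for p in points:
        totals[p.dim(name)] += p.value
    return dict(totals)


# Made-up sample data, for illustration only.
points = [
    MetricPoint("Active Cores", 4, (("Cluster Name", "cpu-cluster"),)),
    MetricPoint("Active Cores", 8, (("Cluster Name", "gpu-cluster"),)),
    MetricPoint("Active Cores", 2, (("Cluster Name", "cpu-cluster"),)),
]

cpu_only = filter_by_dimension(points, "Cluster Name", "cpu-cluster")
per_cluster = split_by_dimension(points, "Cluster Name")

print(sum(p.value for p in cpu_only))  # total for cpu-cluster only
print(per_cluster)                     # one total per cluster
```

Filtering narrows the data to a single series, while splitting keeps all the data and produces one series per dimension value.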

For more information on filtering and splitting, see Advanced features of Azure Monitor.

Alerts

You can access alerts for Azure Machine Learning by opening Alerts from the Azure Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details on creating alerts.

The following table lists common and recommended metric alert rules for Azure Machine Learning:

Alert type | Condition | Description
Model Deploy Failed | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When one or more model deployments have failed
Quota Utilization Percentage | Aggregation type: Average, Operator: Greater than, Threshold value: 90 | When the quota utilization percentage is greater than 90%
Unusable Nodes | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When there are one or more unusable nodes
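
To make the conditions in the table concrete, here is a minimal Python sketch of how such a rule could be evaluated over one window of samples. It is a simplified model rather than Azure Monitor's implementation: the `AlertRule` type, the `fires` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical, simplified model of a metric alert rule."""
    metric: str
    aggregation: str  # "Total" or "Average"
    threshold: float  # the rule fires when the aggregate is greater than this


def aggregate(values, aggregation):
    """Reduce one evaluation window to a single number."""
    if aggregation == "Total":
        return sum(values)
    if aggregation == "Average":
        return mean(values)
    raise ValueError(f"unsupported aggregation: {aggregation}")


def fires(rule, values):
    """Return True when the aggregated window exceeds the threshold."""
    return bool(values) and aggregate(values, rule.aggregation) > rule.threshold


# The three rules from the table above.
RULES = [
    AlertRule("Model Deploy Failed", "Total", 0),
    AlertRule("Quota Utilization Percentage", "Average", 90),
    AlertRule("Unusable Nodes", "Total", 0),
]

# Made-up samples for one evaluation window.
window = {
    "Model Deploy Failed": [0, 1, 0],
    "Quota Utilization Percentage": [85.0, 88.0, 91.0],
    "Unusable Nodes": [0, 0, 0],
}

for rule in RULES:
    print(rule.metric, fires(rule, window[rule.metric]))
```

In this sample window only the first rule fires: the quota average stays below 90 even though one sample exceeds it, which shows why the aggregation type matters as much as the threshold.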

Configuration

Important

Metrics for Azure Machine Learning do not need to be configured. They are collected automatically and are available in Metrics Explorer for monitoring and alerting.

You can add a diagnostic setting to configure the following functionality:

  • Archive log and metrics information to an Azure storage account.
  • Stream log and metrics information to an Azure Event Hub.
  • Send log and metrics information to Azure Monitor Log Analytics.

Enabling these settings requires additional Azure services (storage account, event hub, or Log Analytics), which may increase your cost. To calculate an estimated cost, visit the Azure pricing calculator.

For more information on creating a diagnostic setting, see Create diagnostic setting to collect platform logs and metrics in Azure.

You can configure the following logs for Azure Machine Learning:

Category | Description
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent | Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute.
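
Putting the destinations and log categories together, a diagnostic setting might be described roughly as in the following Python sketch, which builds a request body as a plain dictionary. The property names (`workspaceId`, `storageAccountId`, `eventHubAuthorizationRuleId`, `eventHubName`, and the `AllMetrics` category) are an assumption based on the Azure Monitor diagnostic settings REST API and should be verified against its reference. The `build_diagnostic_setting` helper and the resource ID are hypothetical placeholders.

```python
import json

# Log categories listed in the table above.
LOG_CATEGORIES = [
    "AmlComputeClusterEvent",
    "AmlComputeClusterNodeEvent",
    "AmlComputeJobEvent",
]

# Assumed property names for the three kinds of destination.
DESTINATION_KEYS = ("workspaceId", "storageAccountId", "eventHubAuthorizationRuleId")


def build_diagnostic_setting(workspace_id=None, storage_account_id=None,
                             event_hub_rule_id=None, event_hub_name=None):
    """Build a diagnostic-setting body; at least one destination is required."""
    properties = {
        "logs": [{"category": c, "enabled": True} for c in LOG_CATEGORIES],
        # Assumption: "AllMetrics" is the metrics category name.
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    }
    if workspace_id:
        properties["workspaceId"] = workspace_id
    if storage_account_id:
        properties["storageAccountId"] = storage_account_id
    if event_hub_rule_id:
        properties["eventHubAuthorizationRuleId"] = event_hub_rule_id
        if event_hub_name:
            properties["eventHubName"] = event_hub_name
    if not any(key in properties for key in DESTINATION_KEYS):
        raise ValueError("specify at least one destination")
    return {"properties": properties}


# Placeholder resource ID, not a real resource.
body = build_diagnostic_setting(
    workspace_id="/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
                 "/providers/Microsoft.OperationalInsights/workspaces/<log-analytics-workspace>"
)
print(json.dumps(body, indent=2))
```

The sketch only constructs the payload; sending it requires an authenticated call, which is out of scope here.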

Note

When you enable metrics in a diagnostic setting, dimension information is not currently included in the information sent to a storage account, event hub, or Log Analytics.

Analyzing log data

Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and enable Send information to Log Analytics. For more information, see the Configuration section.

Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique properties. Azure Machine Learning stores data in the following tables:

Table | Description
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent | Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute.

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics opens with the query scope set to the current workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other databases or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring data reference.

Sample queries

The following queries can help you monitor your Azure Machine Learning resources:

  • Get failed jobs in the last five days:

    AmlComputeJobEvent
    | where TimeGenerated > ago(5d) and EventType == "JobFailed"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get records for a specific job name:

    AmlComputeJobEvent
    | where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get cluster events in the last five days for clusters where the VM size is Standard_D1_V2:

    AmlComputeClusterEvent
    | where TimeGenerated > ago(5d) and VmSize == "STANDARD_D1_V2"
    | project ClusterName, InitialNodeCount, MaximumNodeCount, QuotaAllocated, QuotaUtilized
    
  • Get nodes allocated in the last eight days:

    AmlComputeClusterNodeEvent
    | where TimeGenerated > ago(8d) and NodeAllocationTime > ago(8d)
    | distinct NodeId
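
When queries like these are reused with different time windows, it can help to generate the query text from a single parameter. The following Python sketch does that for the failed-jobs query. The helper names are hypothetical, and the code only produces query text to paste into Log Analytics; it does not run anything against Azure.

```python
def time_window_filter(days):
    """Return a KQL time filter for the last `days` days."""
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return f"TimeGenerated > ago({days}d)"


def failed_jobs_query(days):
    """Return KQL text that lists failed jobs within the last `days` days."""
    return "\n".join([
        "AmlComputeJobEvent",
        f'| where {time_window_filter(days)} and EventType == "JobFailed"',
        "| project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType",
    ])


# Print the query for a five-day window.
print(failed_jobs_query(5))
```

Deriving the window from one argument keeps the stated time range and the `ago(...)` value in the query from drifting apart.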
    

Next steps