Azure Machine Learning monitoring data reference

Learn about the data and resources collected by Azure Monitor from your Azure Machine Learning workspace. See Monitoring Azure Machine Learning for details on collecting and analyzing monitoring data.

Resource logs

The following tables list the properties of Azure Machine Learning resource logs when they're collected in Azure Monitor Logs or Azure Storage.

AmlComputeJobEvents table

| Property | Description |
|---|---|
| TimeGenerated | Time when the log entry was generated |
| OperationName | Name of the operation associated with the log event |
| Category | Name of the log event, AmlComputeClusterNodeEvent |
| JobId | ID of the job submitted |
| ExperimentId | ID of the experiment |
| ExperimentName | Name of the experiment |
| CustomerSubscriptionId | SubscriptionId where the experiment and job were submitted |
| WorkspaceName | Name of the machine learning workspace |
| ClusterName | Name of the cluster |
| ProvisioningState | State of the job submission |
| ResourceGroupName | Name of the resource group |
| JobName | Name of the job |
| ClusterId | ID of the cluster |
| EventType | Type of the job event, for example JobSubmitted, JobRunning, JobFailed, or JobSucceeded |
| ExecutionState | State of the job (the run), for example Queued, Running, Succeeded, or Failed |
| ErrorDetails | Details of the job error |
| CreationApiVersion | API version used to create the job |
| ClusterResourceGroupName | Resource group name of the cluster |
| TFWorkerCount | Count of TF workers |
| TFParameterServerCount | Count of TF parameter servers |
| ToolType | Type of tool used |
| RunInContainer | Flag describing whether the job should be run inside a container |
| JobErrorMessage | Detailed message of the job error |
| NodeId | ID of the node created where the job is running |
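As an illustration of how these properties can be used together, the following minimal sketch builds a Kusto (KQL) query string that lists failed jobs on one cluster. It assumes the resource logs are sent to a Log Analytics workspace where the AmlComputeJobEvents table is available. The helper name `build_failed_jobs_query` and the cluster name `cpu-cluster` are hypothetical, and the sketch only constructs the query text; running it against a workspace is left to whichever query tool you use.

```python
from datetime import timedelta


def build_failed_jobs_query(cluster_name: str,
                            lookback: timedelta = timedelta(days=1)) -> str:
    """Return a KQL query listing failed jobs on one cluster (hypothetical helper)."""
    # Express the lookback window in whole hours, with a minimum of one hour.
    hours = max(1, int(lookback.total_seconds() // 3600))
    # Escape backslashes and single quotes so the name is a valid KQL string literal.
    safe_name = cluster_name.replace("\\", "\\\\").replace("'", "\\'")
    return "\n".join([
        "AmlComputeJobEvents",
        f"| where TimeGenerated > ago({hours}h)",
        f"| where ClusterName == '{safe_name}'",
        "| where EventType == 'JobFailed'",
        "| project TimeGenerated, JobId, JobName, ExperimentName, "
        "ErrorDetails, JobErrorMessage",
        "| order by TimeGenerated desc",
    ])


# "cpu-cluster" is a made-up cluster name used only for illustration.
query = build_failed_jobs_query("cpu-cluster", timedelta(hours=6))
print(query)
```

The projected columns are all taken from the table above; adjust the `project` line to whichever properties matter for your investigation.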

AmlComputeClusterEvents table

| Property | Description |
|---|---|
| TimeGenerated | Time when the log entry was generated |
| OperationName | Name of the operation associated with the log event |
| Category | Name of the log event, AmlComputeClusterNodeEvent |
| ProvisioningState | Provisioning state of the cluster |
| ClusterName | Name of the cluster |
| ClusterType | Type of the cluster |
| CreatedBy | User who created the cluster |
| CoreCount | Count of the cores in the cluster |
| VmSize | VM size of the cluster |
| VmPriority | Priority of the nodes created inside the cluster: Dedicated or LowPriority |
| ScalingType | Type of cluster scaling: manual or auto |
| InitialNodeCount | Initial node count of the cluster |
| MinimumNodeCount | Minimum node count of the cluster |
| MaximumNodeCount | Maximum node count of the cluster |
| NodeDeallocationOption | How the node should be deallocated |
| Publisher | Publisher of the cluster type |
| Offer | Offer with which the cluster is created |
| Sku | SKU of the node/VM created inside the cluster |
| Version | Version of the image used when the node/VM is created |
| SubnetId | SubnetId of the cluster |
| AllocationState | Cluster allocation state |
| CurrentNodeCount | Current node count of the cluster |
| TargetNodeCount | Target node count of the cluster while scaling up or down |
| EventType | Type of event during cluster creation |
| NodeIdleTimeSecondsBeforeScaleDown | Idle time in seconds before the cluster is scaled down |
| PreemptedNodeCount | Preempted node count of the cluster |
| IsResizeGrow | Flag indicating that the cluster is scaling up |
| VmFamilyName | Name of the VM family of the nodes that can be created inside the cluster |
| LeavingNodeCount | Leaving node count of the cluster |
| UnusableNodeCount | Unusable node count of the cluster |
| IdleNodeCount | Idle node count of the cluster |
| RunningNodeCount | Running node count of the cluster |
| PreparingNodeCount | Preparing node count of the cluster |
| QuotaAllocated | Quota allocated to the cluster |
| QuotaUtilized | Quota utilized by the cluster |
| AllocationStateTransitionTime | Transition time from one state to another |
| ClusterErrorCodes | Error code received during cluster creation or scaling |
| CreationApiVersion | API version used while creating the cluster |

AmlComputeClusterNodeEvents table

| Property | Description |
|---|---|
| TimeGenerated | Time when the log entry was generated |
| OperationName | Name of the operation associated with the log event |
| Category | Name of the log event, AmlComputeClusterNodeEvent |
| ClusterName | Name of the cluster |
| NodeId | ID of the cluster node created |
| VmSize | VM size of the node |
| VmFamilyName | VM family to which the node belongs |
| VmPriority | Priority of the node created: Dedicated or LowPriority |
| Publisher | Publisher of the VM image, for example microsoft-dsvm |
| Offer | Offer associated with the VM creation |
| Sku | SKU of the node/VM created |
| Version | Version of the image used when the node/VM is created |
| ClusterCreationTime | Time when the cluster was created |
| ResizeStartTime | Time when the cluster scale up/down started |
| ResizeEndTime | Time when the cluster scale up/down ended |
| NodeAllocationTime | Time when the node was allocated |
| NodeBootTime | Time when the node was booted up |
| StartTaskStartTime | Time when the task was assigned to a node and started |
| StartTaskEndTime | Time when the task assigned to a node ended |
| TotalE2ETimeInSeconds | Total time, in seconds, that the node was active |
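The timestamp properties in the AmlComputeClusterNodeEvents table can be combined to see where node startup time goes. The sketch below is illustrative only: it assumes the timestamps arrive as ISO 8601 strings and that allocation, boot, and start task happen in that order, neither of which is stated in the table. The function names and the sample record are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed format), accepting a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def node_startup_breakdown(event: dict) -> dict:
    """Split node startup into phases, in seconds, from one node event record."""
    allocated = parse_ts(event["NodeAllocationTime"])
    booted = parse_ts(event["NodeBootTime"])
    task_start = parse_ts(event["StartTaskStartTime"])
    task_end = parse_ts(event["StartTaskEndTime"])
    return {
        "boot_seconds": (booted - allocated).total_seconds(),
        "start_task_wait_seconds": (task_start - booted).total_seconds(),
        "start_task_seconds": (task_end - task_start).total_seconds(),
    }


# Made-up record for illustration; real values come from your exported logs.
sample_event = {
    "NodeId": "node-0",
    "NodeAllocationTime": "2024-01-01T10:00:00Z",
    "NodeBootTime": "2024-01-01T10:02:30Z",
    "StartTaskStartTime": "2024-01-01T10:02:45Z",
    "StartTaskEndTime": "2024-01-01T10:04:00Z",
}
breakdown = node_startup_breakdown(sample_event)
print(breakdown)
```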

Metrics

The following tables list the platform metrics collected for Azure Machine Learning. All metrics are stored in the namespace Azure Machine Learning Workspace.

Model

| Metric | Unit | Description |
|---|---|---|
| Model deploy failed | Count | The number of model deployments that failed. |
| Model deploy started | Count | The number of model deployments started. |
| Model deploy succeeded | Count | The number of model deployments that succeeded. |
| Model register failed | Count | The number of model registrations that failed. |
| Model register succeeded | Count | The number of model registrations that succeeded. |

Quota

Quota information is for Azure Machine Learning compute only.

| Metric | Unit | Description |
|---|---|---|
| Active cores | Count | The number of active compute cores. |
| Active nodes | Count | The number of active nodes. |
| Idle cores | Count | The number of idle compute cores. |
| Idle nodes | Count | The number of idle compute nodes. |
| Leaving cores | Count | The number of leaving cores. |
| Leaving nodes | Count | The number of leaving nodes. |
| Preempted cores | Count | The number of preempted cores. |
| Preempted nodes | Count | The number of preempted nodes. |
| Quota utilization percentage | Percent | The percentage of quota used. |
| Total cores | Count | The total number of cores. |
| Total nodes | Count | The total number of nodes. |
| Unusable cores | Count | The number of unusable cores. |
| Unusable nodes | Count | The number of unusable nodes. |

The following dimensions can be used to filter quota metrics:

| Dimension | Metric(s) available with | Description |
|---|---|---|
| Cluster Name | All quota metrics | The name of the compute instance. |
| Vm Family Name | Quota utilization percentage | The name of the VM family used by the cluster. |
| Vm Priority | Quota utilization percentage | The priority of the VM. |

Resource

| Metric | Unit | Description |
|---|---|---|
| CpuUtilization | Percent | The percentage of CPU utilized on a given node during a run/job. This metric is published only when a job is running on a node. One job may use one or more nodes. This metric is published per node. |
| GpuUtilization | Percent | The percentage of GPU utilized on a given node during a run/job. One node can have one or more GPUs. This metric is published per GPU per node. |

The following dimensions can be used to filter resource metrics:

| Dimension | Description |
|---|---|
| CreatedTime | |
| DeviceId | ID of the device (GPU). Only available for GpuUtilization. |
| NodeId | ID of the node created where the job is running. |
| RunId | ID of the run/job. |
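Because GpuUtilization is published per GPU per node, comparing it with the per-node CpuUtilization usually means collapsing the DeviceId dimension first. The following minimal sketch averages GPU samples per run and node. The sample record layout (a dict with `RunId`, `NodeId`, `DeviceId`, and `value` keys) is an assumption made for illustration, not the format in which Azure Monitor returns metric data.

```python
from collections import defaultdict
from statistics import mean


def mean_gpu_utilization_by_node(samples):
    """Average per-GPU utilization samples into one value per (RunId, NodeId)."""
    per_node = defaultdict(list)
    for sample in samples:
        # Group on run and node, dropping the DeviceId dimension.
        per_node[(sample["RunId"], sample["NodeId"])].append(sample["value"])
    return {key: mean(values) for key, values in per_node.items()}


# Made-up samples: node-0 has two GPUs, node-1 has one.
samples = [
    {"RunId": "run-1", "NodeId": "node-0", "DeviceId": "0", "value": 80.0},
    {"RunId": "run-1", "NodeId": "node-0", "DeviceId": "1", "value": 60.0},
    {"RunId": "run-1", "NodeId": "node-1", "DeviceId": "0", "value": 40.0},
]
result = mean_gpu_utilization_by_node(samples)
print(result)
```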

Run

Information on training runs.

| Metric | Unit | Description |
|---|---|---|
| Completed runs | Count | The number of completed runs. |
| Failed runs | Count | The number of failed runs. |
| Started runs | Count | The number of started runs. |

The following dimensions can be used to filter run metrics:

| Dimension | Description |
|---|---|
| ComputeType | The compute type that the run used. |
| PipelineStepType | The type of PipelineStep used in the run. |
| PublishedPipelineId | The ID of the published pipeline used in the run. |
| RunType | The type of run. |

The valid values for the RunType dimension are:

| Value | Description |
|---|---|
| Experiment | Non-pipeline runs. |
| PipelineRun | A pipeline run, which is the parent of a StepRun. |
| StepRun | A run for a pipeline step. |
| ReusedStepRun | A run for a pipeline step that reuses a previous run. |
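When run metrics are split by the RunType dimension, it can help to validate the values and group the pipeline-related ones. The sketch below models the four valid values as an enum and counts a list of observed values. The class, function, and variable names are hypothetical, and the input list stands in for whatever dimension values your own tooling extracts.

```python
from collections import Counter
from enum import Enum


class RunType(Enum):
    """The valid RunType dimension values listed above."""
    EXPERIMENT = "Experiment"
    PIPELINE_RUN = "PipelineRun"
    STEP_RUN = "StepRun"
    REUSED_STEP_RUN = "ReusedStepRun"


# Every value except Experiment belongs to a pipeline.
PIPELINE_RELATED = {
    RunType.PIPELINE_RUN,
    RunType.STEP_RUN,
    RunType.REUSED_STEP_RUN,
}


def count_by_run_type(values):
    """Count runs per RunType; raises ValueError on an unknown value."""
    return Counter(RunType(value) for value in values)


# Made-up dimension values for illustration.
observed = ["Experiment", "StepRun", "ReusedStepRun", "StepRun", "PipelineRun"]
counts = count_by_run_type(observed)
pipeline_total = sum(counts[run_type] for run_type in PIPELINE_RELATED)
print(counts[RunType.STEP_RUN], pipeline_total)
```

Raising on unknown values is a deliberate choice in this sketch: it surfaces a new or misspelled RunType immediately instead of silently dropping it from the counts.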

See also