Batch metrics, alerts, and logs for diagnostic evaluation and monitoring

This article explains how to monitor a Batch account using features of Azure Monitor. Azure Monitor collects metrics and diagnostic logs for resources in your Batch account. You can collect and consume this data in a variety of ways to monitor your Batch account and diagnose issues. You can also configure metric alerts so you receive notifications when a metric reaches a specified value.

Batch metrics

Metrics are Azure telemetry data (also called performance counters) that your Azure resources emit and the Azure Monitor service consumes. Example metrics in a Batch account include Pool Create Events, Low-Priority Node Count, and Task Complete Events.

See the list of supported Batch metrics.

Metrics are:

  • Enabled by default in each Batch account without additional configuration
  • Generated every 1 minute
  • Not persisted automatically, but have a 30-day rolling history. You can persist activity metrics as part of diagnostic logging.

View metrics

View metrics for your Batch account in the Azure portal. By default, the Overview page for the account shows key node, core, and task metrics.

To view all Batch account metrics:

  1. In the portal, click All services > Batch accounts, and then click the name of your Batch account.
  2. Under Monitoring, click Metrics.
  3. Select one or more of the metrics. If you want, select additional resource metrics by using the Subscriptions, Resource group, Resource type, and Resource dropdowns.
    • For count-based metrics (like "Dedicated Core Count" or "Low-Priority Node Count"), use the "Average" aggregation. For event-based metrics (like "Pool Resize Complete Events"), use the "Count" aggregation.

Warning

Do not use the "Sum" aggregation, which adds up the values of all data points received over the period of the chart.

![Batch metrics](media/batch-diagnostics/metrics-portal.png)
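
As a rough illustration of why the choice of aggregation matters, the following Python sketch applies the three aggregations to made-up per-minute samples. The sample values and variable names are hypothetical and do not come from Azure Monitor; only the arithmetic is the point.

```python
from statistics import mean

# Hypothetical per-minute samples of a count-based metric such as
# "Dedicated Core Count": the account held 8 cores for five minutes.
core_count_samples = [8, 8, 8, 8, 8]

# "Average" reflects the number of cores actually in use.
average = mean(core_count_samples)

# "Sum" adds every data point, so it grows with the chart period
# and no longer describes a core count.
total = sum(core_count_samples)

# Hypothetical event-based metric: one data point per emitted event.
resize_complete_events = [1, 1, 1]

# "Count" gives the number of events seen in the period.
event_count = len(resize_complete_events)

print(f"average={average} sum={total} event_count={event_count}")
```

With these samples, the average stays at the real core count no matter how long the chart period is, while the sum keeps growing as more minutes are included.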

To retrieve metrics programmatically, use the Azure Monitor APIs. For an example, see Retrieve Azure Monitor metrics with .NET.

Batch metric reliability

Metrics are intended for trending and data analysis. Metric delivery is not guaranteed and is subject to out-of-order delivery, data loss, and/or duplication. Using single events to alert or trigger functions is not recommended. See the Batch metric alerts section for more details on how to set thresholds for alerting.

Metrics emitted in the last 3 minutes may still be aggregating. During this time frame, the metric values may be underreported.

Batch metric alerts

Optionally, configure near real-time metric alerts that trigger when the value of a specified metric crosses a threshold that you assign. The alert generates a notification you choose when the alert is "Activated" (when the threshold is crossed and the alert condition is met) as well as when it is "Resolved" (when the threshold is crossed again and the condition is no longer met). Alerting based on single data points is not recommended, because metrics are subject to out-of-order delivery, data loss, and/or duplication. Alerting should make use of thresholds to account for these inconsistencies.

For example, you might want to configure a metric alert when your low-priority core count falls to a certain level, so you can adjust the composition of your pools. It is recommended to set a period of 10 or more minutes, with the alert triggering if the average low-priority core count falls below the threshold value for the entire period. Alerting on a 1-5 minute period is not recommended, because metrics may still be aggregating.
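
The following Python sketch shows one way to reason about this recommendation. It is not how Azure Monitor evaluates alert rules. The function name `should_alert`, its parameters, and the choice to discard the newest three samples are assumptions made for illustration, based on the note that recent metrics may still be aggregating.

```python
from statistics import mean


def should_alert(per_minute_values, threshold, period_minutes=10, settle_minutes=3):
    """Return True if the average over a settled window is below the threshold.

    per_minute_values: hypothetical per-minute samples, oldest first.
    settle_minutes: newest samples to ignore because they may be underreported.
    """
    # Drop the newest samples, which may still be aggregating.
    if settle_minutes:
        settled = per_minute_values[:-settle_minutes]
    else:
        settled = list(per_minute_values)

    # Without a full period of settled data, do not alert.
    if len(settled) < period_minutes:
        return False

    # Average over the whole period rather than reacting to one data point.
    window = settled[-period_minutes:]
    return mean(window) < threshold


# Low counts appear only in the newest, unsettled minutes: no alert.
print(should_alert([20] * 10 + [2, 2, 2], threshold=10))

# The count stayed low for the whole settled 10-minute period: alert.
print(should_alert([4] * 10 + [20, 20, 20], threshold=10))
```

Averaging over a window of 10 or more minutes means that one missing, duplicated, or late data point has little effect on the result.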

To configure a metric alert in the portal:

  1. Click All services > Batch accounts, and then click the name of your Batch account.
  2. Under Monitoring, click Alert rules > Add metric alert.
  3. Select a metric, an alert condition (such as when a metric exceeds a particular value during a period), and one or more notifications.

You can also configure a near real-time alert using the REST API. For more information, see Alerts Overview. To include job, task, or pool-specific information in your alerts, see the information on search queries in Respond to events with Azure Monitor Alerts.

Batch diagnostics

Diagnostic logs contain information emitted by Azure resources that describes the operation of each resource. For Batch, you can collect the following logs:

  • Service Logs events emitted by the Azure Batch service during the lifetime of an individual Batch resource like a pool or task.

  • Metrics logs at the account level.

Collection of diagnostic logs is not enabled by default. Explicitly enable diagnostic settings for each Batch account you want to monitor.

Log destinations

A common scenario is to select an Azure Storage account as the log destination. To store logs in Azure Storage, create the account before enabling collection of logs. If you associated a storage account with your Batch account, you can choose that account as the log destination.

Other optional destinations for diagnostic logs:

  • Stream Batch diagnostic log events to an Azure Event Hub. Event Hubs can ingest millions of events per second, which you can then transform and store using any real-time analytics provider.

Note

You may incur additional costs to store or process diagnostic log data with Azure services.

Enable collection of Batch diagnostic logs

  1. In the portal, click All services > Batch accounts, and then click the name of your Batch account.

  2. Under Monitoring, click Diagnostic logs > Turn on diagnostics.

  3. In Diagnostic settings, enter a name for the setting, and choose a log destination (existing Storage account, Event Hub, or Azure Monitor logs). Select either or both ServiceLog and AllMetrics.

    When you select a storage account, optionally set a retention policy. If you don't specify a number of days for retention, data is retained during the life of the storage account.

  4. Click Save.


Other options to enable log collection include using Azure Monitor in the portal to configure diagnostic settings, using a Resource Manager template, or using Azure PowerShell or the Azure CLI. See Collect and consume log data from your Azure resources.

Access diagnostic logs in storage

If you archive Batch diagnostic logs in a storage account, a storage container is created in the storage account as soon as a related event occurs. Blobs are created according to the following naming pattern:

insights-{log category name}/resourceId=/SUBSCRIPTIONS/{subscription ID}/
RESOURCEGROUPS/{resource group name}/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/{Batch account name}/y={four-digit numeric year}/
m={two-digit numeric month}/d={two-digit numeric day}/
h={two-digit 24-hour clock hour}/m=00/PT1H.json

Example:

insights-metrics-pt1m/resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/
RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/MYBATCHACCOUNT/y=2018/m=03/d=05/h=22/m=00/PT1H.json
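
The naming pattern can be turned into a small helper for locating the blob for a given hour. The Python sketch below is a minimal illustration. The function name `diagnostic_blob_path` and its parameters are hypothetical. It assumes that the line breaks in the pattern above are only for display and that the resource identifiers appear in upper case, as they do in the example.

```python
from datetime import datetime, timezone


def diagnostic_blob_path(category, subscription_id, resource_group, account, hour_utc):
    """Build the blob path for one hour of diagnostic logs (illustrative only).

    category: log category name as it appears after "insights-".
    hour_utc: a datetime in UTC; only year, month, day, and hour are used.
    """
    return (
        f"insights-{category}/resourceId=/SUBSCRIPTIONS/{subscription_id.upper()}/"
        f"RESOURCEGROUPS/{resource_group.upper()}/PROVIDERS/MICROSOFT.BATCH/"
        f"BATCHACCOUNTS/{account.upper()}/"
        f"y={hour_utc:%Y}/m={hour_utc:%m}/d={hour_utc:%d}/"
        # The minute segment is always 00 because there is one blob per hour.
        f"h={hour_utc:%H}/m=00/PT1H.json"
    )


path = diagnostic_blob_path(
    "metrics-pt1m",
    "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "myresourcegroup",
    "mybatchaccount",
    datetime(2018, 3, 5, 22, tzinfo=timezone.utc),
)
print(path)
```

Called with the values from the example, the helper reproduces the example path.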

Each PT1H.json blob file contains JSON-formatted events that occurred within the hour specified in the blob URL (for example, h=12). During the present hour, events are appended to the PT1H.json file as they occur. The minute value (m=00) is always 00, since diagnostic log events are broken into individual blobs per hour. (All times are in UTC.)

Below is an example of a PoolResizeCompleteEvent entry in a PT1H.json log file. It includes information about the current and target number of dedicated and low-priority nodes, as well as the start and end time of the operation:

{ "Tenant": "65298bc2729a4c93b11c00ad7e660501", "time": "2019-08-22T20:59:13.5698778Z", "resourceId": "/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/BATCHACCOUNTS/MYBATCHACCOUNT/", "category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", "operationVersion": "2017-06-01", "properties": {"id":"MYPOOLID","nodeDeallocationOption":"Requeue","currentDedicatedNodes":10,"targetDedicatedNodes":100,"currentLowPriorityNodes":0,"targetLowPriorityNodes":0,"enableAutoScale":false,"isAutoPool":false,"startTime":"2019-08-22 20:50:59.522","endTime":"2019-08-22 20:59:12.489","resultCode":"Success","resultMessage":"The operation succeeded"}}

For more information about the schema of diagnostic logs in the storage account, see Archive Azure Diagnostic Logs. To access the logs in your storage account programmatically, use the Storage APIs.
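
Once an entry has been retrieved, it can be parsed with a standard JSON library. The Python sketch below works on a shortened copy of the PoolResizeCompleteEvent entry shown earlier (several fields are omitted for brevity) and computes how long the resize took. The function name `resize_summary` and the shape of its return value are hypothetical, and downloading the blob is outside the scope of the sketch.

```python
import json
from datetime import datetime

# Shortened copy of the sample entry; field names and values are from the sample.
entry = (
    '{"category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", '
    '"properties": {"id": "MYPOOLID", "currentDedicatedNodes": 10, '
    '"targetDedicatedNodes": 100, "startTime": "2019-08-22 20:50:59.522", '
    '"endTime": "2019-08-22 20:59:12.489", "resultCode": "Success"}}'
)


def resize_summary(raw_entry):
    """Return pool ID, result code, and duration for a resize-complete entry."""
    event = json.loads(raw_entry)
    if event.get("operationName") != "PoolResizeCompleteEvent":
        return None  # not the event type this sketch handles

    props = event["properties"]
    # The sample timestamps use a space separator and fractional seconds.
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    started = datetime.strptime(props["startTime"], fmt)
    ended = datetime.strptime(props["endTime"], fmt)

    return {
        "pool_id": props["id"],
        "result": props["resultCode"],
        "duration_seconds": (ended - started).total_seconds(),
    }


summary = resize_summary(entry)
print(summary)
```

For the sample entry, the computed duration is a little over eight minutes, which matches the start and end times in the log.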

Service Log events

Azure Batch Service Logs, if collected, contain events emitted by the Azure Batch service during the lifetime of an individual Batch resource like a pool or task. Each event emitted by Batch is logged in JSON format. For example, this is the body of a sample pool create event:

{
    "poolId": "myPool1",
    "displayName": "Production Pool",
    "vmSize": "Small",
    "cloudServiceConfiguration": {
        "osFamily": "5",
        "targetOsVersion": "*"
    },
    "networkConfiguration": {
        "subnetId": " "
    },
    "resizeTimeout": "300000",
    "targetDedicatedComputeNodes": 2,
    "maxTasksPerNode": 1,
    "vmFillType": "Spread",
    "enableAutoscale": false,
    "enableInterNodeCommunication": false,
    "isAutoPool": false
}

The Batch service currently emits the following Service Log events. This list may not be exhaustive, since additional events may have been added since this article was last updated.

Service Log events
  • Pool create
  • Pool delete start
  • Pool delete complete
  • Pool resize start
  • Pool resize complete
  • Task start
  • Task complete
  • Task fail

Next steps