Batch metrics, alerts, and logs for diagnostic evaluation and monitoring

This article explains how to monitor a Batch account using features of Azure Monitor. Azure Monitor collects metrics and diagnostic logs for resources in your Batch account. You can collect and consume this data in a variety of ways to monitor your Batch account and diagnose issues. You can also configure metric alerts so you receive notifications when a metric reaches a specified value.

Batch metrics

Metrics are Azure telemetry data (also called performance counters) that are emitted by your Azure resources and consumed by the Azure Monitor service. Examples of metrics in a Batch account are Pool Create Events and Task Complete Events.

See the list of supported Batch metrics.

Metrics are:

  • Enabled by default in each Batch account without additional configuration
  • Generated every 1 minute
  • Not persisted automatically, but available as a 30-day rolling history. You can persist activity metrics as part of diagnostic logging.

View Batch metrics

In the Azure portal, the Overview page for the account shows key node, core, and task metrics by default.

To view all Batch account metrics in the Azure portal:

  1. In the Azure portal, select All services > Batch accounts, and then select the name of your Batch account.

  2. Under Monitoring, select Metrics.

  3. Select Add metric and then choose a metric from the dropdown list.

  4. Select an Aggregation option for the metric. For count-based metrics (like "Dedicated Core Count"), use the "Average" aggregation. For event-based metrics (like "Pool Resize Complete Events"), use the "Count" aggregation.

    Warning

    Do not use the "Sum" aggregation, which adds up the values of all data points received over the period of the chart.

  5. To add additional metrics, repeat steps 3 and 4.
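The aggregation guidance in step 4 can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not how Azure Monitor computes its charts: the sample values, the `aggregate` helper, and the assumption that each event-based data point corresponds to one reported event are all hypothetical.

```python
from statistics import mean

# Hypothetical one-minute samples of a count-based metric
# (for example, a dedicated core count observed over five minutes).
core_count_samples = [8, 8, 16, 16, 16]

# Hypothetical data points of an event-based metric. The values are
# placeholders; for the "Count" aggregation only the number of points matters.
resize_event_samples = [1, 1, 1]


def aggregate(samples, kind):
    """Apply one of the aggregation kinds discussed above to a list of samples."""
    if kind == "average":
        return mean(samples)
    if kind == "count":
        return len(samples)
    if kind == "sum":
        return sum(samples)
    raise ValueError(f"unknown aggregation: {kind}")


# Average describes the typical core count over the period.
print(aggregate(core_count_samples, "average"))
# Sum adds every one-minute sample together, which overstates the core count.
print(aggregate(core_count_samples, "sum"))
# Count reports how many event data points arrived in the period.
print(aggregate(resize_event_samples, "count"))
```

With these samples, the average stays within the range of observed core counts, while the sum grows with the length of the chart period, which is why it is misleading for count-based metrics.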

You can also retrieve metrics programmatically with the Azure Monitor APIs. For an example, see Retrieve Azure Monitor metrics with .NET.

Batch metric reliability

Metrics can help identify trends and can be used for data analysis. It's important to note that metric delivery is not guaranteed, and may be subject to out-of-order delivery, data loss, and/or duplication. Because of this, using single events to alert or trigger functions is not recommended. See the next section for more details on how to set thresholds for alerting.

Metrics emitted in the last 3 minutes may still be aggregating, so metric values may be underreported during this timeframe.
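If you post-process metric data yourself, the caveats above suggest a small clean-up step before analysis. The following is a minimal sketch under stated assumptions: data points are represented as hypothetical `(timestamp, value)` tuples, a duplicate is taken to mean a repeated timestamp, and the `stable_points` helper is illustrative rather than part of any Azure SDK.

```python
from datetime import datetime, timedelta, timezone


def stable_points(points, now, settle=timedelta(minutes=3)):
    """Return deduplicated, time-ordered points older than the settle window.

    points: iterable of (timestamp, value) tuples, possibly out of order
            and possibly containing repeated timestamps.
    now:    the reference time used to compute the cutoff.
    """
    cutoff = now - settle
    unique = {}
    for timestamp, value in points:
        # Skip points that may still be aggregating.
        if timestamp <= cutoff:
            # A repeated timestamp overwrites the earlier copy.
            unique[timestamp] = value
    # Sorting restores chronological order after out-of-order delivery.
    return sorted(unique.items())


now = datetime(2019, 8, 22, 21, 0, tzinfo=timezone.utc)
raw = [
    (datetime(2019, 8, 22, 20, 55, tzinfo=timezone.utc), 4),
    (datetime(2019, 8, 22, 20, 53, tzinfo=timezone.utc), 2),  # out of order
    (datetime(2019, 8, 22, 20, 55, tzinfo=timezone.utc), 4),  # duplicate
    (datetime(2019, 8, 22, 20, 59, tzinfo=timezone.utc), 1),  # too recent
]
cleaned = stable_points(raw, now)
```

This handles ordering, duplication, and recent underreporting; it cannot recover data points that were lost in delivery.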

Batch metric alerts

You can configure near real-time metric alerts that trigger when the value of a specified metric crosses a threshold that you assign. The alert generates a notification when it is "Activated" (the threshold is crossed and the alert condition is met) and again when it is "Resolved" (the threshold is crossed again and the condition is no longer met).

Alerts that trigger on a single data point are not recommended, because metrics are subject to out-of-order delivery, data loss, and/or duplication. When creating your alerts, you can use thresholds to account for these inconsistencies.
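The reasoning behind avoiding single-data-point alerts can be sketched in Python. This is an illustration only and does not describe how Azure Monitor evaluates alert rules: the `window_breaches` helper, the five-point window, and the sample values are assumptions made for the example.

```python
from statistics import mean


def window_breaches(values, threshold, window=5):
    """Return True when the average of the last `window` values exceeds threshold.

    Returns False when fewer than `window` values are available, so a
    partially filled window never raises an alert.
    """
    recent = values[-window:]
    if len(recent) < window:
        return False
    return mean(recent) > threshold


# A single spike (for example, a duplicated data point) does not trigger.
print(window_breaches([0, 0, 0, 0, 10], threshold=5))
# A sustained breach across the window does.
print(window_breaches([6, 7, 8, 6, 9], threshold=5))
```

Evaluating an aggregate over several data points means that one lost, duplicated, or late point shifts the result only slightly instead of deciding it.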

To configure a metric alert in the Azure portal:

  1. Select All services > Batch accounts, and then select the name of your Batch account.
  2. Under Monitoring, select Alerts, then select New alert rule.
  3. Click Select condition, then choose a metric. Confirm the values for Chart period, Threshold type, Operator, and Aggregation type, and enter a Threshold value. Then select Done.
  4. Add an action group to the alert either by selecting an existing action group or creating a new action group.
  5. In the Alert rule details section, enter an Alert rule name and Description and select the Severity.
  6. Select Create alert rule.

For more information about creating metric alerts, see Understand how metric alerts work in Azure Monitor and Create, view, and manage metric alerts using Azure Monitor.

You can also configure a near real-time alert using the Azure Monitor REST API. For more information, see Overview of Alerts in Azure. To include job, task, or pool-specific information in your alerts, see the information on search queries in Respond to events with Azure Monitor Alerts.

Batch diagnostics

Diagnostic logs contain information emitted by Azure resources that describes the operation of each resource. For Batch, you can collect the following logs:

  • Service Logs events emitted by the Azure Batch service during the lifetime of an individual Batch resource like a pool or task.
  • Metrics logs at the account level.

Collection of diagnostic logs is not enabled by default. Explicitly enable diagnostic settings for each Batch account you want to monitor.

Log destinations

A common scenario is to select an Azure Storage account as the log destination. To store logs in Azure Storage, create the account before enabling collection of logs. If you associated a storage account with your Batch account, you can choose that account as the log destination.

Alternatively, you can:

  • Stream Batch diagnostic log events to an Azure Event Hub. Event Hubs can ingest millions of events per second, which you can then transform and store using any real-time analytics provider.
  • Send diagnostic logs to Azure Monitor logs, where you can analyze them or export them for analysis in Power BI or Excel.

Note

You may incur additional costs to store or process diagnostic log data with Azure services.

Enable collection of Batch diagnostic logs

To create a new diagnostic setting in the Azure portal, follow the steps below.

  1. In the Azure portal, select All services > Batch accounts, and then select the name of your Batch account.
  2. Under Monitoring, select Diagnostic settings.
  3. In Diagnostic settings, select Add diagnostic setting.
  4. Enter a name for the setting.
  5. Select a destination: Send to Log Analytics, Archive to a storage account, or Stream to an Event Hub. If you select a storage account, you can optionally set a retention policy. If you don't specify a number of days for retention, data is retained during the life of the storage account.
  6. Select ServiceLog, AllMetrics, or both.
  7. Select Save to create the diagnostic setting.

You can also configure diagnostic settings through Azure Monitor in the Azure portal, by using a Resource Manager template, or with Azure PowerShell or the Azure CLI. For more information, see Overview of Azure platform logs.

Access diagnostics logs in storage

If you archive Batch diagnostic logs in a storage account, a storage container is created in the storage account as soon as a related event occurs. Blobs are created according to the following naming pattern:

insights-{log category name}/resourceId=/SUBSCRIPTIONS/{subscription ID}/
RESOURCEGROUPS/{resource group name}/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/{Batch account name}/y={four-digit numeric year}/
m={two-digit numeric month}/d={two-digit numeric day}/
h={two-digit 24-hour clock hour}/m=00/PT1H.json

For example:

insights-metrics-pt1m/resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/
RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/MYBATCHACCOUNT/y=2018/m=03/d=05/h=22/m=00/PT1H.json

Each PT1H.json blob file contains JSON-formatted events that occurred within the hour specified in the blob URL (for example, h=12). During the present hour, events are appended to the PT1H.json file as they occur. The minute value (m=00) is always 00, since diagnostic log events are broken into individual blobs per hour. (All times are in UTC.)
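When locating a specific hour of logs programmatically, it can help to build the path from the naming pattern above. The following is a minimal sketch: the `diagnostic_blob_path` function and its parameter names are hypothetical, and uppercasing the subscription, resource group, and account names is an assumption based on the example path shown earlier.

```python
from datetime import datetime, timezone


def diagnostic_blob_path(category, subscription_id, resource_group,
                         account_name, hour_utc):
    """Build the container and blob path for one hour of diagnostic logs.

    category: log category name as it appears after "insights-"
              (for example, "metrics-pt1m").
    hour_utc: any datetime within the hour of interest, expressed in UTC.
    """
    return (
        f"insights-{category}/resourceId=/SUBSCRIPTIONS/{subscription_id.upper()}/"
        f"RESOURCEGROUPS/{resource_group.upper()}/PROVIDERS/MICROSOFT.BATCH/"
        f"BATCHACCOUNTS/{account_name.upper()}/"
        # The minute segment is fixed at 00 because blobs are written per hour.
        f"y={hour_utc:%Y}/m={hour_utc:%m}/d={hour_utc:%d}/h={hour_utc:%H}/m=00/PT1H.json"
    )


path = diagnostic_blob_path(
    "metrics-pt1m",
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "myresourcegroup",
    "mybatchaccount",
    datetime(2018, 3, 5, 22, 17, tzinfo=timezone.utc),
)
print(path)
```

Called with the values above, the function reproduces the example path for 22:00 UTC on March 5, 2018.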

Below is an example of a PoolResizeCompleteEvent entry in a PT1H.json log file. It includes information about the current and target number of dedicated and low-priority nodes, as well as the start and end time of the operation:

{ "Tenant": "65298bc2729a4c93b11c00ad7e660501", "time": "2019-08-22T20:59:13.5698778Z", "resourceId": "/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/BATCHACCOUNTS/MYBATCHACCOUNT/", "category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", "operationVersion": "2017-06-01", "properties": {"id":"MYPOOLID","nodeDeallocationOption":"Requeue","currentDedicatedNodes":10,"targetDedicatedNodes":100,"currentLowPriorityNodes":0,"targetLowPriorityNodes":0,"enableAutoScale":false,"isAutoPool":false,"startTime":"2019-08-22 20:50:59.522","endTime":"2019-08-22 20:59:12.489","resultCode":"Success","resultMessage":"The operation succeeded"}}
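An entry like this can be parsed with the Python standard library. The sketch below uses a shortened copy of the entry above; the `summarize_resize` helper and the keys of the dictionary it returns are hypothetical, and the timestamp format string is inferred from this one sample rather than from a published schema.

```python
import json
from datetime import datetime

# Shortened copy of the PoolResizeCompleteEvent entry shown above.
SAMPLE_ENTRY = (
    '{"category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", '
    '"properties": {"id": "MYPOOLID", "currentDedicatedNodes": 10, '
    '"targetDedicatedNodes": 100, "startTime": "2019-08-22 20:50:59.522", '
    '"endTime": "2019-08-22 20:59:12.489", "resultCode": "Success"}}'
)

# Format inferred from the sample startTime and endTime values.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize_resize(entry_text):
    """Extract node counts and the resize duration from one log entry."""
    entry = json.loads(entry_text)
    props = entry["properties"]
    start = datetime.strptime(props["startTime"], TIME_FORMAT)
    end = datetime.strptime(props["endTime"], TIME_FORMAT)
    return {
        "pool": props["id"],
        "current_dedicated": props["currentDedicatedNodes"],
        "target_dedicated": props["targetDedicatedNodes"],
        "duration_seconds": (end - start).total_seconds(),
        "succeeded": props["resultCode"] == "Success",
    }


summary = summarize_resize(SAMPLE_ENTRY)
print(summary)
```

For this sample, the resize operation took a little over eight minutes and ended with 10 of the 100 targeted dedicated nodes reported as current.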

For more information about the schema of diagnostic logs in the storage account, see Archive Azure resource logs to storage account. To access the logs in your storage account programmatically, use the Storage APIs.

Service log events

Azure Batch service logs, if collected, contain events emitted by the Azure Batch service during the lifetime of an individual Batch resource, such as a pool or task. Each event emitted by Batch is logged in JSON format. For example, this is the body of a sample pool create event:

{
    "poolId": "myPool1",
    "displayName": "Production Pool",
    "vmSize": "Small",
    "cloudServiceConfiguration": {
        "osFamily": "5",
        "targetOsVersion": "*"
    },
    "networkConfiguration": {
        "subnetId": " "
    },
    "resizeTimeout": "300000",
    "targetDedicatedComputeNodes": 2,
    "maxTasksPerNode": 1,
    "vmFillType": "Spread",
    "enableAutoscale": false,
    "enableInterNodeCommunication": false,
    "isAutoPool": false
}

Service log events emitted by the Batch service include the following:

Next steps