Alert Management solution in Azure Log Analytics


Note

Azure Monitor now supports enhanced capabilities for managing your alerts at scale, including those generated by monitoring tools such as SCOM, Zabbix, or Nagios.

The Alert Management solution helps you analyze all of the alerts in your Log Analytics repository. These alerts can come from a variety of sources, including alerts created by Log Analytics or imported from Nagios or Zabbix.

Prerequisites

The solution works with any record in the Log Analytics repository that has a type of Alert, so you must perform whatever configuration is required to collect these records.

Configuration

Add the Alert Management solution to your Log Analytics workspace using the process described in Add solutions. No further configuration is required.

Management packs

If your System Center Operations Manager management group is connected to your Log Analytics workspace, the following management packs are installed in System Center Operations Manager when you add this solution. The management packs require no configuration or maintenance.

  • Azure System Center Advisor Alert Management (Microsoft.IntelligencePacks.AlertManagement)

Data collection

Agents

The following table describes the connected sources that are supported by this solution.

Connected source | Support | Description
Windows agents | No | Direct Windows agents do not generate alerts. Log Analytics alerts can be created from events and performance data collected from Windows agents.
Linux agents | No | Direct Linux agents do not generate alerts. Log Analytics alerts can be created from events and performance data collected from Linux agents. Nagios and Zabbix alerts are collected from servers that require the Linux agent.

Collection frequency

  • Alert records are available to the solution as soon as they are stored in the repository.
  • Alert data is sent from the Operations Manager management group to Log Analytics every three minutes.

Using the solution

When you add the Alert Management solution to your Log Analytics workspace, the Alert Management tile is added to your dashboard. This tile displays a count and a graphical representation of the currently active alerts that were generated within the last 24 hours. You cannot change this time range.

Figure: Alert Management tile

Click the Alert Management tile to open the Alert Management dashboard. The dashboard includes the columns in the following table. Each column lists the top 10 alerts by count that match that column's criteria for the specified scope and time range. To run a log search that returns the entire list, click See all at the bottom of the column or click the column header.

Column | Description
Critical Alerts | All alerts with a severity of Critical, grouped by alert name. Click an alert name to run a log search that returns all records for that alert.
Warning Alerts | All alerts with a severity of Warning, grouped by alert name. Click an alert name to run a log search that returns all records for that alert.
Active SCOM Alerts | All alerts collected from Operations Manager with any state other than Closed, grouped by the source that generated the alert.
All Active Alerts | All alerts with any severity, grouped by alert name. Only includes Operations Manager alerts with any state other than Closed.
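
The grouping behind these columns can be sketched in a few lines of Python. This is only an illustration of "top 10 by count, grouped by alert name", not how the dashboard is implemented. The sample records are hypothetical, and the severity value "error" for critical alerts is an assumption taken from the sample log searches in this article.

```python
from collections import Counter

# Hypothetical sample records; field names follow the Alert record properties.
SAMPLE_ALERTS = [
    {"AlertName": "Disk space low", "AlertSeverity": "error", "AlertState": "New"},
    {"AlertName": "Disk space low", "AlertSeverity": "error", "AlertState": "New"},
    {"AlertName": "Service stopped", "AlertSeverity": "error", "AlertState": "Closed"},
    {"AlertName": "High CPU", "AlertSeverity": "warning", "AlertState": "New"},
]


def top_alerts(records, severity, limit=10):
    """Return up to `limit` (alert name, count) pairs for one severity, highest count first."""
    counts = Counter(r["AlertName"] for r in records if r["AlertSeverity"] == severity)
    return counts.most_common(limit)


def active_alerts(records):
    """Keep records whose state is anything other than Closed."""
    return [r for r in records if r["AlertState"] != "Closed"]


if __name__ == "__main__":
    print(top_alerts(SAMPLE_ALERTS, "error"))
    print(len(active_alerts(SAMPLE_ALERTS)))
```

With the sample data, `top_alerts(SAMPLE_ALERTS, "error")` corresponds to the Critical Alerts column, and `active_alerts` applies the "any state other than Closed" rule used by the two active-alert columns.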

If you scroll to the right, the dashboard lists several common queries that you can click to run a log search for alert data.

Figure: Alert Management dashboard

Log Analytics records

The Alert Management solution analyzes any record with a type of Alert. Alerts created by Log Analytics or collected from Nagios or Zabbix are not directly collected by the solution.

The solution does import alerts from System Center Operations Manager and creates a corresponding record for each one, with a type of Alert and a SourceSystem of OpsManager. These records have the properties in the following table:

Property | Description
Type | Alert
SourceSystem | OpsManager
AlertContext | Details, in XML format, of the data item that caused the alert to be generated.
AlertDescription | Detailed description of the alert.
AlertId | GUID of the alert.
AlertName | Name of the alert.
AlertPriority | Priority level of the alert.
AlertSeverity | Severity level of the alert.
AlertState | Latest resolution state of the alert.
LastModifiedBy | Name of the user who last modified the alert.
ManagementGroupName | Name of the management group where the alert was generated.
RepeatCount | Number of times the same alert was generated for the same monitored object since being resolved.
ResolvedBy | Name of the user who resolved the alert. Empty if the alert has not yet been resolved.
SourceDisplayName | Display name of the monitoring object that generated the alert.
SourceFullName | Full name of the monitoring object that generated the alert.
TicketId | Ticket ID for the alert if the System Center Operations Manager environment is integrated with a process for assigning tickets for alerts. Empty if no ticket ID is assigned.
TimeGenerated | Date and time that the alert was created.
TimeLastModified | Date and time that the alert was last changed.
TimeRaised | Date and time that the alert was generated.
TimeResolved | Date and time that the alert was resolved. Empty if the alert has not yet been resolved.
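
For readers who process these records outside the portal, the properties above can be modeled as a small typed record. The following is a minimal sketch under stated assumptions: the class `OpsManagerAlert` and the helper `from_row` are hypothetical names, only a subset of the properties is included, and the input is assumed to be a dictionary in which unset properties arrive as empty strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpsManagerAlert:
    """Subset of the Alert record properties listed above (hypothetical model)."""
    AlertId: str
    AlertName: str
    AlertSeverity: str
    AlertState: str
    RepeatCount: int = 0
    ResolvedBy: Optional[str] = None
    TicketId: Optional[str] = None
    Type: str = "Alert"
    SourceSystem: str = "OpsManager"

    @property
    def is_resolved(self) -> bool:
        # ResolvedBy stays empty until a user resolves the alert.
        return bool(self.ResolvedBy)


def from_row(row: dict) -> OpsManagerAlert:
    """Build a record from a result row, mapping empty strings to None."""
    return OpsManagerAlert(
        AlertId=row["AlertId"],
        AlertName=row["AlertName"],
        AlertSeverity=row["AlertSeverity"],
        AlertState=row["AlertState"],
        RepeatCount=int(row.get("RepeatCount") or 0),
        ResolvedBy=row.get("ResolvedBy") or None,
        TicketId=row.get("TicketId") or None,
    )
```

Treating empty `ResolvedBy` and `TicketId` values as `None` keeps the "empty if not yet resolved" and "empty if no ticket ID is assigned" cases explicit in downstream code.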

Sample log searches

The following sample log searches return alert records collected by this solution:

  • Critical alerts raised during the past 24 hours:
    Alert | where SourceSystem == "OpsManager" and AlertSeverity == "error" and TimeRaised > ago(24h)
  • Warning alerts raised during the past 24 hours:
    Alert | where AlertSeverity == "warning" and TimeRaised > ago(24h)
  • Sources with active alerts raised during the past 24 hours:
    Alert | where SourceSystem == "OpsManager" and AlertState != "Closed" and TimeRaised > ago(24h) | summarize Count = count() by SourceDisplayName
  • Critical alerts raised during the past 24 hours that are still active:
    Alert | where SourceSystem == "OpsManager" and AlertSeverity == "error" and TimeRaised > ago(24h) and AlertState != "Closed"
  • Alerts raised during the past 24 hours that are now closed:
    Alert | where SourceSystem == "OpsManager" and TimeRaised > ago(24h) and AlertState == "Closed"
  • Alerts raised during the past day, grouped by severity:
    Alert | where SourceSystem == "OpsManager" and TimeRaised > ago(1d) | summarize Count = count() by AlertSeverity
  • Alerts raised during the past day, sorted by repeat count:
    Alert | where SourceSystem == "OpsManager" and TimeRaised > ago(1d) | sort by RepeatCount desc
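
The filter-only searches above share one pattern: an optional source filter, an optional severity, a time window on TimeRaised, and an optional state condition. If you generate such queries from a script, that pattern could be captured as follows. This is a sketch, not part of the solution; `build_alert_query` and its parameters are hypothetical, and the function only produces the query text without sending it to any service.

```python
def build_alert_query(severity=None, hours=24, state_filter=None, ops_manager_only=True):
    """Compose a query string in the style of the sample searches.

    state_filter is None, "active" (AlertState != "Closed"), or "closed".
    severity is inserted as-is, so pass only trusted values such as "error" or "warning".
    """
    conditions = []
    if ops_manager_only:
        conditions.append('SourceSystem == "OpsManager"')
    if severity is not None:
        conditions.append(f'AlertSeverity == "{severity}"')
    conditions.append(f"TimeRaised > ago({hours}h)")
    if state_filter == "active":
        conditions.append('AlertState != "Closed"')
    elif state_filter == "closed":
        conditions.append('AlertState == "Closed"')
    elif state_filter is not None:
        raise ValueError(f"unknown state_filter: {state_filter!r}")
    return "Alert | where " + " and ".join(conditions)


if __name__ == "__main__":
    # Critical alerts from the past 24 hours that are still active.
    print(build_alert_query(severity="error", state_filter="active"))
```

Keeping the conditions in a list makes it easy to append a `summarize` or `sort` stage to the returned string when a grouped or ordered result is needed.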

Next steps