How to monitor cluster availability with Azure Monitor logs in HDInsight

HDInsight clusters include Azure Monitor logs integration, which provides queryable metrics and logs as well as configurable alerts. This article shows how to use Azure Monitor to monitor your cluster.

Azure Monitor logs integration

Azure Monitor logs collect and aggregate data generated by multiple resources, such as HDInsight clusters, in one place to provide a unified monitoring experience.

As a prerequisite, you need a Log Analytics workspace to store the collected data. If you haven't already created one, you can follow the instructions here: Create a Log Analytics Workspace.

Enable HDInsight Azure Monitor logs integration

From the HDInsight cluster resource page in the portal, select Azure Monitor. Then select Enable and choose your Log Analytics workspace from the drop-down list.

HDInsight Operations Management Suite

By default, this installs the OMS agent on all of the cluster nodes except for edge nodes. Because no OMS agent is installed on cluster edge nodes, Log Analytics contains no telemetry for edge nodes by default.

Query metrics and logs tables

Once Azure Monitor logs integration is enabled (this may take a few minutes), navigate to your Log Analytics workspace resource and select Logs.

Log Analytics workspace logs

The Logs page lists several sample queries, such as:

| Query name | Description |
| --- | --- |
| Computers availability today | Chart the number of computers sending logs, each hour |
| List heartbeats | List all computer heartbeats from the last hour |
| Last heartbeat of each computer | Show the last heartbeat sent by each computer |
| Unavailable computers | List all known computers that didn't send a heartbeat in the last 5 hours |
| Availability rate | Calculate the availability rate of each connected computer |

As an example, run the Availability rate sample query by selecting Run on that query, as shown in the screenshot above. This shows the availability rate of each node in your cluster as a percentage. If you have enabled multiple HDInsight clusters to send metrics to the same Log Analytics workspace, you'll see the availability rate for all nodes (excluding edge nodes) in those clusters.

Availability rate sample query in Log Analytics workspace logs

Note

Availability rate is measured over a 24-hour period, so your cluster needs to run for at least 24 hours before you see accurate availability rates.
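To make the note above concrete, here is a minimal Python sketch of one plausible way an availability rate could be computed. It assumes the rate is the percentage of hourly buckets in the last 24 hours that contain at least one heartbeat. The function name `availability_rate`, the bucket size, and the sample timestamps are illustrative assumptions, not the definition of the portal's sample query.

```python
from datetime import datetime, timedelta


def availability_rate(heartbeats, now, window_hours=24):
    """Percentage of hourly buckets in the window holding at least one heartbeat.

    Assumption: this bucket-based model is illustrative only and is not taken
    from the Availability rate sample query itself.
    """
    window_start = now - timedelta(hours=window_hours)
    buckets_seen = set()
    for ts in heartbeats:
        if window_start <= ts < now:
            # Index of the hourly bucket this heartbeat falls into.
            buckets_seen.add((ts - window_start) // timedelta(hours=1))
    return 100.0 * len(buckets_seen) / window_hours


# A hypothetical node that has only been running for the last 6 hours,
# sending a heartbeat every 10 minutes.
now = datetime(2024, 1, 2)
young_node = [now - timedelta(minutes=10 * i) for i in range(1, 37)]
print(availability_rate(young_node, now))  # 25.0
```

Under this model, a healthy node that has been up for only 6 hours reports 25%, which is why a cluster younger than 24 hours can show a misleadingly low rate.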

You can pin this table to a shared dashboard by clicking Pin in the upper-right corner. If you don't have any writable shared dashboards, you can see how to create one here: Create and share dashboards in the Azure portal.

Azure Monitor alerts

You can also set up Azure Monitor alerts that trigger when the value of a metric or the results of a query meet certain conditions. As an example, let's create an alert that sends an email when one or more nodes haven't sent a heartbeat in 5 hours (that is, they are presumed to be unavailable).

From Logs, run the Unavailable computers sample query by selecting Run on that query, as shown below.

Unavailable computers sample query in Log Analytics workspace logs

If all nodes are available, this query should return zero results for now. Click New alert rule to begin configuring your alert for this query.

Log Analytics workspace - New alert rule
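The logic behind the Unavailable computers query can be sketched in Python as follows: take the most recent heartbeat per computer and keep the computers whose last heartbeat is older than 5 hours. This is a local illustration over in-memory data, not the query the portal runs. The function name `unavailable_computers` and the node names are hypothetical.

```python
from datetime import datetime, timedelta


def unavailable_computers(heartbeats, now, threshold=timedelta(hours=5)):
    """Return known computers whose latest heartbeat is older than the threshold.

    heartbeats: iterable of (computer_name, timestamp) pairs.
    """
    last_seen = {}
    for computer, ts in heartbeats:
        # Track the most recent heartbeat for each computer.
        if computer not in last_seen or ts > last_seen[computer]:
            last_seen[computer] = ts
    cutoff = now - threshold
    return sorted(c for c, ts in last_seen.items() if ts < cutoff)


# Hypothetical node names and timestamps, for illustration only.
now = datetime(2024, 1, 2, 12, 0)
sample = [
    ("hn0-example", now - timedelta(minutes=1)),
    ("wn0-example", now - timedelta(hours=6)),
    ("wn1-example", now - timedelta(hours=4)),
]
print(unavailable_computers(sample, now))  # ['wn0-example']
```

Only computers that have sent at least one heartbeat are "known" in this model, so a node that never reported would not appear in the result.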

There are three components to an alert: the resource for which to create the rule (the Log Analytics workspace in this case), the condition that triggers the alert, and the action groups that determine what happens when the alert is triggered. Click the condition title, as shown below, to finish configuring the signal logic.

Portal alert - Create rule condition

This opens Configure signal logic.

Set the Alert logic section as follows:

Based on: Number of results, Condition: Greater than, Threshold: 0.

Because this query returns only unavailable nodes as results, the alert should fire whenever the number of results is greater than 0.

In the Evaluated based on section, set the Period and Frequency based on how often you want to check for unavailable nodes.

For the purpose of this alert, make sure that Period equals Frequency. More information about period, frequency, and other alert parameters can be found here.
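The alert logic and the Period equals Frequency recommendation can be illustrated with a small Python sketch. It uses a simplified model in which each evaluation runs every `frequency_min` minutes and examines the preceding `period_min` minutes of data. The helper names `evaluation_windows` and `alert_fires` are hypothetical, and actual Azure Monitor scheduling may differ from this model.

```python
def evaluation_windows(frequency_min, period_min, runs):
    """(start, end) minute offsets of the data window examined by each run.

    Simplified assumption: run k happens at minute k * frequency_min and
    looks back period_min minutes.
    """
    windows = []
    for k in range(1, runs + 1):
        end = k * frequency_min
        windows.append((end - period_min, end))
    return windows


def alert_fires(result_count, threshold=0):
    """Alert logic from the text: Number of results, Greater than, Threshold 0."""
    return result_count > threshold


# Period equal to Frequency: consecutive windows tile the timeline exactly.
print(evaluation_windows(30, 30, 3))  # [(0, 30), (30, 60), (60, 90)]
# Period longer than Frequency: windows overlap, so data is examined twice.
print(evaluation_windows(30, 60, 3))  # [(-30, 30), (0, 60), (30, 90)]

print(alert_fires(0))  # False: every node has reported recently
print(alert_fires(2))  # True: two nodes were returned as unavailable
```

In this model, equal values mean each stretch of time is covered by exactly one evaluation, with no gaps and no overlap.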

Select Done when you're finished configuring the signal logic.

Alert rule - Configure signal logic

If you don't already have an action group, click Create New under the Action Groups section.

Alert rule - Create new action group

This opens Add action group. Choose an Action group name, Short name, Subscription, and Resource group. Under the Actions section, choose an Action Name and select Email/SMS/Push/Voice as the Action Type.

Note

Besides Email/SMS/Push/Voice, an alert can trigger several other actions, such as an Azure Function, LogicApp, Webhook, ITSM, and Automation Runbook. Learn More.

This opens Email/SMS/Push/Voice. Choose a Name for the recipient, check the Email box, and type the email address to which you want the alert sent. Select OK in Email/SMS/Push/Voice, and then select OK in Add action group to finish configuring your action group.

Create alert rule - Add action group

After these blades close, you should see your action group listed under the Action Groups section. Finally, complete the Alert Details section by typing an Alert Rule Name and Description and choosing a Severity. Click Create Alert Rule to finish.

Portal - Finish creating alert rule

Tip

The ability to specify Severity is a powerful tool when you create multiple alerts. For example, you could create one alert that raises a Warning (Sev 1) if a single head node goes down and another alert that raises Critical (Sev 0) in the unlikely event that both head nodes go down.
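The two-alert scheme in the tip can be sketched as a simple decision rule in Python. In the portal this would be two separate alert rules, each with its own query and severity. The single function below only illustrates the mapping. The function name `head_node_severity` and the head node names are hypothetical.

```python
def head_node_severity(unavailable, head_nodes):
    """Map the set of unavailable head nodes to a severity from the tip.

    Returns 0 (Critical) if every head node is down, 1 (Warning) if only
    some are down, and None if all head nodes are available.
    """
    head_nodes = set(head_nodes)
    down = set(unavailable) & head_nodes
    if head_nodes and down == head_nodes:
        return 0
    if down:
        return 1
    return None


# Hypothetical head node names, for illustration only.
heads = ["hn0-example", "hn1-example"]
print(head_node_severity(["hn0-example"], heads))                 # 1
print(head_node_severity(["hn0-example", "hn1-example"], heads))  # 0
print(head_node_severity(["wn3-example"], heads))                 # None
```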

When the condition for this alert is met, the alert fires and you receive an email with the alert details, like this:

Azure Monitor alert email example

You can also view all alerts that have fired, grouped by severity, by going to Alerts in your Log Analytics workspace.

Log Analytics workspace alerts

Selecting a severity grouping (for example, Sev 1, as highlighted above) shows records for all fired alerts of that severity, as shown below:

Log Analytics workspace Sev 1 alerts

Next steps