Scenario: Apache Ambari stale alerts in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.

Issue

In the Apache Ambari UI, you might see an alert like this:

(Image: example of an Apache Ambari stale alert)

Cause

Ambari agents continuously monitor the health of many resources. Alerts can be configured to notify you whether specific cluster properties are within predetermined thresholds. After each resource check runs, if the alert condition is met, Ambari agents report the status back to the Ambari server and trigger an alert. If an alert isn't checked according to the interval in its Alert Profile, the server triggers an Ambari Server Stale Alerts alert.

There are various reasons why a health check might not run at its defined interval:

  • The hosts are under heavy use (high CPU usage), so the Ambari agent can't get enough system resources to run the alerts on time.

  • The cluster is busy executing many jobs or services during a period of heavy load.

  • A small number of hosts in the cluster are hosting many components and so are required to run many alerts. If the number of components is large, alert jobs might miss their scheduled intervals.

Resolution

Try the following methods to resolve problems with Ambari stale alerts.

Increase the alert interval time

You can increase the value of an individual alert interval, based on your cluster's response time and load:

  1. In the Apache Ambari UI, select the Alerts tab.
  2. Select the alert definition name that you want.
  3. From the definition, select Edit.
  4. Increase the Check Interval value, and then select Save.
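The same change can, in principle, be scripted against the Ambari REST API instead of the UI. The following is a minimal sketch, not a verified HDInsight procedure. It assumes the alert definition is exposed at `/api/v1/clusters/<cluster>/alert_definitions/<id>`, that its `interval` property is expressed in minutes, and that basic authentication with the cluster login is accepted. The cluster name, definition ID, and credentials shown are hypothetical placeholders. The function only builds the request so that you can inspect it before sending it.

```python
import base64
import json
import urllib.request


def build_interval_request(base_url, cluster, definition_id, interval_minutes,
                           user, password):
    """Build (but do not send) a PUT request that sets an alert's check interval.

    Assumption: the Ambari REST API accepts {"AlertDefinition": {"interval": N}}
    on the alert definition resource, with N in minutes.
    """
    url = f"{base_url}/api/v1/clusters/{cluster}/alert_definitions/{definition_id}"
    body = json.dumps({"AlertDefinition": {"interval": interval_minutes}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "X-Requested-By": "ambari",  # Ambari expects this header on modifying calls
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


if __name__ == "__main__":
    # Hypothetical values: replace with your own cluster, definition ID, and login.
    req = build_interval_request(
        "https://mycluster.azurehdinsight.net", "mycluster", 42, 10, "admin", "secret"
    )
    print(req.get_method(), req.full_url)
    # To apply the change, send it: urllib.request.urlopen(req)
```

To find the numeric ID of a definition, a GET request on the `alert_definitions` collection of the same cluster should list the definitions with their names and IDs.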

Increase the alert interval time for Ambari Server Alerts

  1. In the Apache Ambari UI, select the Alerts tab.
  2. From the Groups drop-down list, select AMBARI Default.
  3. Select the Ambari Server Alerts alert.
  4. From the definition, select Edit.
  5. Increase the Check Interval value.
  6. Increase the Interval Multiplier value, and then select Save.
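To see why raising the Interval Multiplier helps, it can be useful to model the staleness check. The sketch below is illustrative only. It assumes the server considers an alert stale once the time since its last check exceeds that alert's own check interval multiplied by the Interval Multiplier. This article does not state the exact rule, so treat the formula and the function name `is_stale` as assumptions.

```python
from datetime import datetime, timedelta


def is_stale(last_check, now, interval_minutes, multiplier):
    """Return True if an alert would count as stale under the assumed rule.

    Assumed rule (not confirmed by this article):
        stale when (now - last_check) > interval_minutes * multiplier
    """
    allowed_gap = timedelta(minutes=interval_minutes * multiplier)
    return now - last_check > allowed_gap


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    last_check = now - timedelta(minutes=12)  # the check last ran 12 minutes ago

    # With a 5-minute interval, a multiplier of 2 tolerates a 10-minute gap,
    # while a multiplier of 3 tolerates a 15-minute gap.
    print(is_stale(last_check, now, interval_minutes=5, multiplier=2))
    print(is_stale(last_check, now, interval_minutes=5, multiplier=3))
```

Under this assumption, a larger multiplier gives a busy host more time to complete a delayed check before the server flags it.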

Disable and reenable the alert

To discard a stale alert, disable and then reenable it:

  1. In the Apache Ambari UI, select the Alerts tab.
  2. Select the alert definition name that you want.
  3. From the definition, select Enabled on the far right part of the UI.
  4. In the Confirmation pop-up window, select Confirm Disable.
  5. Wait a few seconds for all the alert "instances" shown on the page to be cleared.
  6. From the definition, select Disabled on the far right part of the UI.
  7. In the Confirmation pop-up window, select Confirm Enable.
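The disable and reenable sequence above could also be scripted. The following sketch assumes the Ambari REST API lets you toggle an alert definition through its `enabled` property at `/api/v1/clusters/<cluster>/alert_definitions/<id>`. The cluster name, definition ID, and credentials are hypothetical placeholders, and the function returns unsent requests so that nothing changes until you send them deliberately.

```python
import base64
import json
import urllib.request


def build_enabled_request(base_url, cluster, definition_id, enabled, user, password):
    """Build (but do not send) a PUT request that enables or disables an alert.

    Assumption: the alert definition resource accepts
    {"AlertDefinition": {"enabled": true|false}}.
    """
    url = f"{base_url}/api/v1/clusters/{cluster}/alert_definitions/{definition_id}"
    body = json.dumps({"AlertDefinition": {"enabled": enabled}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "X-Requested-By": "ambari",  # Ambari expects this header on modifying calls
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


def build_reset_requests(base_url, cluster, definition_id, user, password):
    """Return the two requests, in order: disable first, then enable."""
    return [
        build_enabled_request(base_url, cluster, definition_id, False, user, password),
        build_enabled_request(base_url, cluster, definition_id, True, user, password),
    ]


if __name__ == "__main__":
    # Hypothetical values: replace with your own cluster, definition ID, and login.
    for req in build_reset_requests(
        "https://mycluster.azurehdinsight.net", "mycluster", 42, "admin", "secret"
    ):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

When sending these requests, pause for a few seconds between the first and the second (for example with `time.sleep`) so that the existing alert instances have time to clear, as in step 5.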

Increase the alert grace period

There's a grace period before an Ambari agent reports that a configured alert missed its schedule. If the alert missed its scheduled time but ran within the grace period, the stale alert isn't generated.

The default alert_grace_period value is 5 seconds. You can configure this setting in /etc/ambari-agent/conf/ambari-agent.ini. For hosts on which stale alerts occur at regular intervals, try increasing the value to 10 seconds. Then, restart the Ambari agent.
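If you need to make this change on several hosts, a small script can rewrite the setting. The sketch below is one possible approach and assumes that `alert_grace_period` already appears in the file as a `key=value` line. The `[agent]` section name in the sample text is an assumption for illustration. The function works on the file contents as a string and leaves every other line unchanged.

```python
import re


def set_alert_grace_period(ini_text, seconds):
    """Return ini_text with the alert_grace_period value replaced by `seconds`.

    Raises ValueError if the key is not present, so that a missing setting
    is noticed rather than silently ignored.
    """
    pattern = re.compile(r"^([ \t]*alert_grace_period[ \t]*=[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{seconds}", ini_text)
    if count == 0:
        raise ValueError("alert_grace_period not found in the supplied text")
    return new_text


if __name__ == "__main__":
    # Sample contents for illustration; the section name is an assumption.
    sample = "[agent]\nlogdir=/var/log/ambari-agent\nalert_grace_period=5\n"
    print(set_alert_grace_period(sample, 10))
```

On a real host you would read /etc/ambari-agent/conf/ambari-agent.ini with elevated privileges, pass its contents to the function, write the result back, and then restart the Ambari agent. Keep a backup copy of the original file before overwriting it.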

Next steps

If your problem wasn't mentioned here or you're unable to solve it, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub. For more detailed information, review How to create an Azure support request. Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the Azure Support Plans.