Scenario: Apache Ambari stale alerts in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.

Issue

In the Apache Ambari UI, you might see an alert like this:

(Screenshot: Apache Ambari stale alert example)

Cause

Ambari agents continuously monitor the health of many resources. Alerts can be configured to notify you whether specific cluster properties are within predetermined thresholds. After each resource check runs, if the alert condition is met, the Ambari agent reports the status back to the Ambari server and triggers an alert. If a check doesn't run within the interval defined in its alert profile, the server triggers an Ambari Server Stale Alerts alert.

There are various reasons why a health check might not run at its defined interval:

  • The hosts are under heavy use (high CPU usage), so the Ambari agent can't get enough system resources to run the alert checks on time.

  • The cluster is busy executing many jobs or services during a period of heavy load.

  • A few hosts in the cluster are hosting many components and so are required to run many alerts. If the number of components is large, alert jobs might miss their scheduled intervals.

Resolution

Try the following methods to resolve problems with Ambari stale alerts.

Increase the alert interval time

You can increase the value of an individual alert interval, based on your cluster's response time and load:

  1. In the Apache Ambari UI, select the Alerts tab.
  2. Select the alert definition name that you want.
  3. From the definition, select Edit.
  4. Increase the Check Interval value, and then select Save.
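
The same change can, in principle, be scripted instead of made through the UI. The following is a minimal sketch, assuming that the Ambari REST API exposes alert definitions at /api/v1/clusters/<cluster>/alert_definitions/<id> and accepts a partial update of the interval field (in minutes). The base URL, cluster name, definition ID, and credentials are placeholders, and build_interval_update is a hypothetical helper name. Verify the request against your cluster before you rely on it.

```python
import base64
import json
import urllib.request


def build_interval_update(base_url, cluster, definition_id,
                          interval_minutes, user, password):
    """Build (but do not send) a PUT request that sets an alert's check interval.

    Assumption: the alert definition resource accepts a partial body of the
    form {"AlertDefinition": {"interval": <minutes>}}.
    """
    url = (f"{base_url}/api/v1/clusters/{cluster}"
           f"/alert_definitions/{definition_id}")
    body = json.dumps(
        {"AlertDefinition": {"interval": interval_minutes}}
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        # Ambari rejects modifying requests that lack this header.
        "X-Requested-By": "ambari",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)


# Placeholder values for illustration only.
request = build_interval_update(
    "https://CLUSTERNAME.azurehdinsight.net", "CLUSTERNAME", 42, 5,
    "admin", "PASSWORD")
# To send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything changes on the cluster.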

Increase the alert interval time for Ambari Server Alerts

  1. In the Apache Ambari UI, select the Alerts tab.
  2. From the Groups drop-down list, select AMBARI Default.
  3. Select the Ambari Server Alerts alert.
  4. From the definition, select Edit.
  5. Increase the Check Interval value.
  6. Increase the Interval Multiplier value, and then select Save.
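
To see why raising these two values reduces stale alerts, consider a simplified model of the staleness rule. This sketch assumes that the server treats an alert as stale when the time since its last check exceeds the check interval multiplied by the interval multiplier. The default multiplier of 2 and the names AlertCheck and is_stale are assumptions made for illustration; the actual server logic might differ.

```python
from dataclasses import dataclass


@dataclass
class AlertCheck:
    name: str
    interval_minutes: float          # the alert's Check Interval
    minutes_since_last_check: float  # time since the check last reported


def is_stale(check: AlertCheck, interval_multiplier: float = 2.0) -> bool:
    """Return True if the check is overdue under the assumed rule."""
    allowed = check.interval_minutes * interval_multiplier
    return check.minutes_since_last_check > allowed


late = AlertCheck("example_alert", interval_minutes=1, minutes_since_last_check=3)
print(is_stale(late))                         # overdue: 3 > 1 * 2
print(is_stale(late, interval_multiplier=4))  # tolerated: 3 <= 1 * 4
```

Under this model, a larger check interval or a larger multiplier widens the window in which a delayed check is still tolerated.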

Disable and reenable the alert

To discard a stale alert, disable and then reenable it:

  1. In the Apache Ambari UI, select the Alerts tab.
  2. Select the alert definition name that you want.
  3. From the definition, select Enabled on the far right of the UI.
  4. In the Confirmation pop-up window, select Confirm Disable.
  5. Wait a few seconds for all the alert instances shown on the page to be cleared.
  6. From the definition, select Disabled on the far right of the UI.
  7. In the Confirmation pop-up window, select Confirm Enable.
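
The disable, wait, and reenable sequence can be sketched in code as follows. This is an illustration only: it assumes that the alert definition resource in the Ambari REST API accepts a body of the form {"AlertDefinition": {"enabled": ...}}, and send is a hypothetical callable that delivers one request body to that resource.

```python
import json
import time
from typing import Callable


def enabled_payload(enabled: bool) -> str:
    """JSON body that sets the enabled flag (assumed field name)."""
    return json.dumps({"AlertDefinition": {"enabled": enabled}})


def reset_alert(send: Callable[[str], None], wait_seconds: float = 5.0) -> None:
    """Disable the alert, wait for its instances to clear, then reenable it."""
    send(enabled_payload(False))
    time.sleep(wait_seconds)  # mirrors the "wait a few seconds" step
    send(enabled_payload(True))


# Dry run: print the bodies instead of sending them.
reset_alert(print, wait_seconds=0)
```

Passing the transport in as a callable keeps the sequence testable without a live cluster.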

Increase the alert grace period

There's a grace period before an Ambari agent reports that a configured alert missed its schedule. If the alert missed its scheduled time but ran within the grace period, the stale alert isn't generated.
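
As a rough illustration of this rule, the following sketch compares a check's scheduled time with the time at which it actually ran. The function name missed_schedule and the use of plain seconds are assumptions made for the example, not the agent's actual implementation.

```python
def missed_schedule(scheduled_at: float, started_at: float,
                    grace_period_seconds: float = 5.0) -> bool:
    """Return True if the check ran later than the grace period allows."""
    delay = started_at - scheduled_at
    return delay > grace_period_seconds


# A check that runs 8 seconds late is missed with a 5-second grace period,
# but not with a 10-second grace period.
print(missed_schedule(100.0, 108.0))
print(missed_schedule(100.0, 108.0, grace_period_seconds=10.0))
```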

The default alert_grace_period value is 5 seconds. You can configure this setting in /etc/ambari-agent/conf/ambari-agent.ini. For hosts on which stale alerts occur at regular intervals, try increasing the value to 10 seconds. Then restart the Ambari agent on those hosts so that the change takes effect.
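
If you need to make this change on several hosts, a small script can help. The following is a minimal sketch, assuming that the file contains a line of the form alert_grace_period=<value>. The helper name set_alert_grace_period is hypothetical. The function edits only that line and leaves comments and other settings untouched.

```python
def set_alert_grace_period(ini_text: str, seconds: int) -> str:
    """Return ini_text with the alert_grace_period value replaced."""
    out = []
    replaced = False
    for line in ini_text.splitlines(keepends=True):
        key, sep, _ = line.partition("=")
        # Match only an active setting, not a commented-out one.
        if sep and key.strip() == "alert_grace_period":
            newline = "\n" if line.endswith("\n") else ""
            line = f"alert_grace_period={seconds}{newline}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError("alert_grace_period setting not found")
    return "".join(out)


# Example on an in-memory sample; the surrounding keys are illustrative.
sample = "[agent]\nloglevel=INFO\nalert_grace_period=5\n"
print(set_alert_grace_period(sample, 10))
```

To apply it, read /etc/ambari-agent/conf/ambari-agent.ini, pass its contents through the function, write the result back with sufficient privileges, and then restart the Ambari agent on that host.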