Apache ZooKeeper server fails to form a quorum in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for ZooKeeper-related issues in Azure HDInsight clusters.


  • Both resource managers go into standby mode
  • Both NameNodes are in standby mode
  • Spark, Hive, and Yarn jobs, or Hive queries, fail because of ZooKeeper connection failures
  • LLAP daemons fail to start on secure Spark or secure Interactive Hive clusters

Sample log

You may see error messages similar to the following in the Yarn logs (/var/log/hadoop-yarn/yarn/yarn-yarn*.log on the head nodes):

2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
2020-05-05 03:17:08.3890350|State store operation failed 
2020-05-05 03:17:08.3890350|Transitioning to standby state
  • High availability services such as Yarn, NameNode, and Livy can go down for many reasons.
  • Confirm from the logs that the failure is related to ZooKeeper connections.
  • Make sure that the issue happens repeatedly; do not apply these solutions to one-off cases.
  • Jobs can fail temporarily because of ZooKeeper connection issues.
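
One way to confirm that the failures are ZooKeeper-related and recurring is to count the relevant messages in the Yarn logs. The following is a minimal sketch, not an official tool: the pattern names and the helper count_zk_events are hypothetical, and the message fragments are taken only from the sample log above, so real logs may need additional patterns.

```python
import glob
import re
from collections import Counter

# Message fragments taken from the sample log; the keys are illustrative names.
ZK_PATTERNS = {
    "lost_contact": re.compile(r"Lost contact with Zookeeper"),
    "no_auth": re.compile(r"KeeperException\$NoAuthException"),
    "to_standby": re.compile(r"Transitioning to standby state"),
}


def count_zk_events(lines):
    """Count the lines that match each ZooKeeper-related pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in ZK_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Log location as given in this article; run on a head node.
    for path in glob.glob("/var/log/hadoop-yarn/yarn/yarn-yarn*.log"):
        with open(path, errors="replace") as handle:
            print(path, dict(count_zk_events(handle)))
```

A handful of matches spread over weeks suggests a one-off event, while many matches per day suggest the recurring problem that this article addresses.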

Common causes for ZooKeeper failure

  • High CPU usage on the ZooKeeper servers
    • In the Ambari UI, if you see sustained CPU usage near 100% on the ZooKeeper servers, the ZooKeeper sessions that are open during that time can expire and time out
  • ZooKeeper clients report frequent timeouts
    • The logs for Resource Manager, NameNode, and other services show frequent client connection timeouts
    • This can result in quorum loss, frequent failovers, and other issues

Check ZooKeeper status

  • Find the ZooKeeper servers in the /etc/hosts file or in the Ambari UI
  • Run the following command
    • echo stat | nc <ZOOKEEPER_HOST_IP> 2181 (or 2182)
    • Port 2181 is used by the Apache ZooKeeper instance
    • Port 2182 is used by the HDInsight ZooKeeper (to provide HA for services that are not natively HA)
    • If the command shows no output, the ZooKeeper server is not running
    • If the server is running, the output includes client connection counts and other statistics, as in the following example:
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT

Latency min/avg/max: 0/3/14865
Received: 238606078
Sent: 239139381
Connections: 18
Outstanding: 0
Zxid: 0x1004f99be
Mode: follower
Node count: 133212
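
The same check can be scripted when several servers and both ports need to be polled. The sketch below is an illustration under stated assumptions: query_stat and parse_stat are hypothetical helper names, and the parser assumes the "Key: value" layout shown in the sample output above.

```python
import socket


def query_stat(host, port=2181, timeout=5.0):
    """Send the four-letter 'stat' command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Turn the 'Key: value' lines of a stat reply into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip blank lines and lines without a 'Key: value' shape
            fields[key.strip()] = value.strip()
    return fields
```

For example, parse_stat(query_stat("<ZOOKEEPER_HOST_IP>", 2182)).get("Mode") returns the role of the HDInsight ZooKeeper on that host. An empty reply or a refused connection corresponds to the "no output" case described above.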

CPU load peaks every hour

  • Log in to the ZooKeeper server and check /etc/crontab
  • If any hourly jobs run at the time of the peak, randomize their start times across the ZooKeeper servers.
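
One possible way to spread the start times is to derive a stable minute offset from each server's host name, so that every server keeps its own slot across reboots. This is only a sketch: the helper start_minute and the host names zk0, zk1, and zk2 are hypothetical, and two hosts can still hash to the same minute, so check the result before editing /etc/crontab.

```python
import hashlib


def start_minute(hostname, window=60):
    """Derive a stable minute offset in [0, window) from a host name."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window


if __name__ == "__main__":
    # Hypothetical host names; replace them with the real ZooKeeper hosts.
    for host in ("zk0", "zk1", "zk2"):
        # Prints the five crontab time fields for an hourly job on this host.
        print(host, f"{start_minute(host)} * * * *")
```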

Purging old snapshots

  • ZooKeeper servers are configured to purge old snapshots automatically
  • By default, the last 30 snapshots are retained
  • The number of snapshots retained is controlled by the configuration key autopurge.snapRetainCount. This property can be found in the following files:
    • /etc/zookeeper/conf/zoo.cfg for Hadoop ZooKeeper
    • /etc/hdinsight-zookeeper/conf/zoo.cfg for HDInsight ZooKeeper
  • Set autopurge.snapRetainCount to a value of 3 and restart the ZooKeeper servers
    • The Hadoop ZooKeeper configuration can be updated, and the service restarted, through Ambari
    • Stop and restart the HDInsight ZooKeeper manually
      • sudo lsof -i :2182 gives you the process ID to kill
      • sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
  • Do not purge snapshots manually; deleting snapshots manually could result in data loss
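
Before and after the change, it can help to confirm the value that each zoo.cfg file actually contains. The following sketch reads the two files named above and reports autopurge.snapRetainCount. The helpers read_zoo_cfg and snap_retain_count are hypothetical names, and the parser assumes the plain key=value layout that zoo.cfg normally uses.

```python
import os


def read_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def snap_retain_count(text):
    """Return autopurge.snapRetainCount as an int, or None if it is not set."""
    value = read_zoo_cfg(text).get("autopurge.snapRetainCount")
    return int(value) if value is not None else None


if __name__ == "__main__":
    # File locations as given in this article.
    for path in ("/etc/zookeeper/conf/zoo.cfg",
                 "/etc/hdinsight-zookeeper/conf/zoo.cfg"):
        if os.path.exists(path):
            with open(path) as handle:
                print(path, snap_retain_count(handle.read()))
```

The script only reads the files. The configuration change itself should still go through Ambari or the manual restart steps listed above.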

CancelledKeyException in the ZooKeeper server log doesn't require snapshot cleanup

  • This exception appears on the ZooKeeper servers (in the /var/log/zookeeper/zookeeper-zookeeper-* or /var/log/hdinsight-zookeeper/zookeeper* files)
  • This exception usually means that the client is no longer active and the server is unable to send it a message
  • This exception also indicates that the ZooKeeper client is ending sessions prematurely
  • Look for the other symptoms outlined in this document

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support: