Apache ZooKeeper server fails to form a quorum in Azure HDInsight
This article describes troubleshooting steps and possible resolutions for issues related to ZooKeeper in Azure HDInsight clusters.
Symptoms
- Both Resource Managers go into standby mode
- Both NameNodes are in standby mode
- Spark, Hive, and Yarn jobs or Hive queries fail because of ZooKeeper connection failures
- LLAP daemons fail to start on secure Spark or secure Interactive Hive clusters
Sample log
You may see an error message similar to the following in the YARN logs (/var/log/hadoop-yarn/yarn/yarn-yarn*.log on the headnodes):
2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
Message
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
...
2020-05-05 03:17:08.3890350|State store operation failed
2020-05-05 03:17:08.3890350|Transitioning to standby state
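As a quick check, you can search the Resource Manager logs on the headnodes for these messages, for example (a sketch only; adjust the log path if your cluster stores them elsewhere):

```bash
# Look for ZooKeeper-related failures in the YARN Resource Manager logs
grep -iE "Lost contact with Zookeeper|STATE_STORE_FENCED|NoAuth" \
    /var/log/hadoop-yarn/yarn/yarn-yarn*.log
```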
Related issues
- High availability services like Yarn, NameNode, and Livy can go down for many reasons.
- Confirm from the logs that the failures are related to ZooKeeper connections.
- Make sure that the issue happens repeatedly (do not use these solutions for one-off cases).
- Jobs can fail temporarily because of ZooKeeper connection issues.
Common causes for ZooKeeper failure
- High CPU usage on the ZooKeeper servers
  - In the Ambari UI, if you see near 100% sustained CPU usage on the ZooKeeper servers, the ZooKeeper sessions that are open during that time can expire and time out
- ZooKeeper clients report frequent timeouts
  - In the logs for the Resource Manager, NameNode, and others, you will see frequent client connection timeouts
  - This can result in quorum loss, frequent failovers, and other issues
Check the ZooKeeper status
- Find the ZooKeeper servers from the /etc/hosts file or from the Ambari UI
- Run the following command:
echo stat | nc <ZOOKEEPER_HOST_IP> 2181
(or 2182)
- Port 2181 is the Apache ZooKeeper instance
- Port 2182 is used by the HDInsight ZooKeeper (to provide HA for services that are not natively HA)
- If the command shows no output, the ZooKeeper servers are not running
- If the servers are running, the result includes statistics for client connections and other server statistics
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
Clients:
/10.2.0.57:50988[1](queued=0,recved=715,sent=715)
/10.2.0.57:46632[1](queued=0,recved=138340,sent=138347)
/10.2.0.34:14688[1](queued=0,recved=264653,sent=353420)
/10.2.0.52:49680[1](queued=0,recved=134812,sent=134814)
/10.2.0.57:50614[1](queued=0,recved=19812,sent=19812)
/10.2.0.56:35034[1](queued=0,recved=2586,sent=2586)
/10.2.0.52:63982[1](queued=0,recved=72215,sent=72217)
/10.2.0.57:53024[1](queued=0,recved=19805,sent=19805)
/10.2.0.57:45126[1](queued=0,recved=19621,sent=19621)
/10.2.0.56:41270[1](queued=0,recved=1348743,sent=1348788)
/10.2.0.53:59097[1](queued=0,recved=72215,sent=72217)
/10.2.0.56:41088[1](queued=0,recved=788,sent=802)
/10.2.0.34:10246[1](queued=0,recved=19575,sent=19575)
/10.2.0.56:40944[1](queued=0,recved=717,sent=717)
/10.2.0.57:45466[1](queued=0,recved=19861,sent=19861)
/10.2.0.57:59634[0](queued=0,recved=1,sent=0)
/10.2.0.34:14704[1](queued=0,recved=264622,sent=353355)
/10.2.0.57:42244[1](queued=0,recved=49245,sent=49248)
Latency min/avg/max: 0/3/14865
Received: 238606078
Sent: 239139381
Connections: 18
Outstanding: 0
Zxid: 0x1004f99be
Mode: follower
Node count: 133212
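To check every server in one pass, a small loop such as the following can help (a sketch only; replace the placeholders with the ZooKeeper hosts found in /etc/hosts or Ambari, and use port 2182 for the HDInsight ZooKeeper):

```bash
# Query each ZooKeeper server with the "stat" four-letter command and print
# its mode. A healthy quorum has exactly one leader; the rest are followers.
for host in <ZOOKEEPER_HOST_IP_1> <ZOOKEEPER_HOST_IP_2> <ZOOKEEPER_HOST_IP_3>; do
    echo "== $host =="
    echo stat | nc -w 5 "$host" 2181 | grep -E "Mode|Latency|Connections" \
        || echo "no response - the ZooKeeper server may be down"
done
```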
CPU load peaks every hour
- Log in to the ZooKeeper server and check /etc/crontab
- If any hourly jobs are running at that time, randomize their start times across the different ZooKeeper servers, for example:
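The idea is simply to give each server a different minute for the same hourly job (an illustration only; the job name below is hypothetical, not something HDInsight installs):

```
# /etc/crontab on ZooKeeper server 0 (hypothetical hourly job)
7  * * * *  root  /usr/local/bin/hourly-maintenance.sh

# /etc/crontab on ZooKeeper server 1 - same job, offset by 20 minutes
27 * * * *  root  /usr/local/bin/hourly-maintenance.sh

# /etc/crontab on ZooKeeper server 2
47 * * * *  root  /usr/local/bin/hourly-maintenance.sh
```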
Purging old snapshots
- ZooKeeper is configured to auto-purge old snapshots
- By default, the last 30 snapshots are retained
- The number of snapshots that are retained is controlled by the configuration key autopurge.snapRetainCount. This property can be found in the following files:
  - /etc/zookeeper/conf/zoo.cfg for the Hadoop ZooKeeper
  - /etc/hdinsight-zookeeper/conf/zoo.cfg for the HDInsight ZooKeeper
- Set autopurge.snapRetainCount to a value of 3 and restart the ZooKeeper servers
  - The Hadoop ZooKeeper config can be updated, and the service restarted, through Ambari
  - Stop and restart the HDInsight ZooKeeper manually (see the sketch after this list):
    - sudo lsof -i :2182 will give you the process ID to kill
    - sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
- Do not purge snapshots manually. Deleting snapshots manually could result in data loss.
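A minimal sketch of the manual check-and-restart sequence, assuming the paths and port shown above (verify them on your cluster first):

```bash
# Check the current snapshot retention setting in both ZooKeeper configs;
# after the change, both should contain autopurge.snapRetainCount=3
grep autopurge /etc/zookeeper/conf/zoo.cfg
grep autopurge /etc/hdinsight-zookeeper/conf/zoo.cfg

# Stop the HDInsight ZooKeeper: kill the process listening on port 2182
sudo kill $(sudo lsof -t -i :2182)

# Start it again with the HDInsight startup script
sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
```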
CancelledKeyException in the ZooKeeper server log doesn't require snapshot cleanup
- This exception appears on the ZooKeeper servers (in the /var/log/zookeeper/zookeeper-zookeeper-* or /var/log/hdinsight-zookeeper/zookeeper* files)
- This exception usually means that the client is no longer active and the server is unable to send it a message
- This exception also indicates that the ZooKeeper client is ending sessions prematurely
- Look for the other symptoms outlined in this article
Next steps
If you didn't see your problem or are unable to solve your issue, visit the following channel for more support:
- If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub. For more detailed information, review How to create an Azure support request.