Apache HBase Master (HMaster) fails to start in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.

Scenario: Atomic renaming failure

Issue

Unexpected files are identified during the startup process.

Cause

During the startup process, HMaster performs many initialization steps, including moving data from the scratch (.tmp) folder to the data folder. HMaster also looks at the write-ahead log (WAL) folder to see if there are any unresponsive region servers.

HMaster does a basic list command on the WAL folders. If at any time HMaster sees an unexpected file in any of these folders, it throws an exception and doesn't start.

Resolution

Check the call stack and try to determine which folder might be causing the problem (for instance, it might be the WAL folder or the .tmp folder). Then, in Cloud Explorer or by using HDFS commands, try to locate the problem file. Usually, this is a *-renamePending.json file. (The *-renamePending.json file is a journal file that's used to implement the atomic rename operation in the WASB driver. Due to bugs in this implementation, these files can be left over after process crashes and similar events.) Force-delete this file, either in Cloud Explorer or by using HDFS commands.
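A minimal sketch of the HDFS commands involved, assuming the default HBase root folder /hbase on HDInsight; take the exact folder and file name from the call stack and the listing rather than from the placeholders shown here:

    # List the suspect folder (the WAL folder is shown; adjust the path for your cluster)
    hdfs dfs -ls /hbase/WALs

    # Force-delete the leftover journal file that the listing reports
    hdfs dfs -rm /hbase/WALs/<subfolder>/<file name>-renamePending.json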

Sometimes, there might also be a temporary file named something like $$$.$$$ at this location. You have to use the HDFS ls command to see this file; you cannot see it in Cloud Explorer. To delete this file, use the HDFS command hdfs dfs -rm /<path>/\$\$\$.\$\$\$.
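A hedged example of that pair of commands, assuming the file sits directly under the WAL folder (substitute the real path from your cluster); quoting the name, or escaping the $ characters as in the command above, keeps the shell from expanding them:

    # The $$$.$$$ file shows up in the HDFS listing even though Cloud Explorer hides it
    hdfs dfs -ls /hbase/WALs

    # Single quotes stop the shell from treating $ as a variable
    hdfs dfs -rm '/hbase/WALs/$$$.$$$'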

After you've run these commands, HMaster should start immediately.


Scenario: No server address listed

Issue

You might see a message indicating that the hbase:meta table is not online. Running hbck might report that hbase:meta table replicaId 0 is not found on any region. In the HMaster logs, you might see the message: No server address listed in hbase:meta for region hbase:backup <region name>.

Cause

HMaster could not initialize after restarting HBase.

Resolution

  1. In the HBase shell, enter the following commands (change the actual values as applicable):

    scan 'hbase:meta'
    delete 'hbase:meta','hbase:backup <region name>','<column name>'
    
  2. Delete the hbase:namespace entry. This entry might be what's causing the same error that's reported when the hbase:namespace table is scanned (see the sketch after these steps).

  3. Restart the active HMaster from the Ambari UI to bring HBase back to a running state.

  4. In the HBase shell, run the following command to bring all offline tables online:

    hbase hbck -ignorePreCheckPermission -fixAssignments
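
A sketch of what step 2 can look like in the HBase shell, assuming the stale hbase:namespace entry is a row in hbase:meta, analogous to the hbase:backup entry in step 1; take the real row key and column name from the scan output instead of the placeholders shown here:

    scan 'hbase:meta'
    # Placeholders only; copy the row key and column for the hbase:namespace region from the scan output
    delete 'hbase:meta','<hbase:namespace region row key>','<column name>'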
    

Scenario: java.io.IOException: Timedout

Issue

HMaster times out with a fatal exception similar to: java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned.

Cause

You might experience this issue if you have many tables and regions that haven't been flushed when you restart the HMaster service. The time-out is a known defect with HMaster. General cluster startup tasks can take a long time, and HMaster shuts down if the namespace table isn't yet assigned. The lengthy startup tasks happen where a large amount of unflushed data exists, and the default time-out of five minutes isn't sufficient.

Resolution

  1. From the Apache Ambari UI, go to HBase > Configs. In the custom hbase-site.xml file, add the following setting (see the sketch after these steps for the property element it corresponds to):

    Key: hbase.master.namespace.init.timeout Value: 2400000  
    
  2. Restart the required services (HMaster, and possibly other HBase services).
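
For reference, a sketch of the property element that the setting in step 1 corresponds to in hbase-site.xml (2400000 ms is the 40-minute value shown above; Ambari writes the file for you, so the XML is illustrative):

    <property>
      <name>hbase.master.namespace.init.timeout</name>
      <value>2400000</value>
    </property>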


Scenario: Frequent region server restarts

Issue

Nodes reboot periodically. In the region server logs, you might see entries similar to:

    2017-05-09 17:45:07,683 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 31000ms
    2017-05-09 17:45:07,683 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 31000ms
    2017-05-09 17:45:07,683 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 31000ms

Cause

Long region server JVM GC pauses. A pause makes the region server unresponsive and unable to send a heartbeat to HMaster within the ZooKeeper session timeout of 40 seconds. HMaster then believes the region server is dead, aborts it, and restarts it.

Resolution

Change the ZooKeeper session timeout. Not only the hbase-site setting zookeeper.session.timeout but also the ZooKeeper zoo.cfg setting maxSessionTimeout needs to be changed.

  1. Access the Ambari UI, go to HBase -> Configs -> Settings, and in the Timeouts section, change the value of Zookeeper Session Timeout.

  2. Access the Ambari UI, go to Zookeeper -> Configs -> Custom zoo.cfg, and add or change the following setting. Make sure the value is the same as the HBase zookeeper.session.timeout (a sketch of both settings follows these steps).

    Key: maxSessionTimeout Value: 120000  
    
  3. Restart the required services.
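
A sketch of how the two settings line up after steps 1 and 2, assuming you raise the session timeout to the 120000 ms (2-minute) value shown above; any value works as long as both settings agree:

    # hbase-site (HBase -> Configs -> Settings -> Timeouts in Ambari)
    zookeeper.session.timeout=120000

    # zoo.cfg (Zookeeper -> Configs -> Custom zoo.cfg in Ambari)
    maxSessionTimeout=120000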


Scenario: Log splitting failure

Issue

HMasters fail to come up on an HBase cluster.

Cause

Misconfigured HDFS and HBase settings for a secondary storage account.

Resolution

Set hbase.rootdir to wasb://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/hbase and restart the services in Ambari.
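
In the same Key/Value form used in the earlier steps, with placeholder container and storage account names that you replace with your cluster's own values:

    Key: hbase.rootdir Value: wasb://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/hbase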


Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub.