Spurious alert on an HDInsight Spark cluster

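A recent Spark release from Hortonworks contains a misconfiguration: the monitoring script alert_spark2_thrift_port.py checks port 10016 by mistake, so Ambari reports that the Spark Thrift server is not working properly. If you see the alert shown below in the Ambari UI of an HDInsight Spark cluster, yet the cluster is reachable and working normally, you can try the fix described in this article. The Microsoft HDInsight Spark product team has already fixed the issue, and the fix will be deployed in the next release.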

Note

Azure HDInsight is a cloud distribution of the Hadoop components provided by the Hortonworks Data Platform (HDP).

Alert error message

Connection failed on host xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016 (Traceback (most recent call last):
    File "/var/lib/ambari-agent/cache/common-services/SPARK2/2.0.0/package/scripts/alerts/alert_spark2_thrift_port.py", line 144, in execute
        Execute(cmd, user=hiveruser, path=[beeline_cmd], timeout=CHECK_COMMAND_TIMEOUT_DEFAULT)
    File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
        self.env.run()
    File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
        self.run_action(resource, action)
    File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
        provider_action()
    File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 262, in action_run
        tries=self.resource.tries, try_sleep=self.resource.try_sleep)
    File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
        result = function(command, **kwargs)
    File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 102, in checked_call
        tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
    File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
        result = _call(command, **kwargs_copy)
    File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 303, in _call
        raise ExecutionFailed(err_msg, code, out, err)
    ExecutionFailed: Execution of '! beeline -u 'jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default' transportMode=http  -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'' returned 1. Error: Could not open client transport with JDBC Uri: jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
    
    Error: Could not open client transport with JDBC Uri: jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
)
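Before editing anything, it may help to confirm that this is the spurious alert described here and not a real outage. The following is a minimal sketch, assuming the Spark Thrift server is actually reachable on port 10002 (the value used in the fix in this article) and that the script is run from a machine that can reach the head node. The helper names `port_is_open` and `describe_thrift_ports` are illustrative and are not part of Ambari or HDInsight.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers "connection refused", timeouts and name-resolution failures.
        return False
    conn.close()
    return True


def describe_thrift_ports(host):
    """Probe the port the alert script checks (10016) and the port used in the fix (10002)."""
    return {port: port_is_open(host, port) for port in (10016, 10002)}


if __name__ == "__main__":
    import sys

    # Pass the head node host name from the alert text as the first argument.
    if len(sys.argv) > 1:
        for port, is_open in sorted(describe_thrift_ports(sys.argv[1]).items()):
            print("%d: %s" % (port, "open" if is_open else "closed"))
```

If 10016 is reported as closed and 10002 as open, the situation matches the one this article covers. A TCP probe only shows that something is listening; it does not prove that the Thrift server itself is healthy.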

Fix

On both head nodes, edit the file: /var/lib/ambari-agent/cache/common-services/SPARK2/2.0.0/package/scripts/alerts/alert_spark2_thrift_port.py

The original script contains:

THRIFT_PORT_DEFAULT = 10016
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'

Change these to the following values:

THRIFT_PORT_DEFAULT = 10002
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'http'

After making the change, refresh Ambari and check whether the alert has cleared.
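Because the same edit has to be made on both head nodes, it can be scripted. The following is a minimal sketch and not an official tool: the helpers `patch_alert_source` and `patch_file` are hypothetical names, and the sketch assumes each of the two constants is assigned on a line of its own, as in the excerpt in this article. It writes a `.bak` copy before changing the file.

```python
import re

# Target values taken from this article; everything else here is illustrative.
REPLACEMENTS = {
    "THRIFT_PORT_DEFAULT": "10002",
    "HIVE_SERVER_TRANSPORT_MODE_DEFAULT": "'http'",
}


def patch_alert_source(source):
    """Return source with the two default constants rewritten to the fixed values."""
    for name, value in REPLACEMENTS.items():
        # Match "NAME = <anything up to end of line>" at the start of a line.
        pattern = re.compile(r"^(%s[ \t]*=[ \t]*)[^\r\n]*" % re.escape(name), re.MULTILINE)
        source = pattern.sub(lambda m, v=value: m.group(1) + v, source)
    return source


def patch_file(path):
    """Patch the file at path in place; return True if anything changed."""
    with open(path) as f:
        original = f.read()
    patched = patch_alert_source(original)
    if patched == original:
        return False
    with open(path + ".bak", "w") as f:
        f.write(original)  # keep a backup of the unmodified script
    with open(path, "w") as f:
        f.write(patched)
    return True
```

To use it, call `patch_file` with the path given above on each head node, using an account that is allowed to write to that file. Running it a second time makes no further changes.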