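Abnormal alert on HDInsight Spark clusters
A recent Spark release from Hortonworks contains a misconfiguration: the monitoring script alert_spark2_thrift_port.py checks port 10016, which is the wrong port, so Ambari reports that Spark Thrift is not working properly. If you see the following alert in the Ambari UI of an HDInsight Spark cluster, but the cluster is reachable and working normally, you can try the fix described in this article. The Microsoft HDInsight Spark product team has already fixed this issue, and the fix will be deployed in the next release.
Note
Azure HDInsight is a cloud distribution of the Hadoop components provided by the Hortonworks Data Platform (HDP).
Alert error message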
Connection failed on host xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016 (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/SPARK2/2.0.0/package/scripts/alerts/alert_spark2_thrift_port.py", line 144, in execute
Execute(cmd, user=hiveruser, path=[beeline_cmd], timeout=CHECK_COMMAND_TIMEOUT_DEFAULT)
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 262, in action_run
tries=self.resource.tries, try_sleep=self.resource.try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 303, in _call
raise ExecutionFailed(err_msg, code, out, err)
ExecutionFailed: Execution of '! beeline -u 'jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default' transportMode=http -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'' returned 1. Error: Could not open client transport with JDBC Uri: jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
Error: Could not open client transport with JDBC Uri: jdbc:hive2://xxx-xxx-xx.tmu2a1fyxcvebfe0i4avtrjj5b.ax.internal.chinacloudapp.cn:10016/default: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
)
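To check that the alert is a false alarm rather than a real outage, it can help to test which of the two ports actually accepts TCP connections. The following is a minimal sketch, not part of the Ambari tooling: the helper name port_is_open is hypothetical, and the only values taken from this article are the ports 10016 and 10002. It sticks to basic socket calls, so it should also run on the older Python interpreters found on cluster nodes.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability
    and does not speak the Thrift or HTTP protocol.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers "Connection refused", timeouts and name resolution errors.
        return False
    conn.close()
    return True


# Example call on a headnode (assumption: the Thrift server is
# reachable under the node's own host name):
#   port_is_open(socket.getfqdn(), 10016)
#   port_is_open(socket.getfqdn(), 10002)
```

If the call for 10016 returns False while the call for 10002 returns True, that matches the "Connection refused" text in the alert and suggests the alert script is simply probing the wrong port.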
Fix
On both headnodes, edit the file: /var/lib/ambari-agent/cache/common-services/SPARK2/2.0.0/package/scripts/alerts/alert_spark2_thrift_port.py
The original script contains:
THRIFT_PORT_DEFAULT = 10016
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'
Change them to the following values:
THRIFT_PORT_DEFAULT = 10002
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'http'
After making the change, refresh Ambari and check whether the alert has cleared.
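Because the same edit has to be made on both headnodes, it can be scripted. The sketch below is one possible way to do it, under stated assumptions: the function names patch_defaults and patch_file and the ".bak" backup copy are illustrative choices, not part of Ambari or HDInsight. Only the file path and the two constant values come from this article.

```python
import re
import shutil

# Path of the alert script, as given in this article.
ALERT_SCRIPT = ("/var/lib/ambari-agent/cache/common-services/SPARK2/2.0.0/"
                "package/scripts/alerts/alert_spark2_thrift_port.py")

# New right-hand sides for the two constants, as given in this article.
NEW_VALUES = {
    "THRIFT_PORT_DEFAULT": "10002",
    "HIVE_SERVER_TRANSPORT_MODE_DEFAULT": "'http'",
}


def patch_defaults(source):
    """Return the script text with the two default constants rewritten."""
    for name, value in NEW_VALUES.items():
        # Match a whole assignment line, keeping indentation and "NAME = ".
        pattern = re.compile(
            r"^([ \t]*" + re.escape(name) + r"[ \t]*=[ \t]*).*$",
            re.MULTILINE)
        source = pattern.sub(lambda m, v=value: m.group(1) + v, source)
    return source


def patch_file(path):
    """Back up the file to path + '.bak' (an assumption), then patch it."""
    shutil.copy2(path, path + ".bak")
    with open(path) as handle:
        original = handle.read()
    with open(path, "w") as handle:
        handle.write(patch_defaults(original))
```

Calling patch_file(ALERT_SCRIPT) on each headnode applies the change; it will most likely need to run with enough privileges to write under /var/lib/ambari-agent. Running it twice is harmless, since an already patched file is rewritten with the same values.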