Troubleshoot script actions in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.

Viewing logs

You can use the Apache Ambari web UI to view information logged by script actions. If the script fails during cluster creation, the logs are in the default cluster storage account. This section provides information on how to retrieve the logs by using both of these options.

Apache Ambari web UI

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn, where CLUSTERNAME is the name of your cluster.

  2. From the bar at the top of the page, select the ops entry. A list displays current and previous operations done on the cluster through Ambari.

    Ambari web UI bar with the ops entry selected

  3. Find the entries that have run_customscriptaction in the Operations column. These entries are created when the script actions run.

    Apache Ambari script action operation

    To view the STDOUT and STDERR output, select the run\customscriptaction entry and drill down through the links. This output is generated when the script runs and might contain useful information.
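If you prefer the command line, the same operation list and task output are also exposed through the Ambari REST API. The following is a minimal sketch, not part of the original steps: it assumes the default cluster login account admin (curl prompts for the password), that CLUSTERNAME is both the cluster DNS name and the Ambari cluster name (they normally match on HDInsight), and that REQUEST_ID is an operation ID taken from the first call.

    # Sketch: list recent Ambari operations (requests) and their status.
    curl -u admin -sS "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/requests?fields=Requests/request_context,Requests/request_status"

    # Sketch: drill into one operation to read the per-task stdout and stderr.
    curl -u admin -sS "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/requests/REQUEST_ID/tasks?fields=Tasks/stdout,Tasks/stderr"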

Default storage account

If cluster creation fails because of a script error, the logs are kept in the cluster storage account.

  • The storage logs are available at \STORAGE_ACCOUNT_NAME\DEFAULT_CONTAINER_NAME\custom-scriptaction-logs\CLUSTER_NAME\DATE. A command-line sketch for downloading these logs follows this list.

    Script action logs

    Under this directory, the logs are organized separately for headnode, worker node, and zookeeper node. See the following examples:

    • Headnode: <ACTIVE-HEADNODE-NAME>.chinacloudapp.cn

    • Worker node: <ACTIVE-WORKERNODE-NAME>.chinacloudapp.cn

    • Zookeeper node: <ACTIVE-ZOOKEEPERNODE-NAME>.chinacloudapp.cn

  • All stdout and stderr output from the corresponding host is uploaded to the storage account. There's one output-*.txt and one errors-*.txt for each script action. The output-*.txt file contains information about the URI of the script that was run on the host. The following text is an example of this information:

      'Start downloading script locally: ', u'https://hdiconfigactions.blob.core.windows.net/linuxrconfigactionv01/r-installer-v01.sh'
    
  • It's possible that you repeatedly create a cluster with the same name that runs script actions. In that case, you can distinguish the relevant logs based on the DATE folder name. For example, the folder structure for a cluster named mycluster created on different dates looks similar to the following entries:

    \STORAGE_ACCOUNT_NAME\DEFAULT_CONTAINER_NAME\custom-scriptaction-logs\mycluster\2015-10-04
    \STORAGE_ACCOUNT_NAME\DEFAULT_CONTAINER_NAME\custom-scriptaction-logs\mycluster\2015-10-05

  • If you create a cluster with the same name on the same day, you can use the unique prefix to identify the relevant log files.

  • If you create a cluster close to midnight (12:00 AM), the log files might span two days. In that case, you see two different date folders for the same cluster.

  • Uploading log files to the default container can take up to five minutes, especially for large clusters. So if you want to access the logs, don't delete the cluster immediately after a script action fails.
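Rather than browsing the container in the portal, you can copy the logs down locally. The following is a minimal Azure CLI sketch, not taken from the product documentation: STORAGE_ACCOUNT_NAME, DEFAULT_CONTAINER_NAME, CLUSTER_NAME, and the date are placeholders, and it assumes your signed-in identity can read the blobs (otherwise pass --account-key instead of --auth-mode login).

    # Sketch: download all script action logs for one cluster and one date.
    mkdir -p ./scriptaction-logs
    az storage blob download-batch \
        --account-name STORAGE_ACCOUNT_NAME \
        --source DEFAULT_CONTAINER_NAME \
        --pattern "custom-scriptaction-logs/CLUSTER_NAME/2015-10-04/*" \
        --destination ./scriptaction-logs \
        --auth-mode login

Because this article targets the Azure China environment, you might also need to run az cloud set --name AzureChinaCloud before signing in with the CLI.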

Ambari watchdog

Don't change the password for the Ambari watchdog (hdinsightwatchdog) on your Linux-based HDInsight cluster. A password change breaks the ability to run new script actions on the HDInsight cluster.

Can't import name BlobService

Symptoms: The script action fails. Text similar to the following error displays when you view the operation in Ambari:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 21, in <module>
    from azure.storage.blob import BlobService
ImportError: cannot import name BlobService

Cause: This error occurs if you upgrade the Python Azure Storage client that's included with the HDInsight cluster. HDInsight expects Azure Storage client 0.20.0.

Resolution: To resolve this error, manually connect to each cluster node by using SSH. Run the following command to reinstall the correct storage client version:

sudo pip install azure-storage==0.20.0

For information on connecting to the cluster with SSH, see Connect to HDInsight (Apache Hadoop) by using SSH.
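After reinstalling, a quick sanity check on a node is to confirm the package version and retry the import that originally failed. This is only an illustrative sketch run over the same SSH session, not an official step:

    # Confirm the reinstalled package reports version 0.20.0.
    pip show azure-storage
    # Retry the import that the script action traceback complained about.
    python -c "from azure.storage.blob import BlobService; print('BlobService import OK')"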

History doesn't show the scripts used during cluster creation

If your cluster was created before March 15, 2016, you might not see an entry in script action history. Resizing the cluster causes the scripts to appear in script action history.

There are two exceptions:

  • Your cluster was created before September 1, 2015. This date is when script actions were introduced. Any cluster created before this date couldn't have used script actions for cluster creation.

  • You used multiple script actions during cluster creation, and either you used the same name for multiple scripts, or you used the same name and URI but different parameters for multiple scripts. In these cases, you get the following error:

    No new script actions can be run on this cluster because of conflicting script names in existing scripts. Script names provided at cluster creation must be all unique. Existing scripts are run on resize.
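If you want to check what the cluster currently reports in script action history (for example, after a resize), one option is the Azure CLI. The following is a hedged sketch rather than a documented step for this scenario: command availability can vary with your CLI version, and RESOURCE_GROUP and CLUSTERNAME are placeholders.

    # Sketch: list persisted script actions and past script action executions.
    az hdinsight script-action list --cluster-name CLUSTERNAME --resource-group RESOURCE_GROUP
    az hdinsight script-action list-execution-history --cluster-name CLUSTERNAME --resource-group RESOURCE_GROUP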
    

Next steps

If you didn't see your problem listed here or are unable to solve your issue, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar, or open the Help + support hub. For more information, see How to create an Azure support request.