Troubleshoot a slow or failing job on an HDInsight cluster

If an application processing data on an HDInsight cluster is either running slowly or failing with an error code, you have several troubleshooting options. If your jobs are taking longer to run than expected, or you are seeing slow response times in general, there may be failures upstream from your cluster, such as in the services on which the cluster runs. However, the most common cause of these slowdowns is insufficient scaling. When you create a new HDInsight cluster, select the appropriate virtual machine sizes.

To diagnose a slow or failing cluster, gather information about all aspects of the environment, such as associated Azure services, cluster configuration, and job execution information. A helpful diagnostic is to try to reproduce the error state on another cluster.

  • Step 1: Gather data about the issue
  • Step 2: Validate the HDInsight cluster environment
  • Step 3: View your cluster's health
  • Step 4: Review the environment stack and versions
  • Step 5: Examine the cluster log files
  • Step 6: Check configuration settings
  • Step 7: Reproduce the failure on a different cluster

Step 1: Gather data about the issue

HDInsight provides many tools that you can use to identify and troubleshoot issues with clusters. The following steps guide you through these tools and provide suggestions for pinpointing the issue.

Identify the problem

To help identify the problem, consider the following questions:

  • What did I expect to happen? What happened instead?
  • How long did the process take to run? How long should it have run?
  • Have my tasks always run slowly on this cluster? Did they run faster on a different cluster?
  • When did this problem first occur? How often has it happened since?
  • Has anything changed in my cluster configuration?

Cluster details

Important cluster information includes:

  • Cluster name.
  • Cluster region. Check for region outages.
  • HDInsight cluster type and version.
  • Type and number of HDInsight instances specified for the head and worker nodes.

The Azure portal can provide this information:

HDInsight - Azure portal information

You can also use Azure CLI:

az hdinsight list --resource-group <ResourceGroup>
az hdinsight show --resource-group <ResourceGroup> --name <ClusterName>

Another option is using PowerShell. For more information, see Manage Apache Hadoop clusters in HDInsight with Azure PowerShell.

Step 2: Validate the HDInsight cluster environment

Each HDInsight cluster relies on various Azure services, and on open-source software such as Apache HBase and Apache Spark. HDInsight clusters can also call on other Azure services, such as Azure Virtual Networks. A cluster failure can be caused by any of the running services on your cluster, or by an external service. A cluster service configuration change can also cause the cluster to fail.

Service details

  • Check the open-source library release versions
  • Check for Azure service outages
  • Check for Azure service usage limits
  • Check the Azure Virtual Network subnet configuration

View cluster configuration settings with the Ambari UI

Apache Ambari provides management and monitoring of an HDInsight cluster with a web UI and a REST API. Ambari is included on Linux-based HDInsight clusters. Select the Cluster Dashboard pane on the Azure portal HDInsight page. Select the HDInsight cluster dashboard pane to open the Ambari UI, and enter the cluster login credentials.

Ambari UI

To open a list of service views, select Ambari Views on the Azure portal page. This list depends on which libraries are installed. For example, you may see YARN Queue Manager, Hive View, and Tez View. Select a service link to see configuration and service information.

Check for Azure service outages

HDInsight relies on several Azure services. It runs virtual servers on Azure HDInsight, stores data and scripts on Azure Blob storage or Azure Data Lake Storage, and indexes log files in Azure Table storage. Disruptions to these services, although rare, can cause issues in HDInsight. If you have unexpected slowdowns or failures in your cluster, check the Azure Status Dashboard. The status of each service is listed by region. Check your cluster's region and also the regions for any related services.

Check Azure service usage limits

If you are launching a large cluster, or have launched many clusters simultaneously, a cluster can fail if you have exceeded an Azure service limit. Service limits vary, depending on your Azure subscription. For more information, see Azure subscription and service limits, quotas, and constraints. You can request that Microsoft increase the number of HDInsight resources available (such as VM cores and VM instances) with a Resource Manager core quota increase request.

Check the release version

Compare the cluster version with the latest HDInsight release. Each HDInsight release includes improvements such as new applications, features, patches, and bug fixes. The issue that is affecting your cluster may have been fixed in the latest release version. If possible, re-run your cluster using the latest version of HDInsight and associated libraries such as Apache HBase, Apache Spark, and others.

Restart your cluster services

If you are experiencing slowdowns in your cluster, consider restarting your services through the Ambari UI or the Azure Classic CLI. The cluster may be experiencing transient errors, and restarting is the quickest way to stabilize your environment and possibly improve performance.

Step 3: View your cluster's health

HDInsight clusters are composed of different types of nodes running on virtual machine instances. Each node can be monitored for resource starvation, network connectivity issues, and other problems that can slow down the cluster. Every cluster contains two head nodes, and most cluster types contain a combination of worker and edge nodes.

For a description of the various nodes each cluster type uses, see Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more.

The following sections describe how to check the health of each node and of the overall cluster.

Get a snapshot of the cluster health using the Ambari UI dashboard

The Ambari UI dashboard (https://<clustername>.azurehdinsight.cn) provides an overview of cluster health, such as uptime, memory, network and CPU usage, HDFS disk usage, and so forth. Use the Hosts section of Ambari to view resources at a host level. You can also stop and restart services.

Check your WebHCat service

One common scenario for Apache Hive, Apache Pig, or Apache Sqoop jobs failing is a failure with the WebHCat (or Templeton) service. WebHCat is a REST interface for remote job execution, such as Hive, Pig, Sqoop, and MapReduce. WebHCat translates the job submission requests into Apache Hadoop YARN applications, and returns a status derived from the YARN application status. The following sections describe common WebHCat HTTP status codes.

BadGateway (502 status code)

This code is a generic message from gateway nodes, and is the most common failure status code. One possible cause is that the WebHCat service is down on the active head node. To check for this possibility, use the following CURL command:

curl -u "admin:{HTTP PASSWD}" "https://{CLUSTERNAME}.azurehdinsight.cn/templeton/v1/status?user.name=admin"
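
As a rough sketch, the HTTP status code returned by that request can be mapped to the troubleshooting hints in this article. The function name `interpret_webhcat_status` and the variables `PASSWORD` and `CLUSTERNAME` are hypothetical and chosen only for illustration. The hints summarize this article and are not an exhaustive list of WebHCat behaviors.

```bash
# Hypothetical helper: turn an HTTP status code from the WebHCat status
# endpoint into a short hint based on the guidance in this article.
interpret_webhcat_status() {
  case "$1" in
    200) echo "WebHCat responded normally" ;;
    500) echo "WebHCat error: read the response body, then webhcat.log" ;;
    502) echo "BadGateway: WebHCat may be down on the active head node, or the request timed out" ;;
    *)   echo "Unexpected status $1: check Ambari alerts and webhcat.log" ;;
  esac
}

# Possible use against a real cluster (PASSWORD and CLUSTERNAME are
# placeholders you would set yourself):
# code=$(curl -s -o /dev/null -w '%{http_code}' -u "admin:$PASSWORD" \
#   "https://$CLUSTERNAME.azurehdinsight.cn/templeton/v1/status?user.name=admin")
# interpret_webhcat_status "$code"
```

Keeping the mapping separate from the network call lets you try it with a code copied from a client log, without access to the cluster.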

Ambari displays an alert showing the hosts on which the WebHCat service is down. You can try to bring the WebHCat service back up by restarting the service on its host.

Restart WebHCat server

If a WebHCat server still does not come up, check the operations log for failure messages. For more detailed information, check the stderr and stdout files referenced on the node.

WebHCat times out

The HDInsight gateway times out responses that take longer than two minutes and returns 502 BadGateway. WebHCat queries YARN services for job statuses, and if YARN takes longer than two minutes to respond, that request can time out.

In this case, review the following logs in the /var/log/webhcat directory:

  • webhcat.log is the log4j log to which the server writes logs
  • webhcat-console.log is the stdout of the server when started
  • webhcat-console-error.log is the stderr of the server process

Note

Each webhcat.log is rolled over daily, generating files named webhcat.log.YYYY-MM-DD. Select the appropriate file for the time range you are investigating.

The following sections describe some possible causes for WebHCat timeouts.

WebHCat level timeout

When WebHCat is under load, with more than 10 open sockets, it takes longer to establish new socket connections, which can result in a timeout. To list the network connections to and from WebHCat, use netstat on the current active head node:

netstat | grep 30111

30111 is the port WebHCat listens on. The number of open sockets should be less than 10.

If there are no open sockets, the previous command does not produce a result. To check if Templeton is up and listening on port 30111, use:

netstat -l | grep 30111
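
The socket check above can be turned into a count and compared with the threshold of 10 mentioned in this section. The following is a minimal sketch: the function names `count_webhcat_sockets` and `check_webhcat_load` are hypothetical, and the pattern assumes netstat prints addresses as host:port followed by whitespace.

```bash
# Hypothetical helper: count netstat lines that mention the WebHCat port.
# Reads netstat output on stdin, so it also works on saved output.
count_webhcat_sockets() {
  local port="${1:-30111}"
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
  grep -c ":${port}[[:space:]]" || true
}

# Hypothetical helper: compare the count with the threshold from this article.
check_webhcat_load() {
  local n
  n=$(count_webhcat_sockets "${1:-30111}")
  if [ "$n" -ge 10 ]; then
    echo "high: $n open sockets"
  else
    echo "ok: $n open sockets"
  fi
}

# Possible use on the active head node:
# netstat | check_webhcat_load
```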

YARN level timeout

Templeton calls YARN to run jobs, and the communication between Templeton and YARN can cause a timeout.

At the YARN level, there are two types of timeouts:

  1. Submitting a YARN job can take long enough to cause a timeout.

    If you open the /var/log/webhcat/webhcat.log log file and search for "queued job", you may see multiple entries where the execution time is excessively long (>2000 ms), with entries showing increasing wait times.

    The time for the queued jobs continues to increase because the rate at which new jobs get submitted is higher than the rate at which the old jobs are completed. Once the YARN memory is 100% used, the joblauncher queue can no longer borrow capacity from the default queue. Therefore, no more new jobs can be accepted into the joblauncher queue. This behavior can cause the waiting time to become longer and longer, causing a timeout error that is usually followed by many others.

    The following image shows the joblauncher queue at 714.4% overused. This is acceptable so long as there is still free capacity in the default queue to borrow from. However, when the cluster is fully utilized and the YARN memory is at 100% capacity, new jobs must wait, which eventually causes timeouts.

    Joblauncher queue

    There are two ways to resolve this issue: either reduce the rate at which new jobs are submitted, or increase the rate at which old jobs are consumed by scaling up the cluster.

  2. YARN processing can take a long time, which can cause timeouts.

    • List all jobs: This is a time-consuming call. This call enumerates the applications from the YARN ResourceManager, and for each completed application, gets the status from the YARN JobHistoryServer. With higher numbers of jobs, this call can time out.

    • List jobs older than seven days: The HDInsight YARN JobHistoryServer is configured to retain completed job information for seven days (mapreduce.jobhistory.max-age-ms value). Trying to enumerate purged jobs results in a timeout.
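
Because mapreduce.jobhistory.max-age-ms is expressed in milliseconds, it can help to convert between days and milliseconds when comparing the setting with a job's age. The sketch below is plain shell arithmetic. The function names `days_to_ms` and `ms_to_days` are hypothetical, and the seven-day figure is the retention period stated above.

```bash
# Hypothetical helpers: convert between days and milliseconds.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

ms_to_days() {
  echo $(( $1 / 86400000 ))
}

# Seven days of retention expressed in milliseconds:
days_to_ms 7
```

A job that completed longer ago than this many milliseconds has likely been purged from the JobHistoryServer, so listing it can time out.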

To diagnose these issues:

  1. Determine the UTC time range to troubleshoot
  2. Select the appropriate webhcat.log file(s)
  3. Look for WARN and ERROR messages during that time
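
The steps above can be sketched as a small shell function that scans the rolled-over log files for the dates of interest. This is an illustration under assumptions: the function name `scan_webhcat_logs` and the keyword `current` (meaning the not-yet-rolled webhcat.log) are hypothetical, and the match is a plain text search for WARN or ERROR rather than a parser for the log4j format.

```bash
# Hypothetical helper: print WARN and ERROR lines from the webhcat.log
# files that cover the given UTC dates (YYYY-MM-DD), or "current" for
# the file that has not been rolled over yet.
scan_webhcat_logs() {
  local log_dir="$1"
  shift
  local day file
  for day in "$@"; do
    if [ "$day" = "current" ]; then
      file="$log_dir/webhcat.log"
    else
      file="$log_dir/webhcat.log.$day"
    fi
    # Skip dates for which no log file exists.
    [ -f "$file" ] || continue
    # -H prefixes each match with the file name; ignore "no match" exits.
    grep -H -E 'WARN|ERROR' "$file" || true
  done
}

# Possible use on a head node (the dates are examples only):
# scan_webhcat_logs /var/log/webhcat 2020-01-02 current
```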

Other WebHCat failures

  1. HTTP status code 500

    In most cases where WebHCat returns 500, the error message contains details on the failure. Otherwise, look through webhcat.log for WARN and ERROR messages.

  2. Job failures

    There may be cases where interactions with WebHCat are successful, but the jobs are failing.

    Templeton collects the job console output as stderr in statusdir, which is often useful for troubleshooting. stderr contains the YARN application identifier of the actual query.

Step 4: Review the environment stack and versions

The Ambari UI Stack and Versions page provides information about cluster services configuration and service version history. Incorrect Hadoop service library versions can be a cause of cluster failure. In the Ambari UI, select the Admin menu and then Stack and Versions. Select the Versions tab on the page to see service version information:

Stack and Versions

Step 5: Examine the log files

The many services and components that make up an HDInsight cluster generate many types of logs. WebHCat log files are described in the previous step. There are several other useful log files you can investigate to narrow down issues with your cluster, as described in the following sections.

  • HDInsight clusters consist of several nodes, most of which are tasked to run submitted jobs. Jobs run concurrently, but log files can only display results linearly. HDInsight executes new tasks, terminating others that fail to complete first. All this activity is logged to the stderr and syslog files.

  • The script action log files show errors or unexpected configuration changes during your cluster's creation process.

  • The Hadoop step logs identify Hadoop jobs launched as part of a step containing errors.

Check the script action logs

HDInsight script actions run scripts on the cluster manually or when specified. For example, script actions can be used to install additional software on the cluster or to alter configuration settings from the default values. Checking the script action logs can provide insight into errors that occurred during cluster setup and configuration. You can view the status of a script action by selecting the ops button in the Ambari UI, or by accessing the logs from the default storage account.

The script action logs reside in the \STORAGE_ACCOUNT_NAME\DEFAULT_CONTAINER_NAME\custom-scriptaction-logs\CLUSTER_NAME\DATE directory.

The HDInsight Ambari UI includes a number of Quick Links sections. To access the log links for a particular service in your HDInsight cluster, open the Ambari UI for your cluster, then select the service link from the list at left. Select the Quick Links dropdown, then the HDInsight node of interest, and then select the link for its associated log.

For example, for HDFS logs:

Ambari quick links to log files

View Hadoop-generated log files

An HDInsight cluster generates logs that are written to Azure tables and Azure Blob storage. YARN creates its own execution logs. For more information, see Manage logs for an HDInsight cluster.

Review heap dumps

Heap dumps contain a snapshot of the application's memory, including the values of variables at that time, which are useful for diagnosing problems that occur at runtime. For more information, see Enable heap dumps for Apache Hadoop services on Linux-based HDInsight.

Step 6: Check configuration settings

HDInsight clusters are pre-configured with default settings for related services, such as Hadoop, Hive, HBase, and so on. Depending on the type of cluster, its hardware configuration, its number of nodes, the types of jobs you are running, and the data you are working with (and how that data is being processed), you may need to optimize your configuration.

For detailed instructions on optimizing performance configurations for most scenarios, see Optimize cluster configurations with Apache Ambari. When using Spark, see Optimize Apache Spark jobs for performance.

Step 7: Reproduce the failure on a different cluster

To help diagnose the source of a cluster error, start a new cluster with the same configuration and then resubmit the failed job's steps one by one. Check the results of each step before processing the next one. This method gives you the opportunity to correct and re-run a single failed step. This method also has the advantage of only loading your input data once.

  1. Create a new test cluster with the same configuration as the failed cluster.
  2. Submit the first job step to the test cluster.
  3. When the step completes processing, check for errors in the step log files. Connect to the test cluster's head node and view the log files there. The step log files only appear after the step runs for some time, finishes, or fails.
  4. If the first step succeeded, run the next step. If there were errors, investigate the error in the log files. If it was an error in your code, make the correction and re-run the step.
  5. Continue until all steps run without error.
  6. When you are done debugging the test cluster, delete it.

Next steps