Scale HDInsight clusters

HDInsight provides elasticity by letting you scale the number of worker nodes in your clusters up and down. This elasticity allows you to shrink a cluster after hours or on weekends, and expand it during peak business demand.

If you run periodic batch processing, you can scale the HDInsight cluster up a few minutes before that operation so that the cluster has adequate memory and CPU power. Later, after the processing is done and usage goes down again, you can scale the HDInsight cluster down to fewer worker nodes.
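
The schedule-driven pattern described above can be sketched as a small Python function that picks a worker node count from the current time. This is only an illustration: the node counts and the peak-hour window are hypothetical values, not HDInsight defaults, and the function does not call any Azure API.

```python
from datetime import datetime

# Hypothetical sizes and hours; tune them to your own workload.
PEAK_WORKERS = 10
OFF_PEAK_WORKERS = 3
PEAK_START_HOUR = 8   # inclusive
PEAK_END_HOUR = 18    # exclusive


def target_worker_count(now: datetime) -> int:
    """Pick a worker node count from the day of the week and the hour."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_peak_hours = PEAK_START_HOUR <= now.hour < PEAK_END_HOUR
    if is_weekday and in_peak_hours:
        return PEAK_WORKERS
    return OFF_PEAK_WORKERS
```

A scheduler could call `target_worker_count(datetime.now())` shortly before each batch window and pass the result to whichever scaling utility you use.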

You can scale a cluster manually using one of the methods outlined below.

Note

Only clusters with HDInsight version 3.1.3 or higher are supported. If you are unsure of the version of your cluster, you can check the Properties page.

Utilities to scale clusters

Microsoft provides the following utilities to scale clusters:

Utility | Description
PowerShell Az | Set-AzHDInsightClusterSize -ClusterName <Cluster Name> -TargetInstanceCount <NewSize>
PowerShell AzureRM | Set-AzureRmHDInsightClusterSize -ClusterName <Cluster Name> -TargetInstanceCount <NewSize>
Azure CLI | az hdinsight resize --resource-group <Resource group> --name <Cluster Name> --workernode-count <NewSize>
Azure classic CLI | azure hdinsight cluster resize <clusterName> <Target Instance Count>
Azure portal | Open your HDInsight cluster pane, select Cluster size on the left-hand menu, then on the Cluster size pane, type in the number of worker nodes, and select Save.

[Screenshot: Azure portal scale cluster option]

Using any of these methods, you can scale your HDInsight cluster up or down within minutes.
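
If you script the resize, one option is to assemble the Azure CLI call from the table above in Python. The sketch below only builds the argument list for `az hdinsight resize` with the flags shown in the table; the resource group and cluster names are placeholders, and running the command assumes the Azure CLI is installed and signed in.

```python
import shlex
from typing import List


def build_resize_command(resource_group: str, cluster_name: str,
                         worker_count: int) -> List[str]:
    """Return the argv for `az hdinsight resize` with the flags from the table."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    return [
        "az", "hdinsight", "resize",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--workernode-count", str(worker_count),
    ]


# Placeholder names, for illustration only.
command = build_resize_command("my-resource-group", "my-cluster", 5)
print(shlex.join(command))

# To run it for real (requires the Azure CLI and a signed-in session):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when names contain spaces or special characters.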

Important

  • The Azure classic CLI is deprecated and should only be used with the classic deployment model. For all other deployments, use the Azure CLI.
  • The PowerShell AzureRM module is deprecated. Use the Az module whenever possible.

Impact of scaling operations

When you add nodes to your running HDInsight cluster (scale up), pending and running jobs are not affected. New jobs can be safely submitted while the scaling process is running. If the scaling operation fails for any reason, the failure is handled so that your cluster is left in a functional state.

If you remove nodes (scale down), any pending or running jobs will fail when the scaling operation completes. This failure occurs because some of the services restart during the scaling process. There is also a risk that your cluster can get stuck in safe mode during a manual scaling operation.

The impact of changing the number of data nodes varies for each type of cluster supported by HDInsight:

  • Apache Hadoop

    You can seamlessly increase the number of worker nodes in a running Hadoop cluster without impacting any pending or running jobs. New jobs can also be submitted while the operation is in progress. Failures in a scaling operation are handled gracefully so that the cluster is always left in a functional state.

    When a Hadoop cluster is scaled down by reducing the number of data nodes, some of the services in the cluster are restarted. This behavior causes all running and pending jobs to fail when the scaling operation completes. You can, however, resubmit the jobs once the operation is complete.

  • Apache HBase

    You can seamlessly add nodes to or remove nodes from your HBase cluster while it is running. Region servers are automatically balanced within a few minutes of completing the scaling operation. However, you can also manually balance the region servers by logging in to the head node of the cluster and running the following commands from a command prompt window:

    pushd %HBASE_HOME%\bin
    hbase shell
    balancer
    

    For more information on using the HBase shell, see Get started with an Apache HBase example in HDInsight.

  • Apache Storm

    You can seamlessly add or remove data nodes in your Storm cluster while it is running. However, after the scaling operation completes successfully, you need to rebalance the topology.

    Rebalancing can be accomplished in two ways:

    • Storm web UI

    • Command-line interface (CLI) tool

      Refer to the Apache Storm documentation for more details.

      The Storm web UI is available on the HDInsight cluster:

      [Screenshot: HDInsight Storm scale rebalance]

      Here is an example CLI command to rebalance the Storm topology:

      ## Reconfigure the topology "mytopology" to use 5 worker processes,
      ## the spout "blue-spout" to use 3 executors, and
      ## the bolt "yellow-bolt" to use 10 executors
      $ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
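
If you rebalance several topologies after each scaling operation, the command above can be generated rather than typed by hand. The following Python sketch assembles a `storm rebalance` argument list using only the `-n` and `-e` options shown in the example; the function name and its parameters are illustrative, not part of Storm.

```python
from typing import Dict, List, Optional


def build_rebalance_command(topology: str,
                            workers: Optional[int] = None,
                            executors: Optional[Dict[str, int]] = None) -> List[str]:
    """Return the argv for `storm rebalance` using the -n and -e options."""
    command = ["storm", "rebalance", topology]
    if workers is not None:
        # -n sets the number of worker processes for the topology.
        command += ["-n", str(workers)]
    for component, count in (executors or {}).items():
        # -e sets the executor count for one spout or bolt.
        command += ["-e", f"{component}={count}"]
    return command


# Reproduces the example above for the topology "mytopology".
example = build_rebalance_command(
    "mytopology", workers=5,
    executors={"blue-spout": 3, "yellow-bolt": 10},
)
```

The resulting list can be passed to `subprocess.run` on a node where the `storm` client is available.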
      

How to safely scale down a cluster

Scale down a cluster with running jobs

To avoid having your running jobs fail during a scale-down operation, you can try one of three things:

  1. Wait for the jobs to complete before scaling down your cluster.
  2. Manually end the jobs.
  3. Resubmit the jobs after the scaling operation has concluded.

To see a list of pending and running jobs, you can use the YARN ResourceManager UI by following these steps:

  1. From the Azure portal, select your cluster. See List and show clusters for instructions. The cluster opens in a new portal page.

  2. From the main view, navigate to Cluster dashboards > Ambari home. Enter your cluster credentials.

  3. From the Ambari UI, select YARN in the list of services on the left-hand menu.

  4. From the YARN page, select Quick Links, hover over the active head node, and then select ResourceManager UI.

    [Screenshot: ResourceManager UI]

You can access the ResourceManager UI directly at https://<HDInsightClusterName>.azurehdinsight.cn/yarnui/hn/cluster.

You see a list of jobs, along with their current state. In the screenshot, there's one job currently running:

[Screenshot: ResourceManager UI applications]

To manually kill that running application, execute the following command from the SSH shell:

yarn application -kill <application_id>

For example:

yarn application -kill "application_1499348398273_0003"
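
When several applications are running, you may prefer to collect their IDs from a listing instead of copying them one at a time. The Python sketch below assumes the listing is the text output of `yarn application -list -appStates RUNNING,ACCEPTED` and relies only on the `application_<timestamp>_<sequence>` ID pattern shown in the example above; the function names are illustrative.

```python
import re
from typing import List

# Matches IDs such as application_1499348398273_0003.
APPLICATION_ID = re.compile(r"\bapplication_\d+_\d+\b")


def extract_application_ids(listing: str) -> List[str]:
    """Return unique application IDs in order of first appearance."""
    seen: List[str] = []
    for app_id in APPLICATION_ID.findall(listing):
        if app_id not in seen:  # the same ID can also appear in the tracking URL
            seen.append(app_id)
    return seen


def build_kill_commands(listing: str) -> List[List[str]]:
    """Return one `yarn application -kill <id>` argv per listed application."""
    return [["yarn", "application", "-kill", app_id]
            for app_id in extract_application_ids(listing)]
```

Review the extracted IDs before killing anything, since the listing includes every application in the requested states, not only the ones you own.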

Getting stuck in safe mode

When you scale down a cluster, HDInsight uses Apache Ambari management interfaces to first decommission the extra worker nodes, which replicate their HDFS blocks to other online worker nodes. After that, HDInsight safely scales the cluster down. HDFS goes into safe mode during the scaling operation and is supposed to come out once the scaling is finished. In some cases, however, HDFS gets stuck in safe mode during a scaling operation because of file block under-replication.

By default, HDFS is configured with a dfs.replication setting of 1, which controls how many copies of each file block are available. Each copy of a file block is stored on a different node of the cluster.

When HDFS detects that the expected number of block copies isn't available, HDFS enters safe mode and Ambari generates alerts. If HDFS enters safe mode for a scaling operation but then cannot exit safe mode because the required number of nodes is not detected for replication, the cluster can become stuck in safe mode.

Example errors when safe mode is turned on

org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive/hive/819c215c-6d87-4311-97c8-4f0b9d2adcf0. Name node is in safe mode.
org.apache.http.conn.HttpHostConnectException: Connect to active-headnode-name.servername.internal.chinacloudapp.cn:10001 [active-headnode-name.servername.internal.chinacloudapp.cn/1.1.1.1] failed: Connection refused

You can review the name node logs in the /var/log/hadoop/hdfs/ folder, near the time when the cluster was scaled, to see when it entered safe mode. The log files are named Hadoop-hdfs-namenode-<active-headnode-name>.*.

The root cause of the previous errors is that Hive depends on temporary files in HDFS while running queries. When HDFS enters safe mode, Hive cannot run queries because it cannot write to HDFS. The temp files in HDFS are located on the local drive mounted to the individual worker node VMs, and are replicated among the other worker nodes with a minimum of three replicas.

How to prevent HDInsight from getting stuck in safe mode

There are several ways to prevent HDInsight from being left in safe mode:

  • Stop all Hive jobs before scaling down HDInsight. Alternatively, schedule the scale-down process so that it does not conflict with running Hive jobs.
  • Manually clean up Hive's scratch tmp directory files in HDFS before scaling down.
  • Scale HDInsight down to no fewer than three worker nodes. Avoid going as low as one worker node.
  • Run the command to leave safe mode, if needed.

The following sections describe these options.

Stop all Hive jobs

Stop all Hive jobs before scaling down to one worker node. If your workload is scheduled, run the scale-down after the Hive work is done.

Stopping the Hive jobs before scaling helps minimize the number of scratch files in the tmp folder (if any).

Manually clean up Hive's scratch files

If Hive has left behind temporary files, you can manually clean up those files before scaling down to avoid safe mode.

  1. Check which location is being used for Hive temporary files by looking at the hive.exec.scratchdir configuration property. This parameter is set in /etc/hive/conf/hive-site.xml:

    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://mycluster/tmp/hive</value>
    </property>
    
  2. Stop Hive services and make sure all queries and jobs are completed.

  3. List the contents of the scratch directory found above, hdfs://mycluster/tmp/hive/, to see if it contains any files:

    hadoop fs -ls -R hdfs://mycluster/tmp/hive/hive
    

    Here is a sample output when files exist:

    sshuser@scalin:~$ hadoop fs -ls -R hdfs://mycluster/tmp/hive/hive
    drwx------   - hive hdfs          0 2017-07-06 13:40 hdfs://mycluster/tmp/hive/hive/4f3f4253-e6d0-42ac-88bc-90f0ea03602c
    drwx------   - hive hdfs          0 2017-07-06 13:40 hdfs://mycluster/tmp/hive/hive/4f3f4253-e6d0-42ac-88bc-90f0ea03602c/_tmp_space.db
    -rw-r--r--   3 hive hdfs         27 2017-07-06 13:40 hdfs://mycluster/tmp/hive/hive/4f3f4253-e6d0-42ac-88bc-90f0ea03602c/inuse.info
    -rw-r--r--   3 hive hdfs          0 2017-07-06 13:40 hdfs://mycluster/tmp/hive/hive/4f3f4253-e6d0-42ac-88bc-90f0ea03602c/inuse.lck
    drwx------   - hive hdfs          0 2017-07-06 20:30 hdfs://mycluster/tmp/hive/hive/c108f1c2-453e-400f-ac3e-e3a9b0d22699
    -rw-r--r--   3 hive hdfs         26 2017-07-06 20:30 hdfs://mycluster/tmp/hive/hive/c108f1c2-453e-400f-ac3e-e3a9b0d22699/inuse.info
    
  4. If you know Hive is done with these files, you can remove them. Make sure that Hive does not have any queries running by checking the YARN ResourceManager UI page.

    Example command line to remove files from HDFS:

    hadoop fs -rm -r -skipTrash hdfs://mycluster/tmp/hive/
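
Step 1 of the procedure above can also be done programmatically. The Python sketch below reads hive.exec.scratchdir from the contents of hive-site.xml using the standard library XML parser, then builds the listing command from step 3. The sample XML wraps the property shown earlier in a `<configuration>` root element, which is an assumption about the file layout; on a cluster you would read /etc/hive/conf/hive-site.xml instead.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_property(hive_site_xml: str, name: str) -> Optional[str]:
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Sample content mirroring the property shown in step 1.
SAMPLE = """<configuration>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://mycluster/tmp/hive</value>
    </property>
</configuration>"""

scratch_dir = read_property(SAMPLE, "hive.exec.scratchdir")
# Command from step 3, pointed at the directory that was just read.
list_command = ["hadoop", "fs", "-ls", "-R", scratch_dir]
```

Reading the directory from the configuration avoids hard-coding a path that may differ between clusters.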
    

Scale HDInsight to three or more worker nodes

If your clusters frequently get stuck in safe mode when scaling down to fewer than three worker nodes, and the previous steps don't work, you can avoid safe mode altogether by keeping at least three worker nodes.

Retaining three worker nodes is more costly than scaling down to only one worker node, but it prevents your cluster from getting stuck in safe mode.
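
In an automated scale-down script, the three-node floor can be enforced with a small guard. This Python sketch is only an illustration of that rule; the function name is hypothetical and the default of three comes from the guidance in this section.

```python
MINIMUM_SAFE_WORKERS = 3  # floor recommended in this section


def clamp_scale_down_target(requested: int,
                            minimum: int = MINIMUM_SAFE_WORKERS) -> int:
    """Return the requested worker count, raised to the minimum if it is lower."""
    if requested < 1:
        raise ValueError("requested worker count must be at least 1")
    return max(requested, minimum)
```

For example, a request to scale down to one worker node would be raised to three, while a request for five nodes passes through unchanged.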

Scale HDInsight down to one worker node

Even when the cluster is scaled down to one node, worker node 0 still survives. Worker node 0 can never be decommissioned.

Run the command to leave safe mode

The final option is to execute the leave safe mode command. If you know that HDFS entered safe mode because of Hive file under-replication, you can execute the following command to leave safe mode:

hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode leave
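
Before forcing HDFS out of safe mode, it can help to confirm that safe mode is actually on. The Python sketch below assumes the input is the text printed by `hdfs dfsadmin -safemode get`, which reports "Safe mode is ON" or "Safe mode is OFF"; the assumption that a cluster with more than one name node prints one such line per name node should be checked against your own cluster's output.

```python
def is_safe_mode_on(report: str) -> bool:
    """Return True if any line of the report says safe mode is ON."""
    return any("safe mode is on" in line.lower()
               for line in report.splitlines())
```

A script could capture the command output with `subprocess.run(..., capture_output=True, text=True)` and run the leave command only when this function returns True.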

Scale down an Apache HBase cluster

Region servers are automatically balanced within a few minutes after a scaling operation completes. To manually balance region servers, complete the following steps:

  1. Connect to the HDInsight cluster using SSH. For more information, see Use SSH with HDInsight.

  2. Start the HBase shell:

    hbase shell
    
  3. Use the following command to manually balance the region servers:

    balancer
    

Next steps