Capacity planning for HDInsight clusters

Before deploying an HDInsight cluster, plan for the intended cluster capacity by determining the performance and scale you need. This planning helps optimize both usability and costs. Some cluster capacity decisions can't be changed after deployment. If the performance parameters change, a cluster can be dismantled and re-created without losing stored data.

The key questions to ask for capacity planning are:

  • In which geographic region should you deploy your cluster?
  • How much storage do you need?
  • What cluster type should you deploy?
  • What size and type of virtual machine (VM) should your cluster nodes use?
  • How many worker nodes should your cluster have?

Choose an Azure region

The Azure region determines where your cluster is physically provisioned. To minimize the latency of reads and writes, the cluster should be near your data.

HDInsight is available in many Azure regions.

Choose storage location and size

Location of default storage

The default storage, either an Azure Storage account or Azure Data Lake Storage, must be in the same location as your cluster. Azure Storage is available at all locations.

Location of existing data

If you want to use an existing storage account or Data Lake Storage as your cluster's default storage, you must deploy your cluster at that same location.

Storage size

On a deployed cluster, you can attach additional Azure Storage accounts or access other Data Lake Storage. All your storage accounts must be in the same location as your cluster. A Data Lake Storage can be in a different location, though great distances may introduce some latency.

Azure Storage has some capacity limits.

A cluster can access a combination of different storage accounts. Typical examples include:

  • When the amount of data is likely to exceed the storage capacity of a single blob storage container.
  • When the rate of access to the blob container might exceed the threshold where throttling occurs.
  • When you want to make data that you've already uploaded to a blob container available to the cluster.
  • When you want to isolate different parts of the storage for security reasons, or to simplify administration.

For better performance, use only one container per storage account.
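The first two cases above (capacity and throttling) can be turned into a rough estimate of how many storage accounts to plan for. The following is a minimal sketch, not an official sizing formula: the function name and both limit values are hypothetical placeholders, so substitute the limits published for your own subscription and account type.

```python
import math


def storage_accounts_needed(
    data_tib: float,
    peak_requests_per_sec: float,
    capacity_limit_tib: float,
    request_limit_per_sec: float,
) -> int:
    """Estimate the number of storage accounts needed.

    Takes the larger of two counts: accounts needed to hold the data,
    and accounts needed to keep the request rate below the throttling
    threshold. Always returns at least 1.
    """
    for_capacity = math.ceil(data_tib / capacity_limit_tib)
    for_requests = math.ceil(peak_requests_per_sec / request_limit_per_sec)
    return max(1, for_capacity, for_requests)


# Hypothetical limits, for illustration only.
accounts = storage_accounts_needed(
    data_tib=1200.0,
    peak_requests_per_sec=30_000.0,
    capacity_limit_tib=500.0,
    request_limit_per_sec=20_000.0,
)
print(accounts)  # 3: capacity needs 3 accounts, request rate needs 2
```

With one container per account, as recommended above, the result is also the number of containers to plan for.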

Choose a cluster type

The cluster type determines the workload your HDInsight cluster is configured to run. Types include Apache Hadoop, Apache Storm, Apache Kafka, and Apache Spark. For a detailed description of the available cluster types, see Introduction to Azure HDInsight. Each cluster type has a specific deployment topology that includes requirements for the size and number of nodes.

Choose the VM size and type

Each cluster type has a set of node types, and each node type has specific options for its VM size and type.

To determine the optimal cluster size for your application, you can benchmark cluster capacity and increase the size as indicated. For example, you can use a simulated workload or a canary query. Run your simulated workloads on clusters of different sizes, gradually increasing the size until the intended performance is reached. A canary query can be inserted periodically among the other production queries to show whether the cluster has enough resources.
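The "increase the size until the intended performance is reached" procedure can be sketched as a simple search over candidate sizes. This is an illustrative sketch only: `smallest_sufficient_size` and `fake_benchmark` are hypothetical names, and the timing model stands in for actually running a simulated workload or canary query on a cluster of each size.

```python
from typing import Callable, Iterable, Optional


def smallest_sufficient_size(
    candidate_sizes: Iterable[int],
    run_benchmark: Callable[[int], float],
    target_seconds: float,
) -> Optional[int]:
    """Return the smallest worker count whose benchmark run time meets
    the target, or None if no candidate is fast enough."""
    for workers in sorted(candidate_sizes):
        elapsed = run_benchmark(workers)
        if elapsed <= target_seconds:
            return workers
    return None


def fake_benchmark(workers: int) -> float:
    """Hypothetical stand-in for a real benchmark run: 60 s of fixed
    overhead plus 1200 s of work split evenly across the workers."""
    return 60.0 + 1200.0 / workers


chosen = smallest_sufficient_size([2, 4, 8, 16], fake_benchmark, target_seconds=240.0)
print(chosen)  # 8: run times are 660 s, 360 s, then 210 s
```

In practice, `run_benchmark` would provision or resize a test cluster, submit the workload, and return the measured run time.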

For more information on how to choose the right VM family for your workload, see Selecting the right VM size for your cluster.

Choose the cluster scale

A cluster's scale is determined by the quantity of its VM nodes. For all cluster types, there are node types that have a specific scale, and node types that support scale-out. For example, a cluster may require exactly three Apache ZooKeeper nodes or two head nodes. Workloads that process data in a distributed fashion benefit from additional worker nodes.

Depending on your cluster type, increasing the number of worker nodes adds computational capacity (such as more cores). More nodes also increase the total memory available across the cluster for in-memory storage of the data being processed. As with the choice of VM size and type, the right cluster scale is typically found empirically, using simulated workloads or canary queries.
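Before running empirical tests, a back-of-the-envelope calculation can give a starting worker count. The sketch below assumes that cores and memory scale linearly with the number of workers. The VM specification and the `usable_fraction` value are hypothetical, not figures for any particular HDInsight VM size.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    cores: int
    memory_gib: float


def worker_capacity(spec: NodeSpec, worker_count: int) -> tuple[int, float]:
    """Total cores and memory (GiB) across all worker nodes."""
    return worker_count * spec.cores, worker_count * spec.memory_gib


def workers_for_in_memory(dataset_gib: float, spec: NodeSpec, usable_fraction: float) -> int:
    """Workers needed to hold a dataset in memory, assuming only a
    fraction of each node's memory is usable for data."""
    return math.ceil(dataset_gib / (spec.memory_gib * usable_fraction))


spec = NodeSpec(cores=8, memory_gib=56.0)  # hypothetical VM size

print(worker_capacity(spec, 4))  # (32, 224.0)
print(workers_for_in_memory(500.0, spec, usable_fraction=0.6))  # 15
```

Treat the result as a first guess to validate with simulated workloads or canary queries, not as a final size.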

You can scale out your cluster to meet peak load demands, then scale it back down when those extra nodes are no longer needed. The Autoscale feature allows you to automatically scale your cluster based upon predetermined metrics and timings. For more information on scaling your clusters manually, see Scale HDInsight clusters.

Cluster lifecycle

You're charged for a cluster's lifetime. If you need your cluster only at specific times, create on-demand clusters using Azure Data Factory. You can also create PowerShell scripts that provision and delete your cluster, and then schedule those scripts using Azure Automation.
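Because charges accrue for as long as the cluster exists, the saving from an on-demand cluster can be estimated with simple arithmetic. The sketch below is illustrative only: the hourly rate, node count, and usage pattern are made-up values, and the model ignores storage charges and any price difference between node types.

```python
def cluster_cost(hourly_rate_per_node: float, node_count: int, hours: float) -> float:
    """Compute cost for a cluster that exists for the given number of hours."""
    return hourly_rate_per_node * node_count * hours


RATE = 0.50  # hypothetical price per node-hour
NODES = 6    # hypothetical total node count

always_on = cluster_cost(RATE, NODES, hours=24 * 30)  # kept running all month
on_demand = cluster_cost(RATE, NODES, hours=4 * 22)   # 4 hours on 22 working days

print(always_on)  # 2160.0
print(on_demand)  # 264.0
```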


When a cluster is deleted, its default Hive metastore is also deleted. To persist the metastore for the next cluster re-creation, use an external metadata store such as Azure Database or Apache Oozie.

Isolate cluster job errors

Sometimes errors can occur because of the parallel execution of multiple map and reduce components on a multi-node cluster. To help isolate the issue, try distributed testing. Run multiple jobs concurrently on a cluster with a single worker node. Then expand this approach to run multiple jobs concurrently on clusters containing more than one node. To create a single-node HDInsight cluster in Azure, use the Custom(size, settings, apps) option and use a value of 1 for Number of Worker nodes in the Cluster size section when provisioning a new cluster in the portal.


For more information on managing subscription quotas, see Requesting quota increases.

Next steps