Migrate on-premises Apache Hadoop clusters to Azure HDInsight - infrastructure best practices

This article gives recommendations for managing the infrastructure of Azure HDInsight clusters. It is part of a series of best practices for migrating on-premises Apache Hadoop systems to Azure HDInsight.

Plan for HDInsight cluster capacity

The key choices to make for HDInsight cluster capacity planning are the following:

Region
The Azure region determines where the cluster is physically provisioned. To minimize the latency of reads and writes, the cluster should be in the same region as the data.

Storage location and size
The default storage must be in the same region as the cluster. For a 48-node cluster, 4 to 8 storage accounts are recommended. Even when the total storage is already sufficient, each additional storage account provides more networking bandwidth for the compute nodes. When there are multiple storage accounts, give each one a random name without a prefix. Random naming reduces the chance of storage bottlenecks (throttling) or common-mode failures across all accounts. For better performance, use only one container per storage account.
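The naming advice above can be sketched in a few lines of shell. This is only an illustration: the function names and the default count of 4 are assumptions, and the generated names are candidates that still have to be checked for global uniqueness when the accounts are created. Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits.

```shell
#!/usr/bin/env bash
# Sketch: generate random storage account names that share no common prefix.
# Function names and defaults here are illustrative, not part of any Azure tool.

# Print one random name of the requested length (default 24).
random_storage_account_name() {
  local length="${1:-24}"
  # LC_ALL=C keeps tr byte-oriented while it filters random bytes.
  LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c "$length"
  echo
}

# Print COUNT candidate names, one per line (default 4).
generate_storage_account_names() {
  local count="${1:-4}"
  local i=0
  while [ "$i" -lt "$count" ]; do
    random_storage_account_name 24
    i=$((i + 1))
  done
}
```

For example, `generate_storage_account_names 6` prints six 24-character candidate names.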

VM size and type (now supports the G-series)
Each cluster type has a set of node types, and each node type has specific options for its VM size and type. The VM size and type are determined by CPU processing power, RAM size, and network latency. A simulated workload can be used to determine the optimal VM size and type for each node type.

Number of worker nodes
The initial number of worker nodes can be determined using simulated workloads. The cluster can be scaled later by adding more worker nodes to meet peak load demands, and scaled back when the additional worker nodes are no longer required.

For more information, see the article Capacity planning for HDInsight clusters.

For the recommended virtual machine types for each type of HDInsight cluster, see Default node configuration and virtual machine sizes for clusters.

Check Hadoop components availability in HDInsight

Each HDInsight version is a cloud distribution of a set of Hadoop ecosystem components. See HDInsight Component Versioning for details on all HDInsight components and their current versions.

You can also use the Apache Ambari UI or the Ambari REST API to check the Hadoop components and versions in HDInsight.
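As a rough sketch of the REST option, the following shell functions query Ambari for the services registered on a cluster. The function names are hypothetical. The sketch also assumes that the cluster uses the azurehdinsight.cn DNS suffix that appears elsewhere in this article, that the cluster login is named admin, and that curl is installed.

```shell
#!/usr/bin/env bash
# Sketch: list the services Ambari reports for an HDInsight cluster.
# Assumptions: DNS suffix azurehdinsight.cn, cluster login "admin", curl available.

# Build the Ambari REST base URL for a cluster.
ambari_base_url() {
  local cluster_name="$1"
  echo "https://${cluster_name}.azurehdinsight.cn/api/v1/clusters/${cluster_name}"
}

# Request the list of services; curl prompts for the password
# because only the user name is passed to -u.
list_cluster_services() {
  local cluster_name="$1"
  local user="${2:-admin}"
  curl -sS -u "$user" "$(ambari_base_url "$cluster_name")/services"
}
```

Calling `list_cluster_services mycluster` (with a real cluster name in place of the placeholder) returns a JSON document describing the installed services.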

Applications or components that were available in on-premises clusters but aren't part of HDInsight clusters can be added on an edge node or on a VM in the same VNet as the HDInsight cluster. A third-party Hadoop application that isn't available on Azure HDInsight can be installed using the "Applications" option in the HDInsight cluster. Custom Hadoop applications can be installed on an HDInsight cluster using "script actions". The following table lists some of the common applications and their HDInsight integration options:

Application | Integration
Airflow | IaaS or HDInsight Edge node
Alluxio | IaaS
Arcadia | IaaS
Atlas | None (Only HDP)
Datameer | HDInsight Edge node
Datastax (Cassandra) | IaaS (Cosmos DB is an alternative on Azure)
DataTorrent | IaaS
Drill | IaaS
Ignite | IaaS
Jethro | IaaS
Mapador | IaaS
Mongo | IaaS (Cosmos DB is an alternative on Azure)
NiFi | IaaS
Presto | IaaS or HDInsight Edge node
Python 2 | PaaS
Python 3 | PaaS
R | PaaS
SAS | IaaS
Vertica | IaaS (SQL Data Warehouse is an alternative on Azure)
Tableau | IaaS
Waterline | HDInsight Edge node
StreamSets | HDInsight Edge node
Palantir | IaaS
Sailpoint | IaaS

For more information, see the article Apache Hadoop components available with different HDInsight versions.

Customize HDInsight clusters using script actions

HDInsight provides a method of cluster configuration called a script action. A script action is a Bash script that runs on the nodes in an HDInsight cluster and can be used to install additional components and change configuration settings.

Script actions must be stored at a URI that is accessible from the HDInsight cluster. They can be used during or after cluster creation and can be restricted to run only on certain node types.

A script can be persisted or executed once. Persisted scripts are used to customize new worker nodes added to the cluster through scaling operations. A persisted script might also apply changes to another node type, such as a head node, when scaling operations occur.
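To make the idea concrete, here is a minimal sketch of what a script action body might look like. It is not one of the pre-written HDInsight scripts. It assumes that the nodes are Debian- or Ubuntu-based with apt-get, that the script runs with root privileges, and that jq stands in for whatever package you actually need. The DRY_RUN switch is an illustrative safety measure so the script only prints what it would do until you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch of a script action body (assumptions: apt-get based nodes, root privileges).

PACKAGE_NAME="${PACKAGE_NAME:-jq}"   # hypothetical example package
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print commands, 0 = execute them

# Run a command, or just print it when in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

install_package() {
  # Stay idempotent: a persisted script may run again during scaling operations,
  # so skip the install when the command is already present on this node.
  if command -v "$PACKAGE_NAME" >/dev/null 2>&1; then
    echo "$PACKAGE_NAME is already installed on $(hostname)"
    return 0
  fi
  run apt-get update
  run apt-get install -y "$PACKAGE_NAME"
}

install_package
```

Keeping the script idempotent matters mostly for persisted scripts, because the same script may be applied more than once over the life of the cluster.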

HDInsight provides pre-written scripts for the following tasks on HDInsight clusters:

  • Add an Azure Storage account
  • Install Hue
  • Install Presto
  • Install Solr
  • Install Giraph
  • Pre-load Hive libraries
  • Install or update Mono

Note

HDInsight does not provide direct support for custom Hadoop components or components installed using script actions.

Script actions can also be published to the Azure Marketplace as an HDInsight application.

Customize HDInsight configs using Bootstrap

Settings in configuration files such as core-site.xml, hive-site.xml, and oozie-env.xml can be changed using Bootstrap. The following script is an example that uses the PowerShell Az module cmdlet New-AzHDInsightClusterConfig:

# hive-site.xml configuration
# Assumes $defaultStorageAccountName, $defaultStorageAccountKey, $existingResourceGroupName,
# $clusterName, $location, $clusterSizeInNodes, and $httpCredential are already defined.
$hiveConfigValues = @{"hive.metastore.client.socket.timeout"="90"}

# Build the cluster configuration: default storage plus the Hive overrides
$config = New-AzHDInsightClusterConfig `
    | Set-AzHDInsightDefaultStorage `
        -StorageAccountName "$defaultStorageAccountName.blob.core.chinacloudapi.cn" `
        -StorageAccountKey $defaultStorageAccountKey `
    | Add-AzHDInsightConfigValues `
        -HiveSite $hiveConfigValues

# Create the cluster with the customized configuration
New-AzHDInsightCluster `
    -ResourceGroupName $existingResourceGroupName `
    -ClusterName $clusterName `
    -Location $location `
    -ClusterSizeInNodes $clusterSizeInNodes `
    -ClusterType Hadoop `
    -OSType Linux `
    -Version "3.6" `
    -HttpCredential $httpCredential `
    -Config $config

For more information, see the article Customize HDInsight clusters using Bootstrap. See also Manage HDInsight clusters by using the Apache Ambari REST API.

Access client tools from HDInsight Hadoop cluster edge nodes

An empty edge node is a Linux virtual machine with the same client tools installed and configured as on the head nodes, but with no Hadoop services running. The edge node can be used for the following purposes:

  • Accessing the cluster
  • Testing client applications
  • Hosting client applications

Edge nodes can be created and deleted through the Azure portal, either during or after cluster creation. After the edge node has been created, you can connect to it using SSH and run client tools to access the Hadoop cluster in HDInsight. The edge node SSH endpoint is <EdgeNodeName>.<ClusterName>-ssh.azurehdinsight.cn:22.
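The endpoint pattern above can be assembled with a small helper. The function names and the sample node, cluster, and user names are placeholders chosen for illustration. Only the host name pattern and port 22 come from the text above.

```shell
#!/usr/bin/env bash
# Sketch: build the edge node SSH host name and connect to it.
# Function names and argument values are illustrative placeholders.

# Print <EdgeNodeName>.<ClusterName>-ssh.azurehdinsight.cn
edge_node_ssh_host() {
  local edge_node_name="$1"
  local cluster_name="$2"
  echo "${edge_node_name}.${cluster_name}-ssh.azurehdinsight.cn"
}

# Open an interactive SSH session to the edge node on port 22.
connect_to_edge_node() {
  local ssh_user="$1"
  local edge_node_name="$2"
  local cluster_name="$3"
  ssh -p 22 "${ssh_user}@$(edge_node_ssh_host "$edge_node_name" "$cluster_name")"
}
```

For example, `connect_to_edge_node sshuser new-edgenode mycluster` would open a session to new-edgenode.mycluster-ssh.azurehdinsight.cn.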

For more information, see the article Use empty edge nodes on Apache Hadoop clusters in HDInsight.

Use scale-up and scale-down feature of clusters

HDInsight provides elasticity by giving you the option to scale the number of worker nodes in your clusters up and down. This feature allows you to shrink a cluster after hours or on weekends and expand it during peak business demand.
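One way to picture the after-hours pattern is a small scheduling helper that picks a worker count from the day and hour and passes it to the Azure CLI. The business hours (08:00 to 18:00, Monday to Friday), the node counts, and the function names are assumptions made for this sketch. It also assumes the Azure CLI is installed and signed in.

```shell
#!/usr/bin/env bash
# Sketch: choose a worker node count by schedule, then resize with the Azure CLI.
# Business hours and node counts below are illustrative assumptions.

# Arguments: day of week (1=Mon..7=Sun, as printed by `date +%u`), hour (0-23),
# optional peak count (default 10), optional off-peak count (default 3).
target_worker_count() {
  local day_of_week="$1"
  local hour="$2"
  local peak_count="${3:-10}"
  local off_peak_count="${4:-3}"
  if [ "$day_of_week" -ge 6 ] || [ "$hour" -lt 8 ] || [ "$hour" -ge 18 ]; then
    echo "$off_peak_count"
  else
    echo "$peak_count"
  fi
}

# Resize the cluster to the given number of worker nodes.
resize_cluster() {
  local resource_group="$1"
  local cluster_name="$2"
  local count="$3"
  az hdinsight resize \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --workernode-count "$count"
}
```

A scheduled job could then run `resize_cluster my-rg mycluster "$(target_worker_count "$(date +%u)" "$(date +%H)")"`, where my-rg and mycluster are placeholders.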

Use HDInsight with Azure Virtual Network

Azure Virtual Networks filter and route network traffic so that Azure resources, such as Azure Virtual Machines, can communicate securely with each other, the internet, and on-premises networks.

Using Azure Virtual Network with HDInsight enables the following scenarios:

  • Connecting to HDInsight directly from an on-premises network.
  • Connecting HDInsight to data stores in an Azure Virtual Network.
  • Directly accessing Hadoop services that aren't available publicly over the internet. For example, Kafka APIs or the HBase Java API.

HDInsight can be added to either a new or an existing Azure Virtual Network. If HDInsight is added to an existing virtual network, the existing network security groups and user-defined routes need to be updated to allow unrestricted access to several IP addresses in the Azure data center.

Note

HDInsight does not currently support forced tunneling. Forced tunneling is a subnet setting that forces outbound internet traffic to a device for inspection and logging. Either remove forced tunneling before installing HDInsight into a subnet, or create a new subnet for HDInsight. HDInsight also does not support restricting outbound network connectivity.

Securely connect to Azure services with Azure Virtual Network service endpoints

HDInsight supports virtual network service endpoints, which allow you to securely connect to Azure Blob Storage, Azure Data Lake Storage Gen2, Cosmos DB, and SQL databases. When a service endpoint is enabled for Azure HDInsight, traffic flows through a secured route within the Azure data center. With this enhanced level of security at the networking layer, you can lock down big data storage accounts to their specified virtual networks (VNets) and still use HDInsight clusters seamlessly to access and process that data.

Connect HDInsight to the on-premises network

HDInsight can be connected to the on-premises network by using Azure Virtual Networks and a VPN gateway. The following steps can be used to establish connectivity:

  • Use HDInsight in an Azure Virtual Network that has connectivity to the on-premises network.
  • Configure DNS name resolution between the virtual network and the on-premises network.
  • Configure network security groups or user-defined routes (UDR) to control network traffic.

For more information, see the article Connect HDInsight to your on-premises network.
