Migrate on-premises Apache Hadoop clusters to Azure HDInsight - architecture best practices

This article gives recommendations for the architecture of Azure HDInsight systems. It's part of a series that provides best practices to assist with migrating on-premises Apache Hadoop systems to Azure HDInsight.

Use multiple workload-optimized clusters

Many on-premises Apache Hadoop deployments consist of a single large cluster that supports many workloads. This single cluster can be complex and may require compromises to the individual services to make everything work together. Migrating on-premises Hadoop clusters to Azure HDInsight requires a change in approach.

Azure HDInsight clusters are designed for a specific type of compute usage. Because storage can be shared across multiple clusters, it's possible to create multiple workload-optimized compute clusters to meet the needs of different jobs. Each cluster type has the optimal configuration for that specific workload. The following table lists the supported cluster types in HDInsight and the corresponding workloads.

| Workload | HDInsight cluster type |
|---|---|
| Batch processing (ETL / ELT) | Hadoop, Spark |
| Data warehousing | Hadoop, Spark, Interactive Query |
| IoT / Streaming | Kafka, Storm, Spark |
| NoSQL transactional processing | HBase |
| Interactive and faster queries with in-memory caching | Interactive Query |
| Data science | Spark |

The following table shows the different methods that can be used to create an HDInsight cluster.

| Tool | Browser based | Command line | REST API | SDK |
|---|---|---|---|---|
| Azure portal | X | | | |
| Azure Data Factory | X | X | X | X |
| Azure CLI (ver 1.0) | | X | | |
| Azure PowerShell | | X | | |
| cURL | | X | X | |
| .NET SDK | | | | X |
| Python SDK | | | | X |
| Java SDK | | | | X |
| Azure Resource Manager templates | | X | | |

For more information, see the article Cluster types in HDInsight.
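As a rough sketch of the command-line route, a workload-optimized Spark cluster can be created with the Azure CLI. All resource names below (resource group, cluster, storage account) are placeholders, and the exact flags should be verified against `az hdinsight create --help` for your CLI version:

```shell
# Create a Spark-optimized HDInsight cluster (names are placeholders).
az hdinsight create \
    --name sparkcluster001 \
    --resource-group rg-hadoop-migration \
    --type spark \
    --http-user admin \
    --http-password 'YourStrongP@ssw0rd1' \
    --storage-account hdistorage \
    --storage-container sparkcluster001
```

Creating a second cluster with `--type hadoop` or `--type kafka` against the same subscription follows the same pattern, which is how a single large on-premises cluster maps onto several workload-optimized ones.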

Use transient on-demand clusters

HDInsight clusters may go unused for long periods of time. To help save on resource costs, HDInsight supports on-demand transient clusters, which can be deleted once the workload has been successfully completed.

When you delete a cluster, the associated storage accounts and external metadata are not removed. The cluster can later be re-created using the same storage accounts and metastores.

Azure Data Factory can be used to schedule the creation of on-demand HDInsight clusters. For more information, see the article Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory.
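The transient pattern can also be driven directly from the Azure CLI: create the cluster, run the workload, then delete the cluster while the storage and any external metastore survive. This is an illustrative sketch with placeholder names, not a complete pipeline:

```shell
# Create a transient Hadoop cluster for a batch ETL run (names are placeholders).
az hdinsight create \
    --name etl-cluster \
    --resource-group rg-hadoop-migration \
    --type hadoop \
    --http-user admin --http-password 'YourStrongP@ssw0rd1' \
    --storage-account hdistorage --storage-container etl-run

# ... submit the ETL jobs and wait for them to finish here ...

# Deleting the cluster stops compute billing; the storage account and any
# external Hive/Oozie metastore are retained and can back the next run.
az hdinsight delete --name etl-cluster --resource-group rg-hadoop-migration --yes
```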

Decouple storage from compute

Typical on-premises Hadoop deployments use the same set of machines for data storage and data processing. Because they're colocated, compute and storage must be scaled together.

On HDInsight clusters, storage doesn't need to be colocated with compute and can be in Azure Storage, Azure Data Lake Storage, or both. Decoupling storage from compute has the following benefits:

  • Data sharing across clusters
  • Use of transient clusters, since the data isn't dependent on the cluster
  • Reduced storage cost
  • Scaling storage and compute separately
  • Data replication across regions

Create clusters close to the storage account resources, in the same Azure region, to mitigate the performance cost of separating compute and storage. High-speed networks make it efficient for the compute nodes to access the data inside Azure Storage.
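Because storage is decoupled, any cluster attached to the same storage account can read data by its full URI, regardless of which cluster wrote it. A minimal sketch, with a placeholder container and account name, run from a cluster head node:

```shell
# List shared data in an Azure Storage container by its full wasbs:// URI.
# Any HDInsight cluster with access to the 'hdistorage' account (placeholder)
# can read this path, even if a different cluster produced the data.
hdfs dfs -ls 'wasbs://shared-data@hdistorage.blob.core.windows.net/raw/'
```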

Use external metadata stores

There are two main metastores that work with HDInsight clusters: Apache Hive and Apache Oozie. The Hive metastore is the central schema repository that can be used by data processing engines including Hadoop, Spark, LLAP, Presto, and Apache Pig. The Oozie metastore stores details about the scheduling and status of in-progress and completed Hadoop jobs.

HDInsight uses Azure SQL Database for the Hive and Oozie metastores. There are two ways to set up a metastore in HDInsight clusters:

  1. Default metastore

    • No additional cost
    • The metastore is deleted when the cluster is deleted
    • The metastore can't be shared among different clusters
    • Uses a Basic tier Azure SQL Database, which has a 5-DTU limit
  2. Custom external metastore

    • Specify an external Azure SQL Database as the metastore.
    • Clusters can be created and deleted without losing metadata, including Hive schema and Oozie job details.
    • A single metastore database can be shared with different types of clusters.
    • The metastore can be scaled up as needed.
    • For more information, see Use external metadata stores in Azure HDInsight.
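A custom external metastore can be attached at cluster-creation time. The sketch below assumes an existing Azure SQL server and database; server, database, and credential names are placeholders, and the metastore flag names should be confirmed against `az hdinsight create --help` for your CLI version:

```shell
# Create an Interactive Query cluster wired to an existing external Hive
# metastore in Azure SQL Database (all names are placeholders).
az hdinsight create \
    --name llap-cluster \
    --resource-group rg-hadoop-migration \
    --type interactivehive \
    --http-user admin --http-password 'YourStrongP@ssw0rd1' \
    --storage-account hdistorage --storage-container llap-cluster \
    --hive-metastore-server-name metastoresql.database.windows.net \
    --hive-metastore-db-name hivemetastoredb \
    --hive-metastore-db-user-name metastoreadmin \
    --hive-metastore-db-password 'MetastoreP@ssw0rd1'
```

Deleting `llap-cluster` later leaves `hivemetastoredb` intact, so a replacement cluster picks up the same Hive schemas.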

Best practices for the Hive metastore

Some HDInsight Hive metastore best practices are as follows:

  • Use a custom external metastore to separate compute resources and metadata.
  • Start with an S2 tier Azure SQL instance, which provides 50 DTUs and 250 GB of storage. If you see a bottleneck, you can scale the database up.
  • Don't share the metastore created for one HDInsight cluster version with clusters of a different version. Different Hive versions use different schemas; for example, a metastore can't be shared with both Hive 1.2 and Hive 2.1 clusters.
  • Back up the custom metastore periodically.
  • Keep the metastore and the HDInsight cluster in the same region.
  • Monitor the metastore for performance and availability using Azure SQL Database monitoring tools, such as the Azure portal or Azure Monitor logs.
  • Execute the ANALYZE TABLE command as required to generate statistics for tables and columns. For example: ANALYZE TABLE [table_name] COMPUTE STATISTICS.
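The statistics commands above can be run from Beeline on a cluster head node. The JDBC connection string and the `sales` table are placeholders for illustration:

```shell
# Refresh table- and column-level statistics so the Hive cost-based
# optimizer works from current metadata (table name is a placeholder).
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -e "
  ANALYZE TABLE sales COMPUTE STATISTICS;
  ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
"
```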

Best practices for different workloads

  • Consider using LLAP clusters for interactive Hive queries with improved response times. LLAP is a feature in Hive 2.0 that allows in-memory caching of queries. LLAP makes Hive queries much faster, up to 26x faster than Hive 1.x in some cases.
  • Consider using Spark jobs in place of Hive jobs.
  • Consider replacing Impala-based queries with LLAP queries.
  • Consider replacing MapReduce jobs with Spark jobs.
  • Consider replacing low-latency Spark batch jobs with Spark Structured Streaming jobs.
  • Consider using Azure Data Factory (ADF) 2.0 for data orchestration.
  • Consider Ambari for cluster management.
  • Change data storage from on-premises HDFS to WASB, ADLS, or ABFS for processing scripts.
  • Consider using Ranger RBAC on Hive tables and auditing.
  • Consider using Azure Cosmos DB in place of MongoDB or Cassandra.

Next steps

Read the next article in this series: