Migrate on-premises Apache Hadoop clusters to Azure HDInsight - motivation and benefits

This article is the first in a series on best practices for migrating on-premises Apache Hadoop ecosystem deployments to Azure HDInsight. This series is for people responsible for the design, deployment, and migration of Apache Hadoop solutions in Azure HDInsight. The roles that may benefit from these articles include cloud architects, Hadoop administrators, and DevOps engineers. Software developers, data engineers, and data scientists should also benefit from the explanation of how different types of clusters work in the cloud.

Why migrate to Azure HDInsight

Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. HDInsight includes the most popular open-source frameworks, such as:

  • Apache Hadoop
  • Apache Spark
  • Apache Hive with LLAP
  • Apache Kafka
  • Apache Storm
  • Apache HBase

Azure HDInsight advantages over on-premises Hadoop

  • Low cost - Costs can be reduced by creating clusters on demand and paying only for what you use. Decoupled compute and storage provide flexibility by keeping the data volume independent of the cluster size.

  • Automated cluster creation - Creating a cluster requires minimal setup and configuration, and the process can be automated for on-demand clusters.

  • Managed hardware and configuration - There's no need to worry about the physical hardware or infrastructure of an HDInsight cluster. Just specify the configuration of the cluster, and Azure sets it up.

  • Easily scalable - HDInsight enables you to scale workloads up or down. Azure takes care of data redistribution and workload rebalancing without interrupting data processing jobs.

  • Secure and compliant - HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.

  • Simplified version management - Azure HDInsight manages the versions of Hadoop ecosystem components and keeps them up to date. Software updates are usually a complex process for on-premises deployments.

  • Smaller clusters optimized for specific workloads with fewer dependencies between components - A typical on-premises Hadoop setup uses a single cluster that serves many purposes. With Azure HDInsight, you can create workload-specific clusters, which removes the complexity of maintaining a single, ever-growing cluster.

  • Productivity - You can use various tools for Hadoop and Spark in your preferred development environment.

  • Extensibility with custom tools or third-party applications - HDInsight clusters can be extended with installed components and can also be integrated with other big data solutions by using one-click deployments from Azure Marketplace.

  • Integration with other Azure services - HDInsight can easily be integrated with other popular Azure services, such as:

    • Azure Data Factory (ADF)
    • Azure Blob Storage
    • Azure Data Lake Storage Gen2
    • Azure Cosmos DB
    • Azure SQL Database
    • Azure Analysis Services
  • Self-healing processes and components - HDInsight constantly checks the infrastructure and open-source components using its own monitoring infrastructure. It also automatically recovers from critical failures, such as unavailability of open-source components and nodes. Alerts are triggered in Ambari if any OSS component fails.
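As a quick illustration of the on-demand, automated cluster creation mentioned above, the sketch below composes an Azure CLI `az hdinsight create` command as a Python argument list so it can be inspected before running. All resource names are placeholders, not values from this article.

```python
# Hedged sketch: compose (but do not execute) an Azure CLI call that would
# create a workload-specific Spark cluster on demand. Every name below is a
# placeholder for illustration only.
cluster_cmd = [
    "az", "hdinsight", "create",
    "--name", "demo-spark-cluster",      # placeholder cluster name
    "--resource-group", "demo-rg",       # placeholder resource group
    "--type", "spark",                   # workload-specific cluster type
    "--workernode-count", "4",
    "--http-user", "admin",
    "--http-password", "<cluster-password>",
    "--storage-account", "demostorage",  # decoupled storage: data outlives the cluster
]

# To actually create the cluster (requires the Azure CLI and a signed-in account):
# import subprocess; subprocess.run(cluster_cmd, check=True)
print(" ".join(cluster_cmd))
```

Because storage is decoupled, the same command pattern can be run in an automation pipeline to create a cluster, process data, and then delete the cluster, paying only for the processing window.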

For more information, see the article What is Azure HDInsight and the Apache Hadoop technology stack.

Migration planning process

The following steps are recommended for planning a migration of on-premises Hadoop clusters to Azure HDInsight:

  1. Understand the current on-premises deployment and topologies.
  2. Understand the current project scope, timelines, and team expertise.
  3. Understand the Azure requirements.
  4. Build out a detailed plan based on best practices.

Gathering details to prepare for a migration

This section provides template questionnaires to help gather important information about:

  • The on-premises deployment
  • Project details
  • Azure requirements

On-premises deployment questionnaire

| Question | Example | Answer |
|---|---|---|
| **Topic**: Environment | | |
| Cluster Distribution version | HDP 2.6.5, CDH 5.7 | |
| Big Data ecosystem components | HDFS, Yarn, Hive, LLAP, Impala, Kudu, HBase, Spark, MapReduce, Kafka, Zookeeper, Solr, Sqoop, Oozie, Ranger, Atlas, Falcon, Zeppelin, R | |
| Cluster types | Hadoop, Spark, Confluent Kafka, Storm, Solr | |
| Number of clusters | 4 | |
| Number of Master Nodes | 2 | |
| Number of Worker Nodes | 100 | |
| Number of Edge Nodes | 5 | |
| Total Disk space | 100 TB | |
| Master Node configuration | m/y, cpu, disk, etc. | |
| Data Nodes configuration | m/y, cpu, disk, etc. | |
| Edge Nodes configuration | m/y, cpu, disk, etc. | |
| HDFS Encryption? | Yes | |
| High Availability | HDFS HA, Metastore HA | |
| Disaster Recovery / Backup | Backup cluster? | |
| Systems that are dependent on the cluster | SQL Server, Teradata, Power BI, MongoDB | |
| Third-party integrations | Tableau, GridGain, Qubole, Informatica, Splunk | |
| **Topic**: Security | | |
| Perimeter security | Firewalls | |
| Cluster authentication & authorization | Active Directory, Ambari, Cloudera Manager, No authentication | |
| HDFS Access Control | Manual, ssh users | |
| Hive authentication & authorization | Sentry, LDAP, AD with Kerberos, Ranger | |
| Auditing | Ambari, Cloudera Navigator, Ranger | |
| Monitoring | Graphite, collectd, statsd, Telegraf, InfluxDB | |
| Alerting | Kapacitor, Prometheus, Datadog | |
| Data Retention duration | 3 years, 5 years | |
| Cluster Administrators | Single Administrator, Multiple Administrators | |
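Questionnaire answers are easier to version and share if they are captured in a structured form. The sketch below records a subset of the on-premises questionnaire as a Python dictionary serialized to JSON; the field names are hypothetical, derived from the questionnaire topics, and the values are the example answers above.

```python
import json

# Illustrative sketch (not part of the article's template): record questionnaire
# answers in a machine-readable form. All field names are hypothetical; the
# values echo the example answers from the questionnaire.
on_prem_profile = {
    "environment": {
        "distribution": "HDP 2.6.5",
        "cluster_types": ["Hadoop", "Spark"],
        "worker_nodes": 100,
        "total_disk_tb": 100,
        "hdfs_encryption": True,
    },
    "security": {
        "authentication": "AD with Kerberos",
        "hive_authorization": "Ranger",
        "auditing": "Ranger",
    },
}

# Serialize so the profile can be checked into source control and shared
# with the migration team.
serialized = json.dumps(on_prem_profile, indent=2)
```

A structured profile like this can later be diffed against the target Azure configuration as the migration plan evolves.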

Project details questionnaire

| Question | Example | Answer |
|---|---|---|
| **Topic**: Workloads and Frequency | | |
| MapReduce jobs | 10 jobs -- twice daily | |
| Hive jobs | 100 jobs -- every hour | |
| Spark batch jobs | 50 jobs -- every 15 minutes | |
| Spark Streaming jobs | 5 jobs -- every 3 minutes | |
| Structured Streaming jobs | 5 jobs -- every minute | |
| ML Model training jobs | 2 jobs -- once a week | |
| Programming Languages | Python, Scala, Java | |
| Scripting | Shell, Python | |
| **Topic**: Data | | |
| Data sources | Flat files, Json, Kafka, RDBMS | |
| Data orchestration | Oozie workflows, Airflow | |
| In-memory lookups | Apache Ignite, Redis | |
| Data destinations | HDFS, RDBMS, Kafka, MPP | |
| **Topic**: Metadata | | |
| Hive DB type | Mysql, Postgres | |
| Number of Hive metastores | 2 | |
| Number of Hive tables | 100 | |
| Number of Ranger policies | 20 | |
| Number of Oozie workflows | 100 | |
| **Topic**: Scale | | |
| Data volume including replication | 100 TB | |
| Daily ingestion volume | 50 GB | |
| Data growth rate | 10% per year | |
| Cluster Nodes growth rate | 5% per year | |
| **Topic**: Cluster utilization | | |
| Average CPU % used | 60% | |
| Average Memory % used | 75% | |
| Disk space used | 75% | |
| Average Network % used | 25% | |
| **Topic**: Staff | | |
| Number of Administrators | 2 | |
| Number of Developers | 10 | |
| Number of end users | 100 | |
| Skills | Hadoop, Spark | |
| Number of available resources for migration efforts | 2 | |
| **Topic**: Limitations | | |
| Current limitations | Latency is high | |
| Current challenges | Concurrency issues | |
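The Scale answers feed directly into capacity planning for the target environment. As a quick illustration (not from the article), the sketch below projects raw storage need from the example figures, under the deliberately conservative assumption that the annual growth rate and the daily ingestion volume contribute independently.

```python
def projected_storage_tb(current_tb, daily_ingest_gb, growth_rate, years):
    """Rough storage projection from questionnaire Scale answers.

    Assumption (conservative): the stated annual growth rate and the daily
    ingestion volume are treated as independent contributors and summed.
    """
    compounded = current_tb * (1 + growth_rate) ** years
    ingested_tb = daily_ingest_gb * 365 * years / 1024  # GB -> TB
    return compounded + ingested_tb

# Example values from the questionnaire: 100 TB today, 50 GB/day, 10%/year, 3-year horizon.
three_year_need = projected_storage_tb(100, 50, 0.10, 3)
```

With decoupled compute and storage in HDInsight, a projection like this sizes the storage account rather than the cluster, so the two can grow on different schedules.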

Azure requirements questionnaire

| Question | Example | Answer |
|---|---|---|
| **Topic**: Infrastructure | | |
| Preferred Region | China East | |
| VNet preferred? | Yes | |
| HA / DR needed? | Yes | |
| Integration with other cloud services? | ADF, CosmosDB | |
| **Topic**: Data Movement | | |
| Initial load preference | DistCp, Data Box, ADF, WANdisco | |
| Data transfer delta | DistCp, AzCopy | |
| Ongoing incremental data transfer | DistCp, Sqoop | |
| **Topic**: Monitoring & Alerting | | |
| Use Azure Monitoring & Alerting vs. integrate third-party monitoring | Use Azure Monitoring & Alerting | |
| **Topic**: Security preferences | | |
| Private and protected data pipeline? | Yes | |
| Domain-joined cluster (ESP)? | Yes | |
| On-premises AD sync to cloud? | Yes | |
| Number of AD users to sync? | 100 | |
| OK to sync passwords to cloud? | Yes | |
| Cloud-only users? | Yes | |
| MFA needed? | No | |
| Data authorization requirements? | Yes | |
| Role-Based Access Control? | Yes | |
| Auditing needed? | Yes | |
| Data encryption at rest? | Yes | |
| Data encryption in transit? | Yes | |
| **Topic**: Re-architecture preferences | | |
| Single cluster vs. specific cluster types | Specific cluster types | |
| Colocated storage vs. remote storage? | Remote storage | |
| Smaller cluster size as data is stored remotely? | Smaller cluster size | |
| Use multiple smaller clusters rather than a single large cluster? | Use multiple smaller clusters | |
| Use a remote metastore? | Yes | |
| Share metastores between different clusters? | Yes | |
| Deconstruct workloads? | Replace Hive jobs with Spark jobs | |
| Use ADF for data orchestration? | No | |
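The Data Movement answers above list DistCp as one option for the initial bulk load. The sketch below composes a DistCp invocation from on-premises HDFS to Azure Blob Storage as a Python argument list for inspection; the namenode host, storage account, container, and paths are all placeholders.

```python
# Hedged sketch of an initial bulk copy with DistCp. Every host, account,
# container, and path below is a placeholder; the wasbs:// scheme targets
# Azure Blob Storage (ADLS Gen2 would use abfs:// instead).
distcp_cmd = [
    "hadoop", "distcp",
    "-m", "20",                              # number of parallel map tasks; tune to available bandwidth
    "hdfs://onprem-namenode:8020/data/raw",  # placeholder source path
    "wasbs://container@demostorage.blob.core.windows.net/data/raw",  # placeholder destination
]

# Run from a cluster node that has network connectivity to Azure Storage and
# the Azure Storage driver configured:
# import subprocess; subprocess.run(distcp_cmd, check=True)
print(" ".join(distcp_cmd))
```

For the ongoing incremental transfers the questionnaire mentions, the same command pattern is typically rerun against only the changed paths, or replaced by AzCopy or Sqoop depending on the source.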

Next steps

Read the next article in this series: