Migrate on-premises Apache Hadoop clusters to Azure HDInsight - motivation and benefits
This article is the first in a series on best practices for migrating on-premises Apache Hadoop ecosystem deployments to Azure HDInsight. The series is for people who are responsible for the design, deployment, and migration of Apache Hadoop solutions in Azure HDInsight. The roles that may benefit from these articles include cloud architects, Hadoop administrators, and DevOps engineers. Software developers, data engineers, and data scientists should also benefit from the explanation of how different types of clusters work in the cloud.
Why migrate to Azure HDInsight
Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. HDInsight includes the most popular open-source frameworks, such as:
- Apache Hadoop
- Apache Spark
- Apache Hive with LLAP
- Apache Kafka
- Apache Storm
- Apache HBase
Azure HDInsight advantages over on-premises Hadoop
Low cost - Costs can be reduced by creating clusters on demand and paying only for what you use. Decoupled compute and storage provide flexibility by keeping the data volume independent of the cluster size.
Automated cluster creation - Automated cluster creation requires minimal setup and configuration. Automation can be used for on-demand clusters.
Managed hardware and configuration - There's no need to worry about the physical hardware or infrastructure with an HDInsight cluster. Just specify the configuration of the cluster, and Azure sets it up.
Easily scalable - HDInsight enables you to scale workloads up or down. Azure takes care of data redistribution and workload rebalancing without interrupting data processing jobs.
Secure and compliant - HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.
Simplified version management - Azure HDInsight manages the versions of Hadoop ecosystem components and keeps them up to date. Software updates are usually a complex process for on-premises deployments.
Smaller clusters optimized for specific workloads with fewer dependencies between components - A typical on-premises Hadoop setup uses a single cluster that serves many purposes. With Azure HDInsight, workload-specific clusters can be created, which avoids the growing complexity of maintaining a single multi-purpose cluster.
Productivity - You can use various tools for Hadoop and Spark in your preferred development environment.
Extensibility with custom tools or third-party applications - HDInsight clusters can be extended with installed components and can also be integrated with other big data solutions by using one-click deployments from the Azure Marketplace.
Integration with other Azure services - HDInsight can easily be integrated with other popular Azure services, such as the following:
- Azure Data Factory (ADF)
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Azure Cosmos DB
- Azure SQL Database
- Azure Analysis Services
Self-healing processes and components - HDInsight constantly checks the infrastructure and open-source components using its own monitoring infrastructure. It also automatically recovers from critical failures, such as unavailability of open-source components and nodes. Alerts are triggered in Ambari if any OSS component fails.
For more information, see the article What is Azure HDInsight and the Apache Hadoop technology stack.
Migration planning process
The following steps are recommended for planning a migration of on-premises Hadoop clusters to Azure HDInsight:
- Understand the current on-premises deployment and topologies.
- Understand the current project scope, timelines, and team expertise.
- Understand the Azure requirements.
- Build out a detailed plan based on best practices.
Gathering details to prepare for a migration
This section provides template questionnaires to help gather important information about:
- The on-premises deployment
- Project details
- Azure requirements
On-premises deployment questionnaire
Question | Example | Answer |
---|---|---|
Topic: Environment | | |
Cluster distribution and version | HDP 2.6.5, CDH 5.7 | |
Big Data ecosystem components | HDFS, Yarn, Hive, LLAP, Impala, Kudu, HBase, Spark, MapReduce, Kafka, Zookeeper, Solr, Sqoop, Oozie, Ranger, Atlas, Falcon, Zeppelin, R | |
Cluster types | Hadoop, Spark, Confluent Kafka, Storm, Solr | |
Number of clusters | 4 | |
Number of master nodes | 2 | |
Number of worker nodes | 100 | |
Number of edge nodes | 5 | |
Total disk space | 100 TB | |
Master node configuration | m/y, CPU, disk, etc. | |
Data node configuration | m/y, CPU, disk, etc. | |
Edge node configuration | m/y, CPU, disk, etc. | |
HDFS encryption? | Yes | |
High availability | HDFS HA, Metastore HA | |
Disaster recovery / backup | Backup cluster? | |
Systems that depend on the cluster | SQL Server, Teradata, Power BI, MongoDB | |
Third-party integrations | Tableau, GridGain, Qubole, Informatica, Splunk | |
Topic: Security | | |
Perimeter security | Firewalls | |
Cluster authentication & authorization | Active Directory, Ambari, Cloudera Manager, No authentication | |
HDFS access control | Manual, SSH users | |
Hive authentication & authorization | Sentry, LDAP, AD with Kerberos, Ranger | |
Auditing | Ambari, Cloudera Navigator, Ranger | |
Monitoring | Graphite, collectd, statsd, Telegraf, InfluxDB | |
Alerting | Kapacitor, Prometheus, Datadog | |
Data retention duration | 3 years, 5 years | |
Cluster administrators | Single administrator, multiple administrators | |
Project details questionnaire
Question | Example | Answer |
---|---|---|
Topic: Workloads and Frequency | | |
MapReduce jobs | 10 jobs -- twice daily | |
Hive jobs | 100 jobs -- every hour | |
Spark batch jobs | 50 jobs -- every 15 minutes | |
Spark Streaming jobs | 5 jobs -- every 3 minutes | |
Structured Streaming jobs | 5 jobs -- every minute | |
ML model training jobs | 2 jobs -- once a week | |
Programming languages | Python, Scala, Java | |
Scripting | Shell, Python | |
Topic: Data | | |
Data sources | Flat files, JSON, Kafka, RDBMS | |
Data orchestration | Oozie workflows, Airflow | |
In-memory lookups | Apache Ignite, Redis | |
Data destinations | HDFS, RDBMS, Kafka, MPP | |
Topic: Metadata | | |
Hive DB type | MySQL, Postgres | |
Number of Hive metastores | 2 | |
Number of Hive tables | 100 | |
Number of Ranger policies | 20 | |
Number of Oozie workflows | 100 | |
Topic: Scale | | |
Data volume including replication | 100 TB | |
Daily ingestion volume | 50 GB | |
Data growth rate | 10% per year | |
Cluster node growth rate | 5% per year | |
Topic: Cluster utilization | | |
Average CPU % used | 60% | |
Average memory % used | 75% | |
Disk space used | 75% | |
Average network % used | 25% | |
Topic: Staff | | |
Number of administrators | 2 | |
Number of developers | 10 | |
Number of end users | 100 | |
Skills | Hadoop, Spark | |
Number of available resources for migration efforts | 2 | |
Topic: Limitations | | |
Current limitations | Latency is high | |
Current challenges | Concurrency issues | |
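The scale answers above (data volume, daily ingestion, growth rate) feed directly into sizing the target HDInsight storage. A minimal sketch of how they can be combined into a multi-year projection, using the example values from the table (100 TB today, 50 GB/day ingestion, 10% annual growth); the function name and the compounding formula are illustrative assumptions, not part of any Azure sizing tool:

```python
def projected_volume_tb(current_tb, daily_ingest_gb, annual_growth, years):
    """Project total data volume after `years`, compounding annual growth
    on the existing data and adding steady daily ingestion each year."""
    volume = current_tb
    for _ in range(years):
        volume *= 1 + annual_growth             # organic growth of existing data
        volume += daily_ingest_gb * 365 / 1024  # new ingestion, GB -> TB
    return volume

# Example values from the questionnaire: 100 TB, 50 GB/day, 10%/year
for year in (1, 3, 5):
    print(f"Year {year}: {projected_volume_tb(100, 50, 0.10, year):.1f} TB")
```

A projection like this is what motivates the "decoupled compute and storage" advantage above: storage can grow on this curve while cluster size tracks workload, not data volume.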
Azure requirements questionnaire
Question | Example | Answer |
---|---|---|
Topic: Infrastructure | | |
Preferred region | China East | |
VNet preferred? | Yes | |
HA / DR needed? | Yes | |
Integration with other cloud services? | ADF, Cosmos DB | |
Topic: Data Movement | | |
Initial load preference | DistCp, Data Box, ADF, WANDisco | |
Data transfer delta | DistCp, AzCopy | |
Ongoing incremental data transfer | DistCp, Sqoop | |
Topic: Monitoring & Alerting | | |
Use Azure Monitoring & Alerting vs. integrating third-party monitoring | Use Azure Monitoring & Alerting | |
Topic: Security preferences | | |
Private and protected data pipeline? | Yes | |
Domain-joined cluster (ESP)? | Yes | |
On-premises AD sync to cloud? | Yes | |
Number of AD users to sync? | 100 | |
OK to sync passwords to cloud? | Yes | |
Cloud-only users? | Yes | |
MFA needed? | No | |
Data authorization requirements? | Yes | |
Role-based access control? | Yes | |
Auditing needed? | Yes | |
Data encryption at rest? | Yes | |
Data encryption in transit? | Yes | |
Topic: Re-architecture preferences | | |
Single cluster vs. specific cluster types | Specific cluster types | |
Colocated storage vs. remote storage? | Remote storage | |
Smaller cluster size as data is stored remotely? | Smaller cluster size | |
Use multiple smaller clusters rather than a single large cluster? | Use multiple smaller clusters | |
Use a remote metastore? | Yes | |
Share metastores between different clusters? | Yes | |
Deconstruct workloads? | Replace Hive jobs with Spark jobs | |
Use ADF for data orchestration? | No | |
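For the initial load preference, the choice between network transfer (DistCp, ADF, WANDisco) and shipping an offline appliance (Azure Data Box) usually comes down to a rough transfer-time estimate. A hedged sketch; the helper function and the 70% link-utilization figure are illustrative assumptions, not Azure guidance:

```python
def transfer_days(data_tb, bandwidth_gbps, utilization=0.7):
    """Estimate days needed to copy `data_tb` (binary TB) over a link of
    `bandwidth_gbps` (gigabits/s), assuming only `utilization` of the
    nominal bandwidth is actually achievable."""
    data_bits = data_tb * 8 * 1024**4               # TB -> bits
    effective_bps = bandwidth_gbps * 1e9 * utilization
    return data_bits / effective_bps / 86400        # seconds -> days

# The 100 TB initial load from the questionnaire over a 1 Gbps link
days = transfer_days(100, 1.0)
print(f"~{days:.0f} days")  # a multi-week estimate argues for offline shipping
```

If the estimate runs to weeks, an offline option such as Data Box is worth pricing; if it fits in the migration window, DistCp or ADF over the network keeps the process simpler.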
Next steps
Read the next article in this series: