Azure HDInsight highly available solution architecture case study

Azure HDInsight's replication mechanisms can be integrated into a highly available solution architecture. In this article, a fictional case study for Contoso Retail is used to explain possible high availability disaster recovery approaches, cost considerations, and their corresponding designs.

High availability disaster recovery recommendations can have many permutations and combinations. A solution should be chosen only after weighing the pros and cons of each option. This article discusses only one possible solution.

Customer architecture

The following image depicts the Contoso Retail primary architecture. The architecture consists of a streaming workload, batch workload, serving layer, consumption layer, storage layer, and version control.

Contoso Retail architecture

Streaming workload

Devices and sensors produce data to HDInsight Kafka, which constitutes the messaging framework. An HDInsight Spark consumer reads from the Kafka topics. Spark transforms the incoming messages and writes them to an HDInsight HBase cluster on the serving layer.
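The following sketch illustrates this streaming path, assuming Spark Structured Streaming with the Kafka source and a Thrift-based HBase write in the micro-batch sink. The broker address, topic, table, and column names are illustrative placeholders rather than Contoso's actual configuration, and the happybase client and spark-sql-kafka package are assumed to be available.

```python
# Minimal sketch of the streaming path: consume sensor events from Kafka,
# transform them, and land each micro-batch in HBase on the serving layer.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("sensor-streaming").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("reading", DoubleType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-broker:9092")  # placeholder broker
          .option("subscribe", "sensor-events")                    # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_hbase(batch_df, batch_id):
    # Hypothetical writer using the happybase Thrift client; a production job
    # would more likely use an HBase or Phoenix Spark connector.
    import happybase
    rows = batch_df.collect()                         # fine for a sketch, not for large batches
    conn = happybase.Connection("hbase-thrift-host")  # placeholder Thrift endpoint
    table = conn.table("sensor_readings")             # placeholder HBase table
    with table.batch() as b:
        for r in rows:
            b.put(r["device_id"].encode(), {b"d:reading": str(r["reading"]).encode()})
    conn.close()

query = events.writeStream.foreachBatch(write_to_hbase).start()
query.awaitTermination()
```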

Batch workload

An HDInsight Hadoop cluster running Hive and MapReduce ingests data from on-premises transactional systems. Raw data transformed by Hive and MapReduce is stored in Hive tables on a logical partition of the data lake, which is backed by Azure Data Lake Storage Gen2. Data stored in Hive tables is also made available to Spark SQL, which performs batch transforms before storing the curated data in HBase for serving.
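As a rough illustration of the Spark SQL step, the sketch below reads a raw Hive table from the shared metastore and writes a curated aggregate back to the data lake. The table and column names are hypothetical, and loading the result into HBase for serving would be a separate step.

```python
# Minimal sketch of the batch path: Spark SQL reads raw Hive tables produced by
# the Hive/MapReduce ingestion jobs and writes a curated result back to the lake.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("batch-curation")
         .enableHiveSupport()   # shares the Hive metastore backed by Azure SQL DB
         .getOrCreate())

curated = spark.sql("""
    SELECT store_id,
           product_id,
           SUM(quantity)    AS units_sold,
           SUM(sale_amount) AS revenue
    FROM   raw_sales                     -- placeholder Hive table on ADLS Gen2
    GROUP  BY store_id, product_id
""")

# Persist the curated data as a Hive table; pushing it to HBase for serving
# would be handled by a separate connector or bulk-load job.
curated.write.mode("overwrite").saveAsTable("curated_daily_sales")
```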

Serving layer

An HDInsight HBase cluster with Apache Phoenix is used to serve data to web applications and visualization dashboards. An HDInsight LLAP cluster is used to fulfill internal reporting requirements.
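For illustration only, a consumer such as the web API tier could query the HBase serving layer through Phoenix as in the sketch below, assuming the Phoenix Query Server is reachable and the phoenixdb package is installed. The host, table, and column names are placeholders.

```python
# Hypothetical query against the serving layer through the Phoenix Query Server.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
try:
    cursor = conn.cursor()
    cursor.execute(
        "SELECT product_id, units_sold, revenue "
        "FROM CURATED_SALES WHERE store_id = ?",   # placeholder Phoenix table
        ["store-042"],                             # placeholder key
    )
    for row in cursor.fetchall():
        print(row)
finally:
    conn.close()
```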

Consumption layer

An Azure API Apps and API Management layer backs a public-facing webpage. Internal reporting requirements are fulfilled by Power BI.

Storage layer

Logically partitioned Azure Data Lake Storage Gen2 is used as an enterprise data lake. The HDInsight metastores are backed by Azure SQL DB.

Version control system

A version control system that is integrated with Azure Pipelines and hosted outside of Azure.

Customer business continuity requirements

It's important to determine the minimum business functionality that's needed in the event of a disaster.

Contoso Retail's business continuity requirements

  • We must be protected against a regional failure or a regional service health issue.
  • Our customers must never see a 404 error. Public content must always be served. (RTO = 0)
  • For most of the year, we can show public content that is stale by up to 5 hours. (RPO = 5 hours)
  • During the holiday season, our public-facing content must always be up to date. (RPO = 0)
  • Our internal reporting requirements aren't considered critical to business continuity.
  • Business continuity costs should be optimized.

Proposed solution

The following image shows Contoso Retail's high availability disaster recovery architecture.

Contoso solution

Kafka uses active-passive replication to mirror Kafka topics from the primary region to the secondary region. An alternative to Kafka replication is to produce to Kafka in both regions.
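The dual-produce alternative could look like the following sketch, which uses the confluent-kafka client to send each event to the Kafka clusters in both regions; the bootstrap addresses and topic name are placeholders. The mirrored active-passive approach would instead rely on a tool such as Kafka MirrorMaker and requires no producer changes.

```python
# Sketch of producing every event to the Kafka clusters in both regions.
from confluent_kafka import Producer

primary = Producer({"bootstrap.servers": "kafka-primary:9092"})      # placeholder broker
secondary = Producer({"bootstrap.servers": "kafka-secondary:9092"})  # placeholder broker

def send_to_both(key: bytes, value: bytes, topic: str = "sensor-events") -> None:
    # Each region receives the same event, so the secondary pipeline can take over
    # with little or no data loss if the primary region becomes unavailable.
    for producer in (primary, secondary):
        producer.produce(topic, key=key, value=value)

send_to_both(b"device-001", b'{"reading": 21.7}')
for producer in (primary, secondary):
    producer.flush()
```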

During normal operation, Hive and Spark use an active primary with on-demand secondary replication model. The Hive replication process runs periodically and replicates both the Hive Azure SQL metastore and the Hive storage account. The Spark storage account is periodically replicated by using ADF DistCp. The transient nature of these clusters helps optimize costs. Replications are scheduled every 4 hours to arrive at an RPO that is well within the five-hour requirement.
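The storage replication step could be driven by a copy job similar to the sketch below, which runs Hadoop DistCp in incremental mode between the primary and secondary ADLS Gen2 accounts. The account, container, and path names are placeholders, and the hadoop CLI is assumed to be available wherever the scheduled (for example, 4-hourly) copy activity runs.

```python
# Illustrative incremental DistCp copy from the primary to the secondary data lake.
import subprocess

SOURCE = "abfs://data@contosoprimarylake.dfs.core.windows.net/curated"    # placeholder
TARGET = "abfs://data@contososecondarylake.dfs.core.windows.net/curated"  # placeholder

subprocess.run(
    [
        "hadoop", "distcp",
        "-update",   # copy only files that changed since the last run
        "-delete",   # remove files from the target that were deleted at the source
        SOURCE, TARGET,
    ],
    check=True,
)
```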

During normal operation, HBase replication uses the leader-follower model to ensure that data is always served regardless of the region and that the RPO is zero.

If there is a regional failure in the primary region, the webpage and backend content are served from the secondary region for 5 hours with some degree of staleness. If the Azure service health dashboard doesn't indicate a recovery ETA within the five-hour window, Contoso Retail creates the Hive and Spark transformation layer in the secondary region and then points all upstream data sources to the secondary region. Making the secondary region writable results in a failback process that involves replicating back to the primary region.

During a peak shopping season, the entire secondary pipeline is always active and running. Kafka producers produce to both regions, and the HBase replication is changed from leader-follower to leader-leader to ensure that public-facing content is always up to date.

No failover solution needs to be designed for internal reporting since it's not critical to business continuity.

Next steps

To learn more about the items discussed in this article, see: