Azure HDInsight business continuity

Azure HDInsight clusters depend on many Azure services, such as storage, databases, Active Directory, Active Directory Domain Services, networking, and Key Vault. A well-designed, highly available, and fault-tolerant analytics application needs enough redundancy to withstand regional or local disruptions in one or more of these services. This article gives an overview of best practices, single region availability, and optimization options for business continuity planning.

General best practices

This section discusses a few best practices to consider during business continuity planning.

  • Determine the minimal business functionality you will need if there is a disaster, and why. For example, evaluate whether you need failover capabilities for both the data transformation layer (shown in yellow) and the data serving layer (shown in blue), or only for the data serving layer.

    Figure: Data transformation and data serving layers

  • Segment your clusters based on workload, development lifecycle, and department. Having more clusters reduces the chance of a single large failure affecting multiple business processes.

  • Make your secondary regions read-only. Failover regions with both read and write capabilities can lead to complex architectures.

  • Transient clusters are easier to manage when there is a disaster. Design your workloads so that clusters can be cycled and no state is maintained in the clusters.

  • Workloads are often left unfinished when a disaster occurs and need to restart in the new region. Design your workloads to be idempotent.

  • Use automation during cluster deployments, and script cluster configuration settings as far as possible, so that deployment is rapid and fully automated if there is a disaster.

  • Use Azure monitoring tools on HDInsight to detect abnormal behavior in the cluster and set corresponding alert notifications. You can deploy the pre-configured, HDInsight cluster-specific management solutions that collect important performance metrics for the specific cluster type.

  • Subscribe to Azure health alerts to be notified about service issues, planned maintenance, and health and security advisories for a subscription, service, or region. Health notifications that include the issue cause and the resolution ETA help you execute failovers and failbacks more effectively. For more information, see the Azure Service Health documentation.
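
The idempotency recommendation above can be illustrated with a small sketch. It is not taken from HDInsight itself: the function name `write_batch_output`, the `batch-<id>.json` naming scheme, and the use of a local filesystem in place of cluster storage are all assumptions made for illustration. The idea is that a batch's output location depends only on the batch identifier, so rerunning an unfinished batch after a failover overwrites the earlier attempt instead of duplicating it.

```python
import json
import os
import tempfile


def write_batch_output(output_dir: str, batch_id: str, records: list) -> str:
    """Write one batch's results to a path derived only from batch_id.

    Hypothetical helper: rerunning the same batch yields the same single file.
    """
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"batch-{batch_id}.json")

    # Write to a temporary file first so that an interrupted run never
    # leaves a partial result at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=output_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # Sorting makes the file content independent of input order.
        json.dump(sorted(records), handle)

    # os.replace overwrites any earlier attempt for this batch.
    os.replace(tmp_path, final_path)
    return final_path
```

Calling the function twice with the same `batch_id` and records leaves exactly one output file with the same content, which is the property a restarted workload in a secondary region relies on.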

Single region availability

A basic HDInsight system has the following components. Each component has its own single region fault tolerance mechanisms.

  • Compute (virtual machines): Azure HDInsight cluster
  • Metastore(s): Azure SQL Database
  • Storage: Azure Data Lake Gen2 or Blob storage
  • Authentication: Azure Active Directory, Azure Active Directory Domain Services, Enterprise Security Package
  • Domain name resolution: Azure DNS

Other optional services, such as Azure Key Vault and Azure Data Factory, can also be used.

Figure: HDInsight components

Azure HDInsight cluster (compute)

HDInsight offers an availability SLA of 99.9%. To provide high availability in a single deployment, HDInsight ships with many services that run in high availability mode by default. Fault tolerance mechanisms in HDInsight are provided by both Microsoft and Apache OSS ecosystem high availability services.

The following services are designed to be highly available:

Infrastructure

  • Active and standby head nodes
  • Multiple gateway nodes
  • Three ZooKeeper quorum nodes
  • Worker nodes distributed across fault and update domains

Service

  • Apache Ambari Server
  • Application timeline servers for YARN
  • Job History Server for Hadoop MapReduce
  • Apache Livy
  • HDFS
  • YARN Resource Manager
  • HBase Master

To learn more, refer to the documentation on high availability services supported by Azure HDInsight.
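
As a worked illustration of why the infrastructure list above includes three ZooKeeper quorum nodes, the sketch below applies ZooKeeper's majority-quorum rule: an ensemble stays available while more than half of its nodes are up. The function names are hypothetical and are not part of any HDInsight or ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for nodes in (3, 4, 5):
    print(nodes, quorum_size(nodes), tolerated_failures(nodes))
```

With three nodes the quorum is two, so one node can be lost. A fourth node raises the quorum to three without raising the number of tolerated failures, which is why odd ensemble sizes are the usual choice.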

It doesn't always take a catastrophic event to impact business functionality. Service incidents in one or more of the following services in a single region can also lead to loss of expected business functionality.

HDInsight metastore

HDInsight uses Azure SQL Database as a metastore, which provides an SLA of 99.99%. Three replicas of the data persist within a data center with synchronous replication. If a replica is lost, an alternate replica is served seamlessly. Active geo-replication is supported out of the box, with a maximum of four data centers. When a failover occurs, whether initiated manually or caused by a data center outage, the first replica in the hierarchy automatically becomes read-write capable. For more information, see Azure SQL Database business continuity.

HDInsight storage

HDInsight recommends Azure Data Lake Storage Gen2 as the underlying storage layer. Azure Storage, including Azure Data Lake Storage Gen2, provides an SLA of 99.9%. HDInsight uses the LRS service, in which three replicas of the data persist within a data center and replication is synchronous. If a replica is lost, another replica is served seamlessly.

Azure Active Directory

Azure Active Directory provides an SLA of 99.9%. Active Directory is a global service with multiple levels of internal redundancy and automatic recoverability. For more information, see how Microsoft is continually improving the reliability of Azure Active Directory.

Azure Active Directory Domain Services (Azure AD DS)

Azure Active Directory Domain Services provides an SLA of 99.9%. Azure AD DS is a highly available service hosted in globally distributed data centers. Replica sets are a preview feature in Azure AD DS that enables geographic disaster recovery if an Azure region goes offline. For more information, see replica sets concepts and features for Azure Active Directory Domain Services.

Azure DNS

Azure DNS provides an SLA of 100%. HDInsight uses Azure DNS in various places for domain name resolution.
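
The per-service SLA figures quoted in this section can be combined into a rough single region estimate. The sketch below is a simplification, not an official Azure calculation: it assumes every listed service sits in the critical path, that failures are independent, and that Azure AD DS is in use (as it is for Enterprise Security Package clusters). The names `SLAS`, `composite_availability`, and `expected_downtime_hours_per_year` are introduced here for illustration only.

```python
import math

# SLA figures as quoted in this article, expressed as fractions.
SLAS = {
    "HDInsight cluster": 0.999,
    "Azure SQL Database (metastore)": 0.9999,
    "Azure Storage": 0.999,
    "Azure Active Directory": 0.999,
    "Azure AD DS": 0.999,
    "Azure DNS": 1.0,
}


def composite_availability(slas) -> float:
    """Multiply availabilities, assuming independent serial dependencies."""
    return math.prod(slas)


def expected_downtime_hours_per_year(availability: float,
                                     hours_per_year: float = 8760.0) -> float:
    """Convert an availability fraction into expected yearly downtime."""
    return (1.0 - availability) * hours_per_year


composite = composite_availability(SLAS.values())
print(f"Composite availability: {composite:.4%}")
print(f"Expected downtime: {expected_downtime_hours_per_year(composite):.1f} hours/year")
```

Under these assumptions the composite figure comes out at roughly 99.59%, noticeably lower than any individual SLA, which is the motivation for weighing the multi-region options that follow.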

Multi-region cost and complexity optimizations

Improving business continuity through cross-region high availability and disaster recovery requires architectural designs of higher complexity and higher cost. The following tables detail some technical areas that may increase the total cost of ownership.

Cost optimizations

Area | Cause of cost escalation | Optimization strategies
Data storage | Duplicating primary data/tables in a secondary region. | Replicate only curated data.
Data egress | Outbound cross-region data transfers come at a price. Review the Bandwidth pricing guidelines. | Replicate only curated data to reduce the region egress footprint.
Cluster compute | Additional HDInsight cluster(s) in the secondary region. | Use automated scripts to deploy secondary compute after primary failure. Use Autoscaling to keep the secondary cluster size to a minimum. Use cheaper VM SKUs. Create secondaries in regions where VM SKUs may be discounted.
Authentication | Multiuser scenarios in the secondary region incur additional Azure AD DS setups. | Avoid multiuser setups in the secondary region.

Complexity optimizations

Area | Cause of complexity escalation | Optimization strategies
Read/write patterns | Requiring both primary and secondary to be read and write enabled. | Design the secondary to be read-only.
Zero RPO and RTO | Requiring zero data loss (RPO=0) and zero downtime (RTO=0). | Design RPO and RTO in ways that reduce the number of components that need to fail over.
Business functionality | Requiring the full business functionality of the primary in the secondary. | Evaluate whether you can run with a bare minimum critical subset of the business functionality in the secondary.
Connectivity | Requiring all upstream and downstream systems from the primary to connect to the secondary as well. | Limit the secondary connectivity to a bare minimum critical subset.

Next steps

To learn more about the items discussed in this article, see: