Business continuity and disaster recovery overview

Business continuity and disaster recovery in Azure Data Explorer enables your business to continue operating in the face of a disruption. This article discusses availability (intra-region) and disaster recovery. It describes the native capabilities and architectural considerations for a resilient Azure Data Explorer deployment, covering recovery from human errors, high availability, and several disaster recovery configurations. The right configuration depends on resiliency requirements such as recovery point objective (RPO) and recovery time objective (RTO), on the effort needed, and on cost.

Mitigate disruptive events

Human error

Human errors are inevitable. Users can accidentally drop a cluster, a database, or a table.

Accidental cluster or database deletion

Accidental cluster or database deletion is an irrecoverable action. As the Azure Data Explorer resource owner, you can prevent data loss by enabling the delete lock capability, available at the Azure resource level.

Accidental table deletion

Users with table admin permissions or higher are allowed to drop tables. If one of those users accidentally drops a table, you can recover it using the .undo drop table command. For this command to succeed, the recoverability property must already be enabled in the retention policy.

Accidental external table deletion

External tables are Kusto query schema entities that reference data stored outside the database. Deleting an external table deletes only the table metadata, so you can recover it by re-executing the table creation command. To protect the underlying data itself, use the soft delete capability of the storage service, which guards against accidental deletion or overwrite of a file or blob for a user-configured amount of time.

High availability of Azure Data Explorer

High availability refers to the fault tolerance of Azure Data Explorer, its components, and its underlying dependencies within an Azure region. This fault tolerance avoids single points of failure (SPOF) in the implementation. In Azure Data Explorer, high availability includes the persistence layer, the compute layer, and a leader-follower configuration.

Persistence layer

Azure Data Explorer uses Azure Storage as its durable persistence layer. Azure Storage automatically provides fault tolerance, with the default setting offering locally redundant storage (LRS) within a datacenter. Three replicas are persisted. If a replica is lost while in use, another is deployed without disruption. Further resiliency is possible with zone-redundant storage, which places replicas across the availability zones of an Azure region for maximum fault tolerance at an additional cost.

Compute layer

Azure Data Explorer is a distributed computing platform and can have two or more nodes, depending on scale and node role type. At provisioning time, select availability zones to distribute the node deployment across zones for maximum intra-region resiliency. An availability zone failure won't result in a complete outage. Instead, performance is degraded until the zone recovers.

Leader-follower cluster configuration

Azure Data Explorer provides an optional follower capability that lets a leader cluster be followed by other clusters, which get read-only access to the leader's data and metadata. Changes in the leader, such as create, append, and drop, are automatically synchronized to the follower. Although leader clusters can span Azure regions, the follower clusters should be hosted in the same region(s) as the leader. If the leader cluster is down, or if databases or tables are accidentally dropped, the follower clusters lose access until access is recovered in the leader.

Outage of an Azure availability zone

Azure availability zones are unique physical locations within the same Azure region. They can protect an Azure Data Explorer cluster's compute and data from partial region failure. Because a zone failure occurs within a region, it is treated as an availability scenario rather than a disaster recovery scenario.

Pin an Azure Data Explorer cluster to the same zone as other connected Azure resources. For more information on enabling availability zones, see create a cluster.

Note

Availability zone selection is only supported at the time of cluster creation and can't be modified later.

Outage of an Azure datacenter

Azure availability zones come with a cost, and some customers choose to deploy without zonal redundancy. With such an Azure Data Explorer deployment, an Azure datacenter outage results in a cluster outage. Handling an Azure datacenter outage is therefore identical to handling an Azure region outage.

Outage of an Azure region

Azure Data Explorer doesn't provide automatic protection against the outage of an entire Azure region. To minimize business impact in such an outage, deploy multiple Azure Data Explorer clusters across Azure paired regions. Several disaster recovery configurations are available, and the choice depends on your recovery time objective (RTO), recovery point objective (RPO), and effort and cost considerations. Cost and performance optimizations are possible with Azure Advisor recommendations and autoscale configuration.

Disaster recovery configurations

This section details multiple disaster recovery configurations, which differ in resiliency (RPO and RTO), needed effort, and cost.

Recovery time objective (RTO) refers to the time to recover from a disruption. For example, an RTO of 2 hours means the application has to be up and running within two hours of a disruption. Recovery point objective (RPO) refers to the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the allowable threshold. For example, with an RPO of 24 hours, an application that holds data going back 15 years stays within the agreed-upon RPO as long as no more than the most recent 24 hours of data is lost.
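The two objectives can be checked with simple time arithmetic. The following is a minimal sketch, not part of Azure Data Explorer: the function name `meets_objectives` and its parameters are hypothetical, and the timestamps are made-up illustration values.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recoverable, outage_start, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single disruption.

    last_recoverable: timestamp of the newest data that can be restored.
    outage_start:     when the disruption began.
    service_restored: when the application was running again.
    """
    data_loss_window = outage_start - last_recoverable  # data newer than this is lost
    downtime = service_restored - outage_start          # time spent recovering
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: 3 hours of data lost, 3 hours of downtime.
result = meets_objectives(
    last_recoverable=datetime(2024, 1, 10, 9, 0),
    outage_start=datetime(2024, 1, 10, 12, 0),
    service_restored=datetime(2024, 1, 10, 15, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=2),
)
print(result)  # (True, False)
```

In this made-up incident the 24-hour RPO is met because only three hours of data are lost, but the 2-hour RTO is missed because recovery took three hours.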

Ingestion, processing, and curation processes need diligent design upfront when planning for disaster recovery. Ingestion refers to data integrated into Azure Data Explorer from various sources; processing refers to transformations and similar activities; curation refers to materialized views, exports to the data lake, and so on.

The following are popular disaster recovery configurations. Each is described in detail below.

Always-on configuration

For critical application deployments with no tolerance for outages, use multiple Azure Data Explorer clusters across Azure paired regions. Set up ingestion, processing, and curation in parallel to all of the clusters. The cluster SKU must be the same across regions. Azure ensures that updates are rolled out in a staggered fashion across Azure paired regions. An Azure region outage won't cause an application outage, although you may experience some latency or performance degradation.

Figure: Active-Active-Active-n configuration

| Configuration | RPO | RTO | Effort | Cost |
|---|---|---|---|---|
| Active-Active-Active-n | 0 hours | 0 hours | Lower | Highest |

Active-Active configuration

This configuration is identical to the Always-on configuration, but involves only two Azure paired regions. Configure dual ingestion, processing, and curation. Users are routed to the nearest region. The cluster SKU must be the same across regions.

Figure: Active-Active configuration

| Configuration | RPO | RTO | Effort | Cost |
|---|---|---|---|---|
| Active-Active | None | None | Lower | High |

Active-Hot standby configuration

The Active-Hot standby configuration is similar to the Active-Active configuration in that ingestion, processing, and curation are all dual. However, the standby cluster is offline to end users and doesn't need to use the same SKU as the primary. The hot standby cluster can also be of a smaller SKU and scale, in which case it is less performant. In a disaster scenario, the standby cluster is brought online and scaled up.

Figure: Active-Hot standby configuration

| Configuration | RPO | RTO | Effort | Cost |
|---|---|---|---|---|
| Active-Hot Standby | Low | Low | Medium | Medium |

On-demand data recovery configuration

This solution offers the least resiliency (highest RPO and RTO), the lowest cost, and the highest effort. In this configuration, there's no data recovery cluster. Configure continuous export of curated data (and of raw and intermediate data, if those are also required) to a storage account configured with geo-redundant storage (GRS). A data recovery cluster is spun up only when a disaster recovery scenario occurs. At that time, DDLs, configuration, policies, and processes are applied.

Figure: On-demand data recovery cluster configuration

| Configuration | RPO | RTO | Effort | Cost |
|---|---|---|---|---|
| On-demand data recovery cluster | Highest | Highest | Highest | Lowest |

Summary of disaster recovery configuration options

| Configuration | Resiliency | RPO | RTO | Effort | Cost |
|---|---|---|---|---|---|
| Active-Active-Active-n | Highest | 0 hours | 0 hours | Lower | Highest |
| Active-Active | High | None | None | Lower | High |
| Active-Hot Standby | Medium | Low | Low | Medium | Medium |
| On-demand data recovery cluster | Lowest | Highest | Highest | Highest | Lowest |
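The trade-off in the summary can be expressed as a small lookup. The sketch below is illustrative only: the numeric ranks are an assumption that maps the qualitative labels (Lowest, Medium, High, Highest) to 1 through 4, and `DrConfig` and `cheapest_meeting` are hypothetical names, not part of any Azure tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrConfig:
    name: str
    resiliency: int  # assumed ordinal rank: 1 = Lowest ... 4 = Highest
    cost: int        # assumed ordinal rank: 1 = Lowest ... 4 = Highest


# Ranks transcribed from the qualitative labels in the summary.
CONFIGS = [
    DrConfig("Active-Active-Active-n", resiliency=4, cost=4),
    DrConfig("Active-Active", resiliency=3, cost=3),
    DrConfig("Active-Hot Standby", resiliency=2, cost=2),
    DrConfig("On-demand data recovery cluster", resiliency=1, cost=1),
]


def cheapest_meeting(min_resiliency):
    """Return the lowest-cost configuration whose resiliency rank is at
    least min_resiliency, or None if no configuration qualifies."""
    candidates = [c for c in CONFIGS if c.resiliency >= min_resiliency]
    return min(candidates, key=lambda c: c.cost, default=None)


choice = cheapest_meeting(3)
print(choice.name)  # Active-Active
```

A real selection would also weigh the effort column and concrete RPO and RTO targets, which this sketch leaves out.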

Best practices

Regardless of which disaster recovery configuration you choose, follow these best practices:

  • All database objects, policies, and configurations should be persisted in source control so they can be released to the cluster from your release automation tool.
  • Design, develop, and implement validation routines to ensure all clusters are in sync from a data perspective. Azure Data Explorer supports cross-cluster joins. A simple count of rows across tables can help validate.
  • Use the continuous export capability to export data within Azure Data Explorer tables to an Azure Data Lake store. Select GRS for the highest resilience.
  • Release procedures should involve governance checks and balances that ensure mirroring of the clusters.
  • Be fully cognizant of what it takes to build a cluster from scratch.
  • Create a checklist of deployment units. Your list will be unique to your needs, but should include: deployment scripts, ingestion connections, BI tools, and other important configurations.
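The row-count validation mentioned in the list above can be sketched as follows. This is a minimal illustration, assuming the per-table row counts have already been collected from each cluster (for example, by running a count query against each one). The function name `find_count_mismatches`, the `tolerance` parameter, and the sample table names are all hypothetical.

```python
def find_count_mismatches(primary_counts, replica_counts, tolerance=0):
    """Compare per-table row counts from two clusters.

    Returns {table: (primary_count, replica_count)} for every table that is
    missing on one side or whose counts differ by more than `tolerance`.
    A missing table is reported with None as its count.
    """
    mismatches = {}
    for table in sorted(set(primary_counts) | set(replica_counts)):
        p = primary_counts.get(table)
        r = replica_counts.get(table)
        if p is None or r is None or abs(p - r) > tolerance:
            mismatches[table] = (p, r)
    return mismatches


# Made-up counts: the replica lags slightly on Events and lacks the Audit table.
primary = {"Events": 1_000_000, "Users": 5_000, "Audit": 120}
replica = {"Events": 999_950, "Users": 5_000}

report = find_count_mismatches(primary, replica, tolerance=100)
print(report)  # {'Audit': (120, None)}
```

A nonzero tolerance allows for ingestion lag between clusters. Counts alone don't prove the data is identical, so treat this as a first-pass check.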

Next steps

To learn more, see Create business continuity and disaster recovery solutions with Azure Data Explorer.