Create business continuity and disaster recovery solutions with Azure Data Explorer

This article explains how to prepare for an Azure regional outage by replicating your Azure Data Explorer resources, management, and ingestion across different Azure regions. It includes an example of data ingestion with Event Hub and discusses cost optimization for different architecture configurations. For a more in-depth look at architecture considerations and recovery solutions, see the business continuity overview.

Prepare for Azure regional outage to protect your data

Azure Data Explorer doesn't automatically protect against the outage of an entire Azure region, such as one caused by a natural disaster like an earthquake. If you need a disaster recovery solution, follow these steps to ensure business continuity. In these steps, you replicate your clusters, management, and data ingestion in two Azure paired regions.

  1. Create two or more independent clusters in two Azure paired regions.
  2. Replicate all management activities, such as creating new tables or managing user roles, on each cluster.
  3. Ingest data to each cluster in parallel.

Create multiple independent clusters

Create Azure Data Explorer clusters in more than one region, and make sure that at least two of these clusters are in Azure paired regions.
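As a minimal sketch of the placement rule above, the following Python checks whether a set of replicas includes at least two clusters in an Azure region pair. The `ClusterReplica` type, the cluster names, and the `REGION_PAIRS` table are hypothetical; the table holds only a few pairs for illustration, so check the official list of Azure region pairs before relying on it.

```python
from dataclasses import dataclass
from typing import List

# Illustrative subset of Azure region pairs (assumption: verify against the
# official Azure region pair list before use).
REGION_PAIRS = {
    frozenset({"eastus", "westus"}),
    frozenset({"eastus2", "centralus"}),
    frozenset({"northeurope", "westeurope"}),
}


@dataclass(frozen=True)
class ClusterReplica:
    name: str    # hypothetical cluster name
    region: str  # Azure region short name, e.g. "westeurope"


def has_paired_replicas(replicas: List[ClusterReplica]) -> bool:
    """Return True if at least two replicas sit in the two regions of a known pair."""
    regions = {r.region for r in replicas}
    # A pair is covered when both of its regions host at least one replica.
    return any(pair <= regions for pair in REGION_PAIRS)
```

For example, replicas in `westeurope`, `northeurope`, and `eastus` satisfy the rule, while replicas in only `westeurope` and `eastus` do not, because those two regions aren't a pair.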

The following image shows replicas: three clusters in three different regions.

[Image: Create independent clusters]

Replicate management activities

Replicate the management activities so that every replica has the same cluster configuration.

  1. Create the same items on each replica:

  2. Manage authentication and authorization on each replica.

    [Image: Replicate management activities]
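The following sketch shows one way to keep replicas consistent: run the same list of management commands against every cluster and record which commands failed on which cluster, so drifted replicas can be retried. The `execute` callable, the cluster URIs, and the helper name `replicate_management` are hypothetical placeholders for whatever client you use to send management commands; the sample commands in the usage note follow standard Kusto management command syntax.

```python
from typing import Callable, Dict, List

# Hypothetical executor: runs one management command against
# (cluster_uri, database) and raises an exception on failure.
Executor = Callable[[str, str, str], None]


def replicate_management(
    cluster_uris: List[str],
    database: str,
    commands: List[str],
    execute: Executor,
) -> Dict[str, List[str]]:
    """Apply the same management commands to every replica.

    Returns a map from cluster URI to the commands that failed on it;
    an empty list means that replica is in sync for this batch.
    """
    failures: Dict[str, List[str]] = {uri: [] for uri in cluster_uris}
    for uri in cluster_uris:
        for cmd in commands:
            try:
                execute(uri, database, cmd)
            except Exception:
                # Keep going so one bad replica doesn't block the others.
                failures[uri].append(cmd)
    return failures
```

A typical batch might contain `.create table Events (Timestamp: datetime, Message: string)` for schema and `.add database MyDb viewers ('aaduser=alice@contoso.com')` for authorization, applied to every cluster URI in the same order.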

Configure data ingestion

Configure data ingestion consistently on every cluster. The following ingestion methods support these advanced business continuity features:

| Ingestion method | Disaster recovery feature |
|---|---|
| IoT Hub | Microsoft-initiated failover and manual failover |

Disaster recovery solution using Event Hub ingestion

Once you've completed Prepare for Azure regional outage to protect your data, your data and management are distributed across multiple regions. If one region has an outage, the replicas in the other regions remain available.

Set up ingestion using Event Hub

The following example uses ingestion via Event Hub. A failover flow has been set up, and Azure Data Explorer ingests data through the alias. Use a unique consumer group for each cluster replica when ingesting from Event Hub. If replicas share a consumer group, the events are distributed among them instead of being replicated to each one.
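To see why each replica needs its own consumer group, the following simplified model (not the Event Hubs SDK) assigns each partition to a single reader within a consumer group, while every consumer group reads every partition. With a shared group, replicas `adx-a` and `adx-b` each receive only part of the events; with one group per replica, each receives all of them. The replica names, group names, and round-robin partition assignment are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def deliver(
    partitions: Dict[int, List[str]],
    replicas: List[Tuple[str, str]],  # (replica name, consumer group)
) -> Dict[str, List[str]]:
    """Simplified model of Event Hub consumer group delivery.

    Every consumer group sees every partition, but within one group each
    partition is read by exactly one member (round-robin assignment here).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, group in replicas:
        groups[group].append(name)

    received: Dict[str, List[str]] = {name: [] for name, _ in replicas}
    for members in groups.values():
        for pid in sorted(partitions):
            owner = members[pid % len(members)]  # one reader per partition per group
            received[owner].extend(partitions[pid])
    return received
```

With partitions `{0: ["e0", "e2"], 1: ["e1", "e3"]}`, a shared group splits the four events between the two replicas, while unique groups give each replica all four.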

Note

Ingestion via Event Hub, IoT Hub, or Storage is robust. If a cluster is unavailable for a period of time, it catches up later and inserts any pending messages or blobs. This process relies on checkpointing.
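The catch-up behavior can be pictured with the following simplified, single-partition model of checkpointing. It's an illustration rather than the Event Hubs client API (`CheckpointedConsumer` is a hypothetical name), and it assumes the outage is shorter than the Event Hub's retention period, since events older than that are no longer available to read.

```python
from typing import List


class CheckpointedConsumer:
    """Simplified model of checkpoint-based catch-up for one partition.

    The checkpoint is the offset of the next event to read. While the
    consumer is offline, events keep accumulating in the partition; when
    it comes back, it resumes from the checkpoint and ingests the backlog.
    """

    def __init__(self) -> None:
        self.checkpoint = 0
        self.ingested: List[str] = []
        self.online = True

    def poll(self, partition: List[str]) -> int:
        """Ingest every event after the checkpoint; return how many were read."""
        if not self.online:
            return 0  # an unavailable cluster reads nothing, and its checkpoint stays put
        pending = partition[self.checkpoint:]
        self.ingested.extend(pending)
        # Move the checkpoint only after the events have been ingested.
        self.checkpoint = len(partition)
        return len(pending)
```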

[Image: Ingestion via Event Hub]

As shown in the following diagram, your data sources produce events to the failover-configured Event Hub, and each Azure Data Explorer replica consumes those events. Data visualization components such as Power BI, Grafana, or SDK-powered web apps can query any one of the replicas.

[Image: From data sources to data visualization]

Optimize costs

You can now optimize your replicas with some of the following methods:

Create an active-hot standby configuration

Replicating and updating the Azure Data Explorer setup increases cost linearly with the number of replicas. To optimize cost, you can implement an architectural variant that balances recovery time, failover, and cost. An active-hot standby configuration reduces cost by introducing passive Azure Data Explorer replicas that are turned on only if there's a disaster in the primary region (for example, Region A). Because the replicas in Regions B and C don't need to be active 24/7, the cost drops significantly. However, in most cases these replicas don't perform as well as the primary cluster. For more information, see Active-Hot standby configuration.
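As a rough worked example of the difference, the following sketch compares monthly compute cost for three replicas in an active-active setup and in an active-hot standby setup. The hourly rate and the 8 hours of standby run time are hypothetical values, not Azure prices, and the model deliberately ignores storage, markup, and networking.

```python
from typing import List

HOURS_PER_MONTH = 730  # approximate number of hours in a month


def monthly_compute_cost(running_hours: List[float], rate_per_hour: float) -> float:
    """Compute-only cost: total hours all replicas run, times a flat hourly rate.

    Storage, markup, and networking are left out of this sketch.
    """
    return rate_per_hour * sum(running_hours)


RATE = 10.0  # hypothetical hourly rate per cluster, not a real Azure price

# Active-active: three replicas running all month.
active_active = monthly_compute_cost([HOURS_PER_MONTH] * 3, RATE)

# Active-hot standby: the primary runs all month; the two standbys run only a
# hypothetical 8 hours each (for example, during a failover drill).
hot_standby = monthly_compute_cost([HOURS_PER_MONTH, 8, 8], RATE)

print(active_active, hot_standby)  # 21900.0 7460.0
```

Under these assumptions, compute cost tracks total running hours, so keeping the standbys stopped removes most of their share. The sketch also assumes all replicas use the same SKU; smaller standby SKUs would lower cost further, at the price of the reduced performance mentioned above.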

In the following image, only one cluster ingests data from the Event Hub. The primary cluster in Region A uses continuous data export to export all data to a storage account, and the secondary replicas access that data through external tables.

[Image: Architecture of an active/hot standby configuration]

Start and stop the replicas

You can start and stop the secondary replicas using one of the following methods:

  • The Stop button on the Overview tab in the Azure portal. For more information, see Stop and restart the cluster.

  • Azure CLI:

az kusto cluster stop --name <clusterName> --resource-group <rgName> --subscription <subscriptionId>
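If you script failover, a small helper can build the matching `az kusto cluster start` or `az kusto cluster stop` command for every secondary replica. This is a sketch that assumes all secondaries share one resource group and subscription; the helper names and sample values are hypothetical, and with `dry_run=True` (the default) it only returns the commands instead of running them.

```python
import subprocess
from typing import List


def kusto_cluster_cmd(action: str, name: str, resource_group: str, subscription: str) -> List[str]:
    """Build an `az kusto cluster start|stop` command as an argument list."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return [
        "az", "kusto", "cluster", action,
        "--name", name,
        "--resource-group", resource_group,
        "--subscription", subscription,
    ]


def set_secondaries(
    action: str,
    secondaries: List[str],
    resource_group: str,
    subscription: str,
    dry_run: bool = True,
) -> List[List[str]]:
    """Start or stop every secondary replica; with dry_run=True, only return the commands."""
    cmds = [kusto_cluster_cmd(action, n, resource_group, subscription) for n in secondaries]
    if not dry_run:
        for cmd in cmds:
            # Requires the Azure CLI to be installed and signed in.
            subprocess.run(cmd, check=True)
    return cmds
```

For example, `set_secondaries("start", ["adx-region-b", "adx-region-c"], "rg-bcdr", "<subscriptionId>")` returns the two start commands you would run when the primary region fails.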

Implement a highly available application service

Create the Azure App Service BCDR client

This section shows how to create an Azure App Service that connects to a single primary and multiple secondary Azure Data Explorer clusters. The following image illustrates the Azure App Service setup.

[Image: Create Azure App Service]

Tip

Having multiple connections between replicas in the same service increases availability, so this setup is useful beyond regional outages.

  1. Use this boilerplate code for an app service. The AdxBcdrClient class implements a multi-cluster client: each query executed through this client is sent first to the primary cluster, and if that fails, to the secondary replicas.

  2. Use custom Application Insights metrics to measure performance and the distribution of requests across the primary and secondary clusters.
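The following Python sketch illustrates the primary-first, failover-to-secondary pattern described in these steps. It is not the `AdxBcdrClient` class from the boilerplate code; `BcdrClient` and the `query_fn` callable are hypothetical, and the per-cluster `served_by` counter stands in for the custom Application Insights metrics.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical query function: runs `query` against one cluster URI and
# returns the result, raising an exception if the cluster is unavailable.
QueryFn = Callable[[str, str], object]


class BcdrClient:
    """Minimal sketch of a client that tries the primary cluster first."""

    def __init__(self, primary: str, secondaries: List[str], query_fn: QueryFn) -> None:
        self.clusters = [primary, *secondaries]  # order defines failover priority
        self.query_fn = query_fn
        self.served_by: Counter = Counter()  # requests served per cluster (metric stand-in)

    def execute(self, query: str) -> Tuple[str, object]:
        """Return (cluster URI that answered, result); raise if every cluster fails."""
        last_error: Optional[Exception] = None
        for uri in self.clusters:
            try:
                result = self.query_fn(uri, query)
            except Exception as err:
                last_error = err  # remember the failure and try the next replica
                continue
            self.served_by[uri] += 1
            return uri, result
        raise RuntimeError("all clusters failed") from last_error
```

Ordering the cluster list explicitly keeps the routing predictable; a production client would likely add timeouts and health tracking so that a slow primary doesn't delay every request.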

Test the Azure App Service BCDR client

We ran a test using multiple Azure Data Explorer replicas. After a simulated outage of the primary and secondary clusters, the app service BCDR client behaved as intended.

[Image: Verify the app service BCDR client]

Note

The slower response times are due to different SKUs and queries that cross distant regions.

Perform dynamic or static routing

Use Azure Traffic Manager routing methods for dynamic or static routing of requests. Azure Traffic Manager is a DNS-based traffic load balancer that distributes app service traffic to services across global Azure regions while providing high availability and responsiveness.
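Traffic Manager's priority routing method is one way to get static, primary-first routing. The following model sketches how that method picks an endpoint: among endpoints that pass health probes, the one with the lowest priority value wins. The `Endpoint` type, `route_priority` function, and endpoint names are illustrative, not part of any Azure SDK. Because Traffic Manager works at the DNS level, clients only follow a change after their cached DNS responses expire.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str       # hypothetical app service endpoint name
    priority: int   # lower value = higher priority, as in the priority routing method
    healthy: bool   # outcome of the health probe (simulated here)


def route_priority(endpoints: List[Endpoint]) -> Optional[str]:
    """Return the healthy endpoint with the lowest priority value, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority).name


endpoints = [
    Endpoint("app-region-a", priority=1, healthy=False),  # primary region is down
    Endpoint("app-region-b", priority=2, healthy=True),
    Endpoint("app-region-c", priority=3, healthy=True),
]
print(route_priority(endpoints))  # app-region-b
```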

Optimize cost in an active-active configuration

Using an active-active configuration for disaster recovery increases cost linearly with the number of replicas. The cost includes nodes, storage, markup, and the additional networking cost for bandwidth.

Next steps

业务连续性和灾难恢复概述着手。Get started with the business continuity and disaster recovery overview.