Create business continuity and disaster recovery solutions with Azure Data Explorer

This article details how you can prepare for an Azure regional outage by replicating your Azure Data Explorer resources, management, and ingestion in different Azure regions. An example of data ingestion with Azure Event Hubs is given. Cost optimization is also discussed for different architecture configurations. For a more in-depth look at architecture considerations and recovery solutions, see the business continuity overview.

Prepare for Azure regional outage to protect your data

Azure Data Explorer doesn't support automatic protection against the outage of an entire Azure region. This disruption can happen during a natural disaster, like an earthquake. If you require a solution for a disaster recovery situation, do the following steps to ensure business continuity. In these steps, you'll replicate your clusters, management, and data ingestion in two Azure paired regions.

  1. Create two or more independent clusters in two Azure paired regions.
  2. Replicate all management activities such as creating new tables or managing user roles on each cluster.
  3. Ingest data into each cluster in parallel.

Create multiple independent clusters

Create more than one Azure Data Explorer cluster in more than one region. Make sure that at least two of these clusters are created in Azure paired regions.

The following image shows replicas, three clusters in three different regions.

Create independent clusters.

Replicate management activities

Replicate the management activities to have the same cluster configuration in every replica.

  1. Create on each replica the same:

  2. Manage the authentication and authorization on each replica.

    Duplicate management activities.

Disaster recovery solution using event hub ingestion

Once you've completed Prepare for Azure regional outage to protect your data, your data and management are distributed to multiple regions. If there's an outage in one region, Azure Data Explorer will be able to use the other replicas.

Set up ingestion using an event hub

To ingest data from Azure Event Hubs into each region's Azure Data Explorer cluster, first replicate your Azure Event Hubs setup in each region. Then configure each region's Azure Data Explorer replica to ingest data from its corresponding Event Hubs.

Note

Ingestion via Azure Event Hubs/IoT Hub/Storage is robust. If a cluster isn't available for a period of time, it will catch up at a later time and insert any pending messages or blobs. This process relies on checkpointing.

Ingest via Azure Event Hubs.

As shown in the diagram below, your data sources produce events to event hubs in all regions, and each Azure Data Explorer replica consumes the events. Data visualization components like Power BI, Grafana, or SDK powered WebApps can query one of the replicas.

Data sources to data visualization.

Optimize costs

Now you're ready to optimize your replicas using some of the following methods:

Create an on-demand data recovery configuration

Replicating and updating the Azure Data Explorer setup will linearly increase the cost with the number of replicas. To optimize cost, you can implement an architectural variant to balance time, failover, and cost. In an on-demand data recovery configuration, cost optimization has been implemented by introducing passive Azure Data Explorer replicas. These replicas are only turned on if there's a disaster in the primary region (for example, region A). The replicas in Regions B and C don't need to be active 24/7, reducing the cost significantly. However, in most cases, the performance of these replicas won't be as good as the primary cluster. For more information, see On-demand data recovery configuration.

In the image below, only one cluster is ingesting data from the event hub. The primary cluster in Region A performs continuous data export of all data to a storage account. The secondary replicas have access to the data using external tables.

architecture for an on-demand data recovery configuration.

Start and stop the replicas

You can start and stop the secondary replicas using one of the following methods:

az kusto cluster stop --name=<clusterName> --resource-group=<rgName> --subscription=<subscriptionId>"

Implement a highly available application service

Create the Azure App Service BCDR client

This section shows you how to create an Azure App Service that supports a connection to a single primary and multiple secondary Azure Data Explorer clusters. The following image illustrates the Azure App Service setup.

Create an Azure App Service.

Tip

Having multiple connections between replicas in the same service gives you increased availability. This setup isn't only useful in instances of regional outages.

  1. Use this boilerplate code for an app service. To implement a multi-cluster client, the AdxBcdrClient class has been created. Each query that is executed using this client will be sent first to the primary cluster. If there's a failure, the query will be sent to secondary replicas.

  2. Use custom application insights metrics to measure performance, and request distribution to primary and secondary clusters.

Test the Azure App Service BCDR client

We ran a test using multiple Azure Data Explorer replicas. After a simulated outage of primary and secondary clusters, you can see that the app service BCDR client is behaving as intended.

Verify app service BCDR client.

The Azure Data Explorer clusters are distributed across West Europe (2xD14v2 primary), South East Asia, and East US (2xD11v2).

Cross planet query response time.

Note

Slower response times are due to different SKUs and cross planet queries.

Perform dynamic or static routing

Use Azure Traffic Manager routing methods for dynamic or static routing of the requests. Azure Traffic Manager is a DNS-based traffic load balancer that enables you to distribute app service traffic. This traffic is optimized to services across global Azure regions, while providing high availability and responsiveness.

Optimize cost in an active-active configuration

Using an active-active configuration for disaster recovery increases the cost linearly. The cost includes nodes, storage, markup, and increased networking cost for bandwidth.

Use optimized autoscale to optimize costs

Use the optimized autoscale feature to configure horizontal scaling for the secondary clusters. They should be dimensioned so they can handle the ingestion load. Once the primary cluster isn't reachable, the secondary clusters will get more traffic and scale according to the configuration.

Using optimized autoscale in this example saved roughly 50% of the cost in comparison to having the same horizontal and vertical scale on all replicas.