Set up disaster recovery at scale for VMware VMs/physical servers

10/09/2024

This article describes how to set up disaster recovery to Azure for large numbers (> 1000) of on-premises VMware VMs or physical servers in your production environment, using the Azure Site Recovery service.

Define your BCDR strategy

As part of your business continuity and disaster recovery (BCDR) strategy, you define recovery point objectives (RPOs) and recovery time objectives (RTOs) for your business apps and workloads. RTO measures the duration of time and service level within which a business app or process must be restored and available, in order to avoid continuity issues.

Site Recovery provides continuous replication for VMware VMs and physical servers, and an SLA for RTO.
As you plan for large-scale disaster recovery for VMware VMs and figure out the Azure resources you need, you can specify an RTO value that will be used for capacity calculations.

Best practices

The following are general best practices for large-scale disaster recovery. They're discussed in more detail in the next sections of the document.

Identify target requirements: Estimate capacity and resource needs in Azure before you set up disaster recovery.
Plan for Site Recovery components: Figure out what Site Recovery components (configuration server, process servers) you need to meet your estimated capacity.
Set up one or more scale-out process servers: Don't use the process server that's running by default on the configuration server.
Run the latest updates: The Site Recovery team releases new versions of Site Recovery components on a regular basis, and you should make sure you're running the latest versions. To help with that, track what's new for updates, and enable and install updates as they're released.
Monitor proactively: As you get disaster recovery up and running, you should proactively monitor the status and health of replicated machines, and infrastructure resources.
Disaster recovery drills: Run disaster recovery drills on a regular basis. Drills don't affect your production environment, but they do help ensure that failover to Azure will work as expected when needed.

Gather capacity planning information

Gather information about your on-premises environment, to help assess and estimate your target (Azure) capacity needs.

For physical servers, gather the information manually.

Plan target (Azure) requirements and capacity

Using your gathered estimations and recommendations, you can plan for target resources and capacity.

Compatible VMs: Use this number to identify the number of VMs that are ready for disaster recovery to Azure. Recommendations about network bandwidth and Azure cores are based on this number.
Required network bandwidth: Note the bandwidth you need for delta replication of compatible VMs.
- When you run the Planner, you specify the desired RPO in minutes. The recommendations show you the bandwidth needed to meet that RPO 100% of the time and 90% of the time.
- The network bandwidth recommendations take into account the bandwidth needed for the total number of configuration servers and process servers recommended in the Planner.
Required Azure cores: Note the number of cores you need in the target Azure region, based on the number of compatible VMs. If you don't have enough cores, at failover Site Recovery won't be able to create the required Azure VMs.
Recommended VM batch size: The recommended batch size is based on the ability to finish initial replication for the batch within 72 hours by default, while meeting the desired RPO 100% of the time. The hour value can be modified.

You can use these recommendations to plan for Azure resources, network bandwidth, and VM batching.

Plan Azure subscriptions and quotas

Make sure that the available quotas in the target subscription are sufficient to handle failover.

Task	Details	Action
Check cores	If cores in the available quota don't equal or exceed the total target count at the time of failover, failovers will fail.	For VMware VMs, check you have enough cores in the target subscription. For physical servers, check that Azure cores meet your manual estimations. To check quotas, in the Azure portal > Subscription, click Usage + quotas. Learn more about increasing quotas.
Check failover limits	The number of failovers mustn't exceed Site Recovery failover limits.	If failovers exceed the limits, you can add subscriptions, and fail over to multiple subscriptions, or increase quota for a subscription.
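The core check in the table above comes down to simple arithmetic. The following is a minimal sketch of that check, not an official tool. The function name and the sample numbers are hypothetical; in practice, the quota values come from Usage + quotas in the portal.

```python
def cores_shortfall(required_cores: int, quota_limit: int, quota_in_use: int) -> int:
    """Return how many extra cores the target subscription needs.

    Returns 0 when the remaining quota already covers the failover.
    """
    available = quota_limit - quota_in_use
    return max(0, required_cores - available)


# Hypothetical numbers for illustration only.
print(cores_shortfall(required_cores=4200, quota_limit=5000, quota_in_use=1500))  # 700
```

A nonzero result suggests requesting a quota increase, or spreading failover across subscriptions, before a failover is needed.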

Failover limits

The limits indicate the number of failovers that are supported by Site Recovery within one hour, assuming three disks per machine.

What does comply mean? To start an Azure VM, Azure requires some drivers to be in boot start state, and services like DHCP to be set to start automatically.

Machines that comply will already have these settings in place.
For machines running Windows, you can proactively check compliance, and make them compliant if needed. Learn more.
Linux machines are only brought into compliance at the time of failover.

Machine complies with Azure?	Azure VM limits (managed disk failover)
Yes	2000
No	1000

Limits assume that minimal other jobs are in progress in the target region for the subscription.
Some Azure regions are smaller, and might have slightly lower limits.
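As a rough planning aid, the limits in the table can be turned into an estimate of how many subscriptions a one-hour failover window needs. This is a sketch under stated assumptions: it treats each noncompliant machine as costing twice as much as a compliant one (derived from the 2000 and 1000 figures), and it assumes a mixed workload scales linearly, which the table doesn't state. The function name is hypothetical.

```python
COMPLIANT_PER_HOUR = 2000      # from the limits table
NON_COMPLIANT_PER_HOUR = 1000  # from the limits table


def subscriptions_needed(compliant: int, non_compliant: int) -> int:
    """Estimate subscriptions needed to fail over all machines within one hour.

    Assumption: a noncompliant machine uses twice the budget of a compliant one.
    """
    weight = COMPLIANT_PER_HOUR // NON_COMPLIANT_PER_HOUR  # 2
    weighted_load = compliant + weight * non_compliant
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (weighted_load + COMPLIANT_PER_HOUR - 1) // COMPLIANT_PER_HOUR)


print(subscriptions_needed(compliant=1500, non_compliant=500))  # 2
```

Because smaller regions might have lower limits and other jobs might be running, treat the result as a lower bound rather than a guarantee.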

Plan infrastructure and VM connectivity

After failover to Azure, your workloads need to operate as they did on-premises, and users need to be able to access the workloads running on the Azure VMs.

Learn more about failing over your Active Directory or DNS on-premises infrastructure to Azure.
Learn more about preparing to connect to Azure VMs after failover.

Plan for source capacity and requirements

It's important that you have sufficient configuration servers and scale-out process servers to meet capacity requirements. As you begin your large-scale deployment, start off with a single configuration server, and a single scale-out process server. As you reach the prescribed limits, add additional servers.

Set up a configuration server

Configuration server capacity is affected by the number of machines replicating, and not by data churn rate. To figure out whether you need additional configuration servers, use these defined VM limits.

CPU	Memory	Cache disk	Replicated machine limit
8 vCPUs (2 sockets * 4 cores @ 2.5 GHz)	16 GB	600 GB	Up to 550 machines. Assumes that each machine has three disks of 100 GB each.

These limits are based on a configuration server setup using an OVF template.
The limits assume that you're not using the process server that's running by default on the configuration server.
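The machine limit above gives a quick estimate of how many configuration servers a deployment needs. The following sketch assumes every configuration server matches the listed spec and that machines average about three 100-GB disks each. The function name is hypothetical.

```python
MACHINES_PER_CONFIG_SERVER = 550  # from the table above


def config_servers_needed(machine_count: int) -> int:
    """Return the number of configuration servers for the given machine count."""
    if machine_count <= 0:
        return 0
    # Integer ceiling division: any remainder needs one more server.
    return (machine_count + MACHINES_PER_CONFIG_SERVER - 1) // MACHINES_PER_CONFIG_SERVER


print(config_servers_needed(1200))  # 3
```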

If you need to add a new configuration server, follow these instructions:

Set up a configuration server for VMware VM disaster recovery, using an OVF template.
Set up a configuration server manually for physical servers, or for VMware deployments that can't use an OVF template.

As you set up a configuration server, note that:

When you set up a configuration server, it's important to consider the subscription and vault within which it resides, since these shouldn't be changed after setup. If you do need to change the vault, you have to disassociate the configuration server from the vault, and reregister it. This stops replication of VMs in the vault.
If you want to set up a configuration server with multiple network adapters, you should do this during setup. You can't do this after registering the configuration server in the vault.

Set up a process server

Process server capacity is affected by data churn rates, and not by the number of machines enabled for replication.

For large deployments you should always have at least one scale-out process server.
To figure out whether you need additional servers, use the following table.
We recommend that you add a server with the highest spec.

CPU	Memory	Cache disk	Churn rate
12 vCPUs (2 sockets * 6 cores @ 2.5 GHz)	24 GB	1 TB	Up to 2 TB a day

Set up the process server as follows:

Review the prerequisites.
Install the server in the portal, or from the command line.
Configure replicated machines to use the new server. If you already have machines replicating:
- You can move an entire process server workload to the new process server.
- Alternatively, you can move specific VMs to the new process server.
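When you decide which machines to move to a new process server, the daily churn limit from the table can guide the split. The following is a minimal sketch that uses a first-fit decreasing heuristic. It isn't how Site Recovery balances load. It assumes that per-machine churn is already known, and it treats 2 TB as 2000 GB. All names and numbers are hypothetical.

```python
def assign_to_process_servers(churn_gb_per_day, limit_gb=2000.0):
    """Group machines so each group's total daily churn stays within limit_gb.

    Uses first-fit decreasing: place the highest-churn machines first, each on
    the first server with room. A machine whose churn alone exceeds the limit
    still gets a server to itself.
    """
    loads = []   # total churn assigned to each server
    groups = []  # machine names assigned to each server
    ordered = sorted(churn_gb_per_day.items(), key=lambda kv: kv[1], reverse=True)
    for name, churn in ordered:
        for i, load in enumerate(loads):
            if load + churn <= limit_gb:
                loads[i] += churn
                groups[i].append(name)
                break
        else:
            # No existing server has room, so add another one.
            loads.append(churn)
            groups.append([name])
    return groups


churn = {"a": 1200, "b": 900, "c": 700, "d": 300}  # GB per day, illustrative
print(assign_to_process_servers(churn))  # [['a', 'c'], ['b', 'd']]
```

The number of groups returned is the estimated number of scale-out process servers for that set of machines.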

Enable large-scale replication

After planning capacity and deploying the required components and infrastructure, enable replication for large numbers of VMs.

Sort machines into batches. You enable replication for VMs within a batch, and then move on to the next batch.
- For physical machines, we recommend you identify batches based on machines that have a similar size and amount of data, and on available network throughput. The aim is to batch machines that are likely to finish their initial replication in around the same amount of time.
If disk churn for a machine is high, or exceeds limits in Deployment Planner, you can move non-critical files you don't need to replicate (such as log dumps or temp files) off the machine. For VMware VMs, you can move these files to a separate disk, and then exclude that disk from replication.
Before you enable replication, check that machines meet replication requirements.
Configure a replication policy for VMware VMs or physical servers.
Enable replication for VMware VMs or physical servers. This kicks off the initial replication for the selected machines.
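For physical machines, where batches are identified manually, a rough estimate of initial replication time can help with grouping. The following sketch sorts machines by data size and fills each batch until the total data no longer fits in the replication window (72 hours here, matching the default mentioned earlier). It assumes a constant throughput in Mbps, decimal units (1 GB = 8000 megabits), and no compression or ongoing churn, so real times will differ. All names and numbers are hypothetical.

```python
def batch_by_window(data_gb, throughput_mbps, window_hours=72.0):
    """Split machines into batches whose total data fits the replication window.

    Machines are sorted by size so that each batch holds machines of similar
    size. A machine larger than the window capacity gets a batch of its own.
    """
    # Data (in GB) that can be transferred in the window at this throughput.
    capacity_gb = throughput_mbps * 3600 * window_hours / 8000
    batches, current, current_gb = [], [], 0.0
    for name, size in sorted(data_gb.items(), key=lambda kv: kv[1]):
        if current and current_gb + size > capacity_gb:
            batches.append(current)
            current, current_gb = [], 0.0
        current.append(name)
        current_gb += size
    if current:
        batches.append(current)
    return batches


sizes = {"m1": 1500, "m2": 1200, "m3": 900, "m4": 600, "m5": 300}  # GB
print(batch_by_window(sizes, throughput_mbps=100))
# [['m5', 'm4', 'm3', 'm2'], ['m1']]
```

At 100 Mbps, the 72-hour window holds about 3240 GB, so the four smaller machines share a batch and the largest one starts the next.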

Monitor your deployment

After you kick off replication for the first batch of VMs, start monitoring your deployment as follows:

Assign a disaster recovery administrator to monitor the health status of replicated machines.
Monitor events for replicated items and the infrastructure.
Monitor the health of your scale-out process servers.
Conduct regular disaster recovery drills, to ensure that everything's working as expected.

Plan for large-scale failovers

In the event of a disaster, you might need to fail over a large number of machines and workloads to Azure.

You can prepare in advance for this type of event as follows:

Prepare your infrastructure and VMs so that your workloads will be available after failover, and so that users can access the Azure VMs.
Note the failover limits earlier in this document. Make sure your failovers will fall within these limits.
Run regular disaster recovery drills. Drills help to:
- Find gaps in your deployment before failover.
- Estimate end-to-end RTO for your apps.
- Estimate end-to-end RPO for your workloads.
- Identify IP address range conflicts.
- As you run drills, we recommend that you don't use production networks, and that you clean up test failovers after every drill.

To run a large-scale failover, we recommend the following:

Create recovery plans for workload failover.
- Each recovery plan can trigger failover of up to 100 machines.
- Learn more about recovery plans.
Add Azure Automation runbook scripts to recovery plans, to automate any manual tasks on Azure. Typical tasks include configuring load balancers and updating DNS. Learn more about runbooks.
Before failover, prepare Windows machines so that they comply with the Azure environment. Failover limits are higher for machines that comply.
Trigger failover with the Start-AzRecoveryServicesAsrPlannedFailoverJob PowerShell cmdlet, together with a recovery plan.
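Because each recovery plan can trigger failover of up to 100 machines, a large workload has to be spread across several plans. The following is a minimal sketch of that split. It only groups machine names in order; real plans would also account for app dependencies and boot order. The function name and machine names are hypothetical.

```python
MAX_MACHINES_PER_PLAN = 100  # limit stated above


def split_into_recovery_plans(machines, max_per_plan=MAX_MACHINES_PER_PLAN):
    """Split a list of machine names into groups of at most max_per_plan."""
    return [
        machines[start:start + max_per_plan]
        for start in range(0, len(machines), max_per_plan)
    ]


names = [f"vm-{i:04d}" for i in range(250)]  # illustrative machine names
plans = split_into_recovery_plans(names)
print([len(plan) for plan in plans])  # [100, 100, 50]
```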

Next steps

Monitor Site Recovery