Set up disaster recovery at scale for VMware VMs/physical servers

This article describes how to set up disaster recovery to Azure for large numbers (> 1000) of on-premises VMware VMs or physical servers in your production environment, using the Azure Site Recovery service.

Define your BCDR strategy

As part of your business continuity and disaster recovery (BCDR) strategy, you define recovery point objectives (RPOs) and recovery time objectives (RTOs) for your business apps and workloads. RTO measures the duration of time and service level within which a business app or process must be restored and available, in order to avoid continuity issues.

  • Site Recovery provides continuous replication for VMware VMs and physical servers, and an SLA for RTO.
  • As you plan for large-scale disaster recovery for VMware VMs and figure out the Azure resources you need, you can specify an RTO value that will be used for capacity calculations.

Best practices

The following are some general best practices for large-scale disaster recovery. They're discussed in more detail in the later sections of this document.

  • Identify target requirements: Estimate capacity and resource needs in Azure before you set up disaster recovery.
  • Plan for Site Recovery components: Figure out what Site Recovery components (configuration server, process servers) you need to meet your estimated capacity.
  • Set up one or more scale-out process servers: Don't use the process server that's running by default on the configuration server.
  • Run the latest updates: The Site Recovery team releases new versions of Site Recovery components on a regular basis, and you should make sure you're running the latest versions. To help with that, track what's new for updates, and enable and install updates as they're released.
  • Monitor proactively: As you get disaster recovery up and running, you should proactively monitor the status and health of replicated machines and infrastructure resources.
  • Disaster recovery drills: You should run disaster recovery drills on a regular basis. These drills don't affect your production environment, but they help ensure that failover to Azure will work as expected when needed.

Gather capacity planning information

Gather information about your on-premises environment, to help assess and estimate your target (Azure) capacity needs.

  • For VMware, run the Deployment Planner for VMware VMs to do this.
  • For physical servers, gather the information manually.

Run the Deployment Planner for VMware VMs

The Deployment Planner helps you to gather information about your VMware on-premises environment.

  • Run the Deployment Planner during a period that represents typical churn for your VMs. This will generate more accurate estimates and recommendations.
  • We recommend that you run the Deployment Planner on the configuration server machine, since the Planner calculates throughput from the server on which it's running. Learn more about measuring throughput.
  • If you don't yet have a configuration server set up:

Then run the Planner as follows:

  1. Learn about the Deployment Planner. You can download the latest version from the portal, or download it directly.
  2. Review the prerequisites and latest updates for the Deployment Planner, and download and extract the tool.
  3. Run the Deployment Planner on the configuration server.
  4. Generate a report to summarize estimations and recommendations.
  5. Analyze the report recommendations and cost estimations.

Note

By default, the tool is configured to profile and generate reports for up to 1000 VMs. You can change this limit by increasing the MaxVMsSupported key value in the ASRDeploymentPlanner.exe.config file.
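As a rough illustration of the change described in the note, the sketch below updates the key with Python's standard XML parser. It assumes the config file follows the usual .NET appSettings layout (`<add key="..." value="..." />`); the sample XML is a hypothetical, trimmed-down stand-in, not the real file contents, and the key is matched case-insensitively in case the file spells it differently.

```python
import xml.etree.ElementTree as ET


def set_max_vms(config_xml: str, new_limit: int, key: str = "MaxVMsSupported") -> str:
    """Return the config text with the given appSettings key set to new_limit."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("add"):
        # Match case-insensitively; the exact spelling in the file is an assumption.
        if entry.get("key", "").lower() == key.lower():
            entry.set("value", str(new_limit))
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"{key} not found in config")


# Hypothetical sample used only to exercise the function.
SAMPLE = """<configuration>
  <appSettings>
    <add key="MaxVMsSupported" value="1000" />
  </appSettings>
</configuration>"""

updated = set_max_vms(SAMPLE, 1500)
```

ElementTree drops XML comments and the XML declaration when it re-serializes, so for a one-off change, editing the file by hand (after backing it up) may be the simpler option.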

Plan target (Azure) requirements and capacity

Using your gathered estimations and recommendations, you can plan for target resources and capacity. If you ran the Deployment Planner for VMware VMs, you can use a number of the report recommendations to help you.

  • Compatible VMs: Use this number to identify the number of VMs that are ready for disaster recovery to Azure. Recommendations about network bandwidth and Azure cores are based on this number.
  • Required network bandwidth: Note the bandwidth you need for delta replication of compatible VMs.
    • When you run the Planner, you specify the desired RPO in minutes. The recommendations show you the bandwidth needed to meet that RPO 100% and 90% of the time.
    • The network bandwidth recommendations take into account the bandwidth needed for the total number of configuration servers and process servers recommended in the Planner.
  • Required Azure cores: Note the number of cores you need in the target Azure region, based on the number of compatible VMs. If you don't have enough cores, Site Recovery won't be able to create the required Azure VMs at failover.
  • Recommended VM batch size: The recommended batch size is based on the ability to finish initial replication for the batch within 72 hours by default, while meeting the desired RPO 100% of the time. The hour value can be modified.

You can use these recommendations to plan for Azure resources, network bandwidth, and VM batching.
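To get a feel for the 72-hour window behind the batch-size recommendation, a back-of-the-envelope transfer-time check can help. The sketch below is only an approximation and not how the Deployment Planner computes its numbers: it assumes decimal units (1 GB = 1000 MB), that the whole link is available to replication, and that compression and protocol overhead are ignored. The function names are illustrative.

```python
def initial_replication_hours(total_data_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move total_data_gb over a link of bandwidth_mbps (megabits/s)."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    megabits = total_data_gb * 1000 * 8  # decimal GB -> megabytes -> megabits
    seconds = megabits / bandwidth_mbps
    return seconds / 3600


def fits_window(total_data_gb: float, bandwidth_mbps: float, window_hours: float = 72) -> bool:
    """True if the batch's initial replication would finish inside the window."""
    return initial_replication_hours(total_data_gb, bandwidth_mbps) <= window_hours
```

For example, `fits_window(4500, 100)` evaluates a 4.5 TB batch over a 100 Mbps link; if it returns False, the batch is probably too large for that link and window under these assumptions.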

Plan Azure subscriptions and quotas

Make sure that the available quotas in the target subscription are sufficient to handle failover.

| Task | Details | Action |
| --- | --- | --- |
| Check cores | If cores in the available quota don't equal or exceed the total target count at the time of failover, failovers will fail. | For VMware VMs, check that you have enough cores in the target subscription to meet the Deployment Planner core recommendation. For physical servers, check that Azure cores meet your manual estimations. To check quotas, in the Azure portal go to Subscription and select Usage + quotas. Learn more about increasing quotas. |
| Check failover limits | The number of failovers mustn't exceed Site Recovery failover limits. | If failovers exceed the limits, you can add subscriptions and fail over to multiple subscriptions, or increase the quota for a subscription. |
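The core check in the table comes down to simple arithmetic. The sketch below assumes you've read the quota limit and current usage from the Usage + quotas page and have a required-core figure from the Deployment Planner report or a manual estimate; the function name and parameters are illustrative, not part of any Azure API.

```python
def core_shortfall(required_cores: int, quota_limit: int, cores_in_use: int) -> int:
    """Return how many extra cores the subscription needs for failover (0 if none)."""
    available = quota_limit - cores_in_use
    return max(0, required_cores - available)
```

A non-zero result means failovers would fail for lack of cores, so you'd request a quota increase of at least that amount or spread the failover across more subscriptions.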

Failover limits

The limits indicate the number of failovers that are supported by Site Recovery within one hour, assuming three disks per machine.

What does comply mean? To start an Azure VM, Azure requires some drivers to be in the boot start state, and services like DHCP to be set to start automatically.

  • Machines that comply will already have these settings in place.
  • For machines running Windows, you can proactively check compliance, and make them compliant if needed. Learn more.
  • Linux machines are only brought into compliance at the time of failover.
| Machine complies with Azure? | Azure VM limits (managed disk failover) |
| --- | --- |
| Yes | 2000 |
| No | 1000 |
  • Limits assume that minimal other jobs are in progress in the target region for the subscription.
  • Some Azure regions are smaller, and might have slightly lower limits.
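The hourly limits above can be turned into a rough lower bound on how long a large failover takes in one subscription. The sketch below makes a simplifying assumption that isn't stated in the source: a non-compliant machine is treated as consuming twice the hourly budget of a compliant one, so a mixed workload shares the budget linearly. Treat the result as a planning estimate only.

```python
COMPLIANT_PER_HOUR = 2000      # from the table: machines that comply
NON_COMPLIANT_PER_HOUR = 1000  # from the table: machines that don't comply


def min_failover_hours(compliant: int, non_compliant: int) -> int:
    """Lower-bound estimate of hours needed to fail over in a single subscription."""
    # Weight each non-compliant machine by the ratio of the two limits (2x here).
    weight = COMPLIANT_PER_HOUR // NON_COMPLIANT_PER_HOUR
    weighted = compliant + weight * non_compliant
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-weighted // COMPLIANT_PER_HOUR)
```

If the estimate exceeds your RTO, the options described earlier apply: fail over to multiple subscriptions, or make more Windows machines compliant ahead of time.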

Plan infrastructure and VM connectivity

After failover to Azure, you need your workloads to operate as they did on-premises, and users need to be able to access workloads running on the Azure VMs.

  • Learn more about failing over your Active Directory or DNS on-premises infrastructure to Azure.
  • Learn more about preparing to connect to Azure VMs after failover.

Plan for source capacity and requirements

It's important that you have sufficient configuration servers and scale-out process servers to meet capacity requirements. As you begin your large-scale deployment, start off with a single configuration server and a single scale-out process server. As you reach the prescribed limits, add more servers.

Note

For VMware VMs, the Deployment Planner makes some recommendations about the configuration and process servers you need. We recommend that you use the tables included in the following procedures, instead of following the Deployment Planner recommendation.

Set up a configuration server

Configuration server capacity is affected by the number of machines replicating, and not by data churn rate. To figure out whether you need additional configuration servers, use these defined VM limits.

| CPU | Memory | Cache disk | Replicated machine limit |
| --- | --- | --- | --- |
| 8 vCPUs (2 sockets * 4 cores @ 2.5 GHz) | 16 GB | 600 GB | Up to 550 machines. Assumes that each machine has three disks of 100 GB each. |
  • These limits are based on a configuration server set up using an OVF template.
  • The limits assume that you're not using the process server that's running by default on the configuration server.
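Given the 550-machine limit in the table, the number of configuration servers for a deployment is a ceiling division. This is a minimal sketch under the table's own assumptions (OVF-based configuration server, three 100 GB disks per machine); the function name is illustrative.

```python
MACHINES_PER_CONFIG_SERVER = 550  # replicated machine limit from the table


def config_servers_needed(replicated_machines: int) -> int:
    """Minimum number of configuration servers for the given machine count."""
    if replicated_machines < 0:
        raise ValueError("replicated_machines must not be negative")
    # Ceiling division; a deployment always starts with at least one server.
    return max(1, -(-replicated_machines // MACHINES_PER_CONFIG_SERVER))
```

For instance, `config_servers_needed(1200)` covers the "more than 1000 machines" scenario this article targets.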

If you need to add a new configuration server, follow these instructions:

As you set up a configuration server, note that:

  • When you set up a configuration server, it's important to consider the subscription and vault within which it resides, since these shouldn't be changed after setup. If you do need to change the vault, you have to disassociate the configuration server from the vault, and reregister it. This stops replication of VMs in the vault.
  • If you want to set up a configuration server with multiple network adapters, you should do this during setup. You can't do this after registering the configuration server in the vault.

Set up a process server

Process server capacity is affected by data churn rates, and not by the number of machines enabled for replication.

  • For large deployments, you should always have at least one scale-out process server.
  • To figure out whether you need additional servers, use the following table.
  • We recommend that you add a server with the highest spec.
| CPU | Memory | Cache disk | Churn rate |
| --- | --- | --- | --- |
| 12 vCPUs (2 sockets * 6 cores @ 2.5 GHz) | 24 GB | 1 GB | Up to 2 TB a day |
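Because process server capacity is driven by churn, one way to plan scale-out servers is to pack machines onto servers so that no server exceeds the 2 TB a day limit. The sketch below uses a first-fit-decreasing heuristic, which is a generic bin-packing approach chosen for illustration and not something Site Recovery does for you. It assumes per-machine daily churn in GB is known (for example, from profiling) and uses 2 TB = 2000 GB.

```python
from __future__ import annotations


def assign_to_process_servers(
    churn_gb_by_machine: dict[str, float], limit_gb: float = 2000.0
) -> list[list[str]]:
    """Group machines so each group's total daily churn stays within limit_gb."""
    servers: list[list[str]] = []
    loads: list[float] = []
    # Place the highest-churn machines first (first-fit decreasing).
    ordered = sorted(churn_gb_by_machine.items(), key=lambda kv: kv[1], reverse=True)
    for name, churn in ordered:
        if churn > limit_gb:
            raise ValueError(f"{name} alone exceeds the per-server churn limit")
        for i, load in enumerate(loads):
            if load + churn <= limit_gb:
                servers[i].append(name)
                loads[i] += churn
                break
        else:
            # No existing server has room, so plan for another one.
            servers.append([name])
            loads.append(churn)
    return servers
```

The number of groups returned is the number of scale-out process servers this heuristic would plan for; it isn't guaranteed to be the theoretical minimum.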

Set up the process server as follows:

  1. Review the prerequisites.
  2. Install the server in the portal, or from the command line.
  3. Configure replicated machines to use the new server. If you already have machines replicating:
    • You can move an entire process server workload to the new process server.
    • Alternatively, you can move specific VMs to the new process server.

Enable large-scale replication

After planning capacity and deploying the required components and infrastructure, enable replication for large numbers of VMs.

  1. Sort machines into batches. You enable replication for VMs within a batch, and then move on to the next batch.

    • For VMware VMs, you can use the recommended VM batch size in the Deployment Planner report.
    • For physical machines, we recommend that you identify batches based on machines that have a similar size and amount of data, and on available network throughput. The aim is to batch machines that are likely to finish their initial replication in around the same amount of time.
  2. If disk churn for a machine is high, or exceeds limits in the Deployment Planner, you can move non-critical files you don't need to replicate (such as log dumps or temp files) off the machine. For VMware VMs, you can move these files to a separate disk, and then exclude that disk from replication.

  3. Before you enable replication, check that machines meet replication requirements.

  4. Configure a replication policy for VMware VMs or physical servers.

  5. Enable replication for VMware VMs or physical servers. This kicks off the initial replication for the selected machines.
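The batching advice in step 1 (group machines with a similar amount of data) can be sketched as a sort followed by fixed-size slicing. This is one simple way to do it, assuming you have a data size per machine and a batch size (for VMware VMs, the one recommended in the Deployment Planner report); the machine names and sizes in the example are made up.

```python
from __future__ import annotations


def make_batches(machines: list[tuple[str, float]], batch_size: int) -> list[list[str]]:
    """Sort machines by data size (GB) and slice them into batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Sorting first puts machines with similar data volumes into the same batch.
    ordered = sorted(machines, key=lambda m: m[1])
    return [
        [name for name, _ in ordered[i:i + batch_size]]
        for i in range(0, len(ordered), batch_size)
    ]


# Hypothetical inventory: (machine name, data size in GB).
inventory = [("web1", 300), ("db1", 900), ("web2", 280), ("db2", 950), ("app1", 500)]
batches = make_batches(inventory, 2)
```

You'd then enable replication for one batch, wait for its initial replication to complete, and move on to the next.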

Monitor your deployment

After you kick off replication for the first batch of VMs, start monitoring your deployment as follows:

  1. Assign a disaster recovery administrator to monitor the health status of replicated machines.
  2. Monitor events for replicated items and the infrastructure.
  3. Monitor the health of your scale-out process servers.
  4. Sign up to get email notifications for events, for easier monitoring.
  5. Conduct regular disaster recovery drills, to ensure that everything's working as expected.

Plan for large-scale failovers

In the event of a disaster, you might need to fail over a large number of machines/workloads to Azure. Prepare for this type of event as follows.

You can prepare in advance for failover as follows:

  • Prepare your infrastructure and VMs so that your workloads will be available after failover, and so that users can access the Azure VMs.
  • Note the failover limits earlier in this document. Make sure your failovers will fall within these limits.
  • Run regular disaster recovery drills. Drills help to:
    • Find gaps in your deployment before failover.
    • Estimate end-to-end RTO for your apps.
    • Estimate end-to-end RPO for your workloads.
    • Identify IP address range conflicts.
  • As you run drills, we recommend that you don't use production networks, that you avoid using the same subnet names in production and test networks, and that you clean up test failovers after every drill.

To run a large-scale failover, we recommend the following:

  1. Create recovery plans for workload failover.
    • Each recovery plan can trigger failover of up to 100 machines.
    • Learn more about recovery plans.
  2. Add Azure Automation runbook scripts to recovery plans, to automate any manual tasks on Azure. Typical tasks include configuring load balancers, updating DNS, and so on. Learn more.
  3. Before failover, prepare Windows machines so that they comply with the Azure environment. Failover limits are higher for machines that comply. Learn more about runbooks.
  4. Trigger failover with the Start-AzRecoveryServicesAsrPlannedFailoverJob PowerShell cmdlet, together with a recovery plan.
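Since each recovery plan can fail over at most 100 machines, a large deployment needs its machines divided across several plans. The sketch below packs whole applications into plans in order, so that an application's machines stay in the same plan. Keeping an application together is a design choice made here for illustration, not a Site Recovery requirement, and the application names are hypothetical.

```python
from __future__ import annotations

MAX_MACHINES_PER_PLAN = 100  # per-plan limit stated in step 1


def build_recovery_plans(
    apps: dict[str, list[str]], max_per_plan: int = MAX_MACHINES_PER_PLAN
) -> list[list[str]]:
    """Pack applications into plans without splitting an app or exceeding the limit."""
    plans: list[list[str]] = []
    current: list[str] = []
    for app, machines in apps.items():
        if len(machines) > max_per_plan:
            raise ValueError(f"{app} has more machines than one plan can hold")
        if len(current) + len(machines) > max_per_plan:
            # Current plan is full enough; start a new one for this app.
            plans.append(current)
            current = []
        current.extend(machines)
    if current:
        plans.append(current)
    return plans
```

Each resulting group would map to one recovery plan, which you could then trigger with the cmdlet named in step 4.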

Next steps