VMware to Azure disaster recovery architecture

This article describes the architecture and processes used when you deploy disaster recovery replication, failover, and recovery of VMware virtual machines (VMs) between an on-premises VMware site and Azure by using the Azure Site Recovery service.

Architectural components

The following table and graphic provide a high-level view of the components used for VMware disaster recovery to Azure.

| Component | Requirement | Details |
| --- | --- | --- |
| Azure | An Azure subscription, an Azure Storage account for cache, a managed disk, and an Azure network. | Replicated data from on-premises VMs is stored in Azure storage. Azure VMs are created with the replicated data when you run a failover from on-premises to Azure. The Azure VMs connect to the Azure virtual network when they're created. |
| Configuration server machine | A single on-premises machine. We recommend that you run it as a VMware VM that can be deployed from a downloaded OVF template. The machine runs all on-premises Site Recovery components, which include the configuration server, process server, and master target server. | Configuration server: Coordinates communications between on-premises and Azure, and manages data replication. Process server: Installed by default on the configuration server. It receives replication data; optimizes it with caching, compression, and encryption; and sends it to Azure Storage. The process server also installs the Azure Site Recovery Mobility service on VMs you want to replicate, and performs automatic discovery of on-premises machines. As your deployment grows, you can add separate process servers to handle larger volumes of replication traffic. Master target server: Installed by default on the configuration server. It handles replication data during failback from Azure. For large deployments, you can add a separate master target server for failback. |
| VMware servers | VMware VMs are hosted on on-premises vSphere ESXi servers. We recommend a vCenter server to manage the hosts. | During Site Recovery deployment, you add VMware servers to the Recovery Services vault. |
| Replicated machines | The Mobility service is installed on each VMware VM that you replicate. | We recommend that you allow automatic installation from the process server. Alternatively, you can install the service manually or use an automated deployment method, such as Configuration Manager. |

Diagram showing the VMware to Azure replication architecture relationships.

Set up outbound network connectivity

For Site Recovery to work as expected, you need to modify outbound network connectivity to allow your environment to replicate.

Note

Site Recovery doesn't support using an authentication proxy to control network connectivity.

Outbound connectivity for URLs

If you're using a URL-based firewall proxy to control outbound connectivity, allow access to these URLs:

| Name | Azure China 21Vianet | Description |
| --- | --- | --- |
| Storage | *.blob.core.chinacloudapi.cn | Allows data to be written from the VM to the cache storage account in the source region. |
| Azure Active Directory | login.chinacloudapi.cn | Provides authorization and authentication to Site Recovery service URLs. |
| Replication | *.hypervrecoverymanager.windowsazure.cn | Allows the VM to communicate with the Site Recovery service. |
| Service Bus | *.servicebus.chinacloudapi.cn | Allows the VM to write Site Recovery monitoring and diagnostics data. |
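
When reviewing proxy or firewall logs, it can help to check which of these rows, if any, covers a given hostname. The following is a minimal sketch that assumes shell-style wildcard matching is close enough to how your proxy interprets `*`. The function name `matching_rule` and the sample hostname are hypothetical; only the URL patterns come from the table.

```python
from fnmatch import fnmatchcase
from typing import Optional

# URL patterns copied from the table (Azure China 21Vianet endpoints).
ALLOWED_PATTERNS = {
    "Storage": "*.blob.core.chinacloudapi.cn",
    "Azure Active Directory": "login.chinacloudapi.cn",
    "Replication": "*.hypervrecoverymanager.windowsazure.cn",
    "Service Bus": "*.servicebus.chinacloudapi.cn",
}


def matching_rule(hostname: str) -> Optional[str]:
    """Return the name of the table row that covers hostname, or None.

    Assumption: '*' is treated as a shell-style wildcard, so it also
    matches multi-label prefixes. A real proxy may interpret it differently.
    """
    host = hostname.lower().rstrip(".")
    for name, pattern in ALLOWED_PATTERNS.items():
        if fnmatchcase(host, pattern):
            return name
    return None


if __name__ == "__main__":
    # "contosocache" is a made-up storage account name.
    for host in ("contosocache.blob.core.chinacloudapi.cn", "example.com"):
        print(host, "->", matching_rule(host))
```

A hostname that returns `None` here isn't covered by any row in the table, which is a hint that the proxy would block it unless another rule allows it.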

Replication process

  1. When you enable replication for a VM, initial replication to Azure storage begins, using the specified replication policy. Note the following:

    • For VMware VMs, replication is block-level and near-continuous, using the Mobility service agent running on the VM.
    • Any replication policy settings are applied:
      • RPO threshold. This setting doesn't affect replication; it helps with monitoring. If the current RPO exceeds the threshold limit that you specify, an event is raised and, optionally, an email is sent.
      • Recovery point retention. This setting specifies how far back in time you want to go when a disruption occurs. Maximum retention is 24 hours on premium storage and 72 hours on standard storage.
      • App-consistent snapshots. App-consistent snapshots can be taken every 1 to 12 hours, depending on your app needs. Snapshots are standard Azure blob snapshots. The Mobility agent running on a VM requests a VSS snapshot in accordance with this setting, and bookmarks that point in time as an application-consistent point in the replication stream.
  2. Traffic replicates to Azure storage public endpoints over the internet. Alternatively, you can use Azure ExpressRoute with Azure peering. Replicating traffic over a site-to-site virtual private network (VPN) from an on-premises site to Azure isn't supported.

  3. The initial replication operation ensures that all data on the machine at the time replication is enabled is sent to Azure. After initial replication finishes, replication of delta changes to Azure begins. Tracked changes for a machine are sent to the process server.

  4. Communication happens as follows:

    • VMs communicate with the on-premises configuration server on port HTTPS 443 inbound, for replication management.
    • The configuration server orchestrates replication with Azure over port HTTPS 443 outbound.
    • VMs send replication data to the process server (running on the configuration server machine) on port HTTPS 9443 inbound. This port can be modified.
    • The process server receives replication data, optimizes and encrypts it, and sends it to Azure storage over port 443 outbound.
  5. The replication data logs first land in a cache storage account in Azure. These logs are processed, and the data is stored in an Azure managed disk (called the asr seed disk). The recovery points are created on this disk.

Diagram showing the VMware to Azure replication process.
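
The communication paths in the replication process can be written down as data, which makes it easier to derive firewall rules per component. This is only an illustrative sketch: the `Flow` class, the helper functions, and the component labels are hypothetical, and the ports are the defaults stated in this article (9443 can be changed).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Flow:
    source: str
    destination: str
    port: int
    purpose: str


# Flows transcribed from the communication list in the replication process.
# The component labels are illustrative strings, not Site Recovery identifiers.
FLOWS: List[Flow] = [
    Flow("vm", "configuration server", 443, "replication management"),
    Flow("configuration server", "azure", 443, "replication orchestration"),
    Flow("vm", "process server", 9443, "replication data (port is modifiable)"),
    Flow("process server", "azure storage", 443, "optimized, encrypted data"),
]


def inbound_ports(component: str) -> Set[int]:
    """Ports on which a component must accept connections."""
    return {flow.port for flow in FLOWS if flow.destination == component}


def outbound_ports(component: str) -> Set[int]:
    """Ports on which a component must be allowed to connect out."""
    return {flow.port for flow in FLOWS if flow.source == component}


if __name__ == "__main__":
    for name in ("vm", "configuration server", "process server"):
        print(name, "in:", sorted(inbound_ports(name)),
              "out:", sorted(outbound_ports(name)))
```

Because the process server runs on the configuration server machine by default, that single machine needs both sets of inbound ports open in a default deployment.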

Resynchronization process

  1. At times, during initial replication or while transferring delta changes, there can be network connectivity issues between the source machine and the process server, or between the process server and Azure. Either of these can cause momentary failures in data transfer to Azure.
  2. To avoid data integrity issues and minimize data transfer costs, Site Recovery marks a machine for resynchronization.
  3. A machine can also be marked for resynchronization in situations like the following, to maintain consistency between the source machine and the data stored in Azure:
    • If a machine undergoes a forced shutdown.
    • If a machine undergoes configuration changes such as disk resizing (for example, changing the size of a disk from 2 TB to 4 TB).
  4. Resynchronization sends only delta data to Azure. Data transfer between on-premises and Azure is minimized by comparing checksums of the data on the source machine with checksums of the data stored in Azure.
  5. By default, resynchronization is scheduled to run automatically outside office hours. If you don't want to wait for the default resynchronization, you can resynchronize a VM manually. To do this, go to the Azure portal, and select the VM > Resynchronize.
  6. If default resynchronization fails outside office hours and manual intervention is required, an error is generated on the specific machine in the Azure portal. You can resolve the error and trigger the resynchronization manually.
  7. After resynchronization completes, replication of delta changes resumes.
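
The checksum comparison in step 4 can be illustrated with a generic block-level sketch. This is not Site Recovery's implementation: the article doesn't state the block size or checksum algorithm, so the 4-byte blocks, SHA-256, and the function names below are assumptions chosen to keep the example small.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Checksum every fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def blocks_to_resend(source: bytes, target: bytes,
                     block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indexes of source blocks whose checksum differs from the target.

    Blocks that exist on the source but not on the target (for example,
    after a disk was enlarged) are also returned.
    """
    source_sums = block_checksums(source, block_size)
    target_sums = block_checksums(target, block_size)
    changed = []
    for index, checksum in enumerate(source_sums):
        if index >= len(target_sums) or checksum != target_sums[index]:
            changed.append(index)
    return changed


if __name__ == "__main__":
    on_premises = b"AAAABBBBCCCCDDDD"
    in_azure = b"AAAAXXXXCCCC"
    # Block 1 differs and block 3 is missing on the target.
    print(blocks_to_resend(on_premises, in_azure))
```

Only the checksums need to be compared across the network; block contents are sent just for the indexes returned, which is the sense in which resynchronization sends delta data only.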

Replication policy

When you enable VM replication, Site Recovery by default creates a new replication policy with the default settings summarized in the following table.

| Policy setting | Details | Default |
| --- | --- | --- |
| Recovery point retention | Specifies how long Site Recovery keeps recovery points. | 24 hours |
| App-consistent snapshot frequency | How often Site Recovery takes an app-consistent snapshot. | Every four hours |
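
The defaults in the table, together with the limits this article states for these settings (app-consistent snapshots every 1 to 12 hours, a frequency shorter than the retention period, and a maximum retention of 24 hours on premium storage or 72 hours on standard storage), can be captured in a small validation sketch. The class, its field names, and the `storage_type` field are hypothetical and don't correspond to a Site Recovery API.

```python
from dataclasses import dataclass
from typing import List

# Maximum retention per storage type, as stated in this article.
MAX_RETENTION_HOURS = {"premium": 24, "standard": 72}


@dataclass(frozen=True)
class ReplicationPolicy:
    recovery_point_retention_hours: int = 24  # default from the table
    app_consistent_frequency_hours: int = 4   # default from the table
    storage_type: str = "standard"            # hypothetical field

    def problems(self) -> List[str]:
        """Return human-readable rule violations; empty means valid."""
        issues = []
        limit = MAX_RETENTION_HOURS[self.storage_type]
        if self.recovery_point_retention_hours > limit:
            issues.append(
                f"retention exceeds {limit} hours for {self.storage_type} storage")
        if not 1 <= self.app_consistent_frequency_hours <= 12:
            issues.append("app-consistent frequency must be 1 to 12 hours")
        if self.app_consistent_frequency_hours >= self.recovery_point_retention_hours:
            issues.append("app-consistent frequency must be less than retention")
        return issues


if __name__ == "__main__":
    print(ReplicationPolicy().problems())
    print(ReplicationPolicy(recovery_point_retention_hours=48,
                            storage_type="premium").problems())
```

Returning a list of problems rather than raising on the first one lets a caller report every violated rule at once.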

Managing replication policies

You can manage and modify the default replication policy settings as follows:

  • You can modify the settings as you enable replication.
  • You can create a replication policy at any time, and then apply it when you enable replication.

Multi-VM consistency

If you want VMs to replicate together and share crash-consistent and app-consistent recovery points at failover, you can gather them into a replication group. Multi-VM consistency affects workload performance, so use it only for VMs running workloads that need consistency across all machines.

Snapshots and recovery points

Recovery points are created from snapshots of VM disks taken at a specific point in time. When you fail over a VM, you use a recovery point to restore the VM in the target location.

When failing over, you generally want to ensure that the VM starts with no corruption or data loss, and that the VM data is consistent for the operating system and for the apps that run on the VM. This depends on the type of snapshots taken.

Site Recovery takes snapshots as follows:

  1. Site Recovery takes crash-consistent snapshots of data by default, and app-consistent snapshots if you specify a frequency for them.
  2. Recovery points are created from the snapshots, and stored in accordance with the retention settings in the replication policy.

Consistency

The following table explains different types of consistency.

Crash-consistent

| Description | Details | Recommendation |
| --- | --- | --- |
| A crash-consistent snapshot captures data that was on the disk when the snapshot was taken. It doesn't include anything in memory. It contains the equivalent of the on-disk data that would be present if the VM crashed or the power cord was pulled from the server at the instant that the snapshot was taken. A crash-consistent snapshot doesn't guarantee data consistency for the operating system, or for apps on the VM. | Site Recovery creates crash-consistent recovery points every five minutes by default. This setting can't be modified. | Today, most apps can recover well from crash-consistent points. Crash-consistent recovery points are usually sufficient for the replication of operating systems, and apps such as DHCP servers and print servers. |

App-consistent

| Description | Details | Recommendation |
| --- | --- | --- |
| App-consistent recovery points are created from app-consistent snapshots. An app-consistent snapshot contains all the information in a crash-consistent snapshot, plus all the data in memory and transactions in progress. | App-consistent snapshots use the Volume Shadow Copy Service (VSS): 1) Azure Site Recovery uses the Copy Only backup (VSS_BT_COPY) method, which doesn't change Azure SQL's transaction log backup time and sequence number. 2) When a snapshot is initiated, VSS performs a copy-on-write (COW) operation on the volume. 3) Before it performs the COW, VSS informs every app on the machine that it needs to flush its memory-resident data to disk. 4) VSS then allows the backup/disaster recovery app (in this case Site Recovery) to read the snapshot data and proceed. | App-consistent snapshots are taken in accordance with the frequency you specify. This frequency should always be less than the period you set for retaining recovery points. For example, if you retain recovery points using the default setting of 24 hours, you should set the frequency at less than 24 hours. App-consistent snapshots are more complex and take longer to complete than crash-consistent snapshots. They affect the performance of apps running on a VM enabled for replication. |
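
Because an app-consistent snapshot contains everything a crash-consistent snapshot does, choosing a recovery point comes down to whether the workload needs app consistency or simply the most recent point. The sketch below illustrates that choice; the `RecoveryPoint` class, the `latest_point` function, and the sample timestamps are hypothetical and aren't taken from Site Recovery.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool  # False means crash-consistent only


def latest_point(points: Iterable[RecoveryPoint],
                 require_app_consistent: bool = False) -> Optional[RecoveryPoint]:
    """Pick the most recent recovery point that meets the consistency need.

    An app-consistent point also satisfies a crash-consistent request,
    so it is only filtered out when it is older than another candidate.
    """
    candidates = [
        point for point in points
        if point.app_consistent or not require_app_consistent
    ]
    return max(candidates, key=lambda point: point.taken_at, default=None)


# Made-up sample data: one app-consistent point, then two crash-consistent ones.
POINTS = [
    RecoveryPoint(datetime(2024, 1, 1, 8, 0), app_consistent=True),
    RecoveryPoint(datetime(2024, 1, 1, 11, 55), app_consistent=False),
    RecoveryPoint(datetime(2024, 1, 1, 12, 0), app_consistent=False),
]

if __name__ == "__main__":
    print(latest_point(POINTS))
    print(latest_point(POINTS, require_app_consistent=True))
```

With this sample data, requiring app consistency means going back to the 08:00 point, which shows the trade-off between consistency and recency that the two tables describe.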

Failover and failback process

After replication is set up and you've run a disaster recovery drill (test failover) to check that everything's working as expected, you can run failover and failback as needed.

  1. You run failover for a single machine, or create a recovery plan to fail over multiple VMs at the same time. The advantages of a recovery plan over single-machine failover include:

    • You can model app dependencies by including all the VMs across the app in a single recovery plan.
    • You can add scripts, Azure runbooks, and pauses for manual actions.
  2. After triggering the initial failover, you commit it to start accessing the workload from the Azure VM.

  3. When your primary on-premises site is available again, you can prepare for failback. To fail back, you need to set up a failback infrastructure, including:

    • Temporary process server in Azure: To fail back from Azure, you set up an Azure VM to act as a process server to handle replication from Azure. You can delete this VM after failback finishes.
    • VPN connection: To fail back, you need a VPN connection (or ExpressRoute) from the Azure network to the on-premises site.
    • Separate master target server: By default, the master target server that was installed with the configuration server on the on-premises VMware VM handles failback. If you need to fail back large volumes of traffic, set up a separate on-premises master target server for this purpose.
    • Failback policy: To replicate back to your on-premises site, you need a failback policy. This policy is automatically created when you create a replication policy from on-premises to Azure.
  4. After the components are in place, failback occurs in three stages:

    • Stage 1: Reprotect the Azure VMs so that they replicate from Azure back to the on-premises VMware VMs.
    • Stage 2: Run a failover to the on-premises site.
    • Stage 3: After workloads have failed back, reenable replication for the on-premises VMs.

Diagram showing VMware failback from Azure.
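
The order of the failover and failback steps can be summarized as a small state machine. This is a simplified model for illustration only: the state and action names are hypothetical, and the real service may track additional states.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    REPLICATING_TO_AZURE = "replicating to Azure"
    FAILED_OVER = "failed over to Azure, not yet committed"
    COMMITTED = "running in Azure"
    REPROTECTED = "replicating back to on-premises"
    FAILED_BACK = "running on-premises, not yet re-replicating"


# Allowed (state, action) pairs, in the order the steps are described.
TRANSITIONS = {
    (State.REPLICATING_TO_AZURE, "failover"): State.FAILED_OVER,
    (State.FAILED_OVER, "commit"): State.COMMITTED,
    (State.COMMITTED, "reprotect"): State.REPROTECTED,                         # stage 1
    (State.REPROTECTED, "failover_to_on_premises"): State.FAILED_BACK,         # stage 2
    (State.FAILED_BACK, "reenable_replication"): State.REPLICATING_TO_AZURE,   # stage 3
}


def apply(state: State, action: str) -> State:
    """Return the next state, or raise if the action is out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid while {state.value}") from None


def run(actions: Iterable[str],
        start: State = State.REPLICATING_TO_AZURE) -> State:
    """Apply a sequence of actions and return the final state."""
    state = start
    for action in actions:
        state = apply(state, action)
    return state


if __name__ == "__main__":
    full_cycle = ["failover", "commit", "reprotect",
                  "failover_to_on_premises", "reenable_replication"]
    print(run(full_cycle))
```

A full cycle ends where it started, with the on-premises VMs replicating to Azure again, and an out-of-order action such as reprotecting before failover is rejected.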

Next steps

Follow this tutorial to enable VMware to Azure replication.