Protect virtual machines deployed on Azure Stack

Use this article as a guide to developing a plan for protecting virtual machines (VMs) that your users deploy on Azure Stack.

To protect against data loss and unplanned downtime, you need to implement a backup-recovery or disaster-recovery plan for user applications and their data. This plan might be unique for each application, but it follows a framework established by your organization's comprehensive business continuity and disaster recovery (BC/DR) strategy.

Azure Stack infrastructure recovery

Users are responsible for protecting their VMs separately from Azure Stack's infrastructure services.

The recovery plan for Azure Stack infrastructure services does not include recovery of user VMs, storage accounts, or databases. As the application owner, you are responsible for implementing a recovery plan for your applications and data.

If the Azure Stack cloud is offline for an extended time or is permanently unrecoverable, you need to have a recovery plan in place that:

  • Ensures minimal downtime
  • Keeps critical VMs, such as database servers, running
  • Enables applications to keep servicing user requests

The operator of the Azure Stack cloud is responsible for creating a recovery plan for the underlying Azure Stack infrastructure and services. To learn more, read the article Recover from catastrophic data loss.

Considerations for IaaS VMs

The operating system installed in the IaaS VM limits which products you can use to protect the data it contains. For Windows-based IaaS VMs, you can use Azure and partner products to protect data. For Linux-based IaaS VMs, the only option is to use partner products.

Source/target combinations

Each Azure Stack cloud is deployed to one datacenter. A separate environment is required so you can recover your applications. The recovery environment can be another Azure Stack cloud in a different datacenter or the Azure public cloud. Your data sovereignty and data privacy requirements will determine the recovery environment for your application. As you enable protection for each application, you have the flexibility to choose a specific recovery option for each one. You can have applications in one subscription backing up data to another datacenter. In another subscription, you can replicate data to the Azure public cloud.

Plan your backup-recovery and disaster-recovery strategy for each application to determine the target for each application. A recovery plan will help your organization properly size the storage capacity required on-premises and project consumption in the public cloud.
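As a rough illustration of how a recovery plan feeds into storage sizing, the sketch below estimates the capacity needed to retain backups for one application. The function name, the scheme of one backup per day with a periodic full backup and incrementals in between, and all of the figures are assumptions made for this example; they are not taken from Azure Stack or from any particular backup product.

```python
def estimate_backup_storage_gb(full_backup_gb, daily_change_pct,
                               retention_days, full_interval_days=7):
    """Estimate storage for retained backups (hypothetical sizing model).

    Assumes one backup per day: every `full_interval_days`-th backup is a
    full backup, and the others are incrementals whose size is
    `daily_change_pct` percent of a full backup.
    """
    if retention_days < 1 or full_interval_days < 1:
        raise ValueError("retention_days and full_interval_days must be >= 1")

    # Number of full backups kept within the retention period (rounded up).
    fulls = -(-retention_days // full_interval_days)
    # Every other retained daily backup is an incremental.
    incrementals = retention_days - fulls
    incremental_gb = full_backup_gb * daily_change_pct / 100

    return fulls * full_backup_gb + incrementals * incremental_gb


# Example: a 500 GB application, 5% daily change, 28 days of retention.
print(estimate_backup_storage_gb(500, 5, 28))  # 2600.0
```

Summing this estimate over every protected application gives a first approximation of the capacity a backup target needs; deduplication and compression, which this sketch ignores, would lower the figure.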

| | Global Azure | Azure Stack deployed into CSP datacenter and operated by CSP | Azure Stack deployed into customer datacenter and operated by customer |
|---|---|---|---|
| Azure Stack deployed into CSP datacenter and operated by CSP | User VMs are deployed to the CSP-operated Azure Stack. User VMs are restored from backup or failed over directly to Azure. | CSP operates the primary and secondary instances of Azure Stack in their own datacenters. User VMs are restored or failed over between the two Azure Stack instances. | CSP operates Azure Stack in the primary site. Customer's datacenter is the restore or failover target. |
| Azure Stack deployed into customer datacenter and operated by customer | User VMs are deployed to the customer-operated Azure Stack. User VMs are restored from backup or failed over directly to Azure. | Customer operates Azure Stack in the primary site. CSP's datacenter is the restore or failover target. | Customer operates the primary and secondary instances of Azure Stack in their own datacenters. User VMs are restored or failed over between the two Azure Stack instances. |


Application recovery objectives

You will need to determine the amount of downtime and data loss your organization can tolerate for each application. By quantifying downtime and data loss, you can create a recovery plan that minimizes the impact of a disaster on your organization. For each application, consider:

  • Recovery time objective (RTO)
    RTO is the maximum acceptable time that an application can be unavailable after an incident. For example, an RTO of 90 minutes means that you must be able to restore the application to a running state within 90 minutes from the start of a disaster. If you have a low RTO, you might keep a second deployment continually running on standby to protect against a regional outage.
  • Recovery point objective (RPO)
    RPO is the maximum duration of data loss that is acceptable during a disaster. For example, if you store data in a single database, with no replication to other databases, and perform hourly backups, you could lose up to an hour of data.

RTO and RPO are business requirements. Conduct a risk assessment to define the application's RTO and RPO.

Another metric is mean time to recover (MTTR), which is the average time that it takes to restore the application after a failure. MTTR is an empirical value for a system. If MTTR exceeds the RTO, then a failure in the system will cause an unacceptable business disruption, because it won't be possible to restore the system within the defined RTO.
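The relationship between these metrics can be checked with a few lines of code. The following is a minimal sketch, assuming you record how long past recoveries took and that backups run at a fixed interval with no replication; the class and function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of data loss


def check_plan(objectives, backup_interval, observed_recovery_times):
    """Compare measured MTTR and the backup interval against RTO and RPO."""
    if not observed_recovery_times:
        raise ValueError("at least one observed recovery time is required")

    # MTTR is empirical: the average of the recovery times actually observed.
    mttr = sum(observed_recovery_times, timedelta()) / len(observed_recovery_times)

    # With periodic backups and no replication, a failure just before the
    # next backup loses up to one full backup interval of data.
    worst_case_data_loss = backup_interval

    return {
        "mttr": mttr,
        "meets_rto": mttr <= objectives.rto,
        "meets_rpo": worst_case_data_loss <= objectives.rpo,
    }


objectives = RecoveryObjectives(rto=timedelta(minutes=90), rpo=timedelta(hours=1))
drills = [timedelta(minutes=60), timedelta(minutes=80), timedelta(minutes=100)]
result = check_plan(objectives, timedelta(hours=1), drills)
print(result["mttr"], result["meets_rto"], result["meets_rpo"])
```

In this example the three drills average 80 minutes, which is within the 90-minute RTO, and hourly backups match the one-hour RPO. Because MTTR is an average, individual recoveries (such as the 100-minute drill) can still exceed the RTO.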

Backup-restore

The most common protection scheme for VM-based applications is to use backup software. Backing up a VM typically includes the operating system, operating system configuration, application binaries, and application data. The backups are created by taking a snapshot of the volumes, disks, or the entire VM. With Azure Stack, you have the flexibility of backing up from within the context of the guest OS or from the Azure Stack storage and compute APIs. Azure Stack does not support taking backups at the hypervisor level.


Recovering the application requires restoring one or more VMs to the same cloud or to a new cloud. You can target a cloud in your datacenter or the public cloud. The cloud you choose is completely within your control and is based on your data privacy and sovereignty requirements.

  • RTO: Downtime measured in hours
  • RPO: Variable data loss (depending on backup frequency)
  • Deployment topology: Active/passive

Planning your backup strategy

Planning your backup strategy and defining scale requirements starts with quantifying the number of VM instances that need to be protected. Backing up all VMs across all servers in an environment is a common strategy. However, with Azure Stack, there are some VMs that do not need to be backed up. For example, VMs in a scale set are considered ephemeral resources that can come and go, sometimes without notice, so they do not need to be backed up. Any durable data that needs to be protected is stored in a separate repository such as a database or object store.

Important considerations for backing up VMs on Azure Stack:

  • Categorization
    • Consider a model where users opt in to VM backup.
    • Define a recovery service level agreement (SLA) based on the priority of the applications or the impact to the business.
  • Scale
    • Consider staggered backups when onboarding a large number of new VMs (if backup is required).
    • Evaluate backup products that can efficiently capture and transmit backup data to minimize resource contention on the solution.
    • Evaluate backup products that efficiently store backup data using incremental or differential backups to minimize the need for full backups across all VMs in the environment.
  • Restore
    • Backup products can restore virtual disks, application data within an existing VM, or the entire VM resource and associated virtual disks. The restore scheme you need depends on how you plan to restore the application, and it affects the application's time to recovery. For example, it may be easier to redeploy SQL Server from a template and then restore the databases instead of restoring the entire VM or set of VMs.
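To make the idea of staggered backups concrete, the sketch below spreads backup start times for a group of VMs evenly across a backup window. The function name, the VM names, and the batching rule are assumptions for illustration; a real backup product would provide its own scheduling policy.

```python
from datetime import datetime, timedelta


def stagger_backups(vm_names, window_start, window_length, max_per_slot):
    """Assign each VM a backup start time, at most `max_per_slot` per slot.

    Only start times are staggered. If a backup runs longer than one slot,
    it can still overlap with the next batch.
    """
    if max_per_slot < 1:
        raise ValueError("max_per_slot must be >= 1")
    if not vm_names:
        return {}

    # Split the VMs into batches that start together.
    batches = [vm_names[i:i + max_per_slot]
               for i in range(0, len(vm_names), max_per_slot)]
    # Divide the window evenly between the batches.
    slot = window_length / len(batches)

    schedule = {}
    for index, batch in enumerate(batches):
        for vm in batch:
            schedule[vm] = window_start + index * slot
    return schedule


start = datetime(2024, 1, 1, 22, 0)
vms = ["vm1", "vm2", "vm3", "vm4", "vm5"]
for vm, when in stagger_backups(vms, start, timedelta(hours=3), 2).items():
    print(vm, when)
```

With five VMs, a three-hour window, and two VMs per slot, the batches start one hour apart, which limits how many backups begin competing for storage and network resources at the same moment.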

Replication/manual failover

An alternative approach to supporting high availability is to replicate your application VMs to another cloud and rely on a manual failover. The replication of the operating system, application binaries, and application data can be performed at the VM level or guest OS level. The failover is managed using additional software that is not part of the application.

With this approach, the application is deployed in one cloud and its VMs are replicated to the other cloud. If a failover is triggered, the secondary VMs need to be powered on in the second cloud. In some scenarios, the failover creates the VMs and attaches disks to them. This process can take a long time to complete, especially with a multi-tiered application that requires a specific start-up sequence. There may also be steps that must be run before the application is ready to start servicing requests.
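The start-up sequence of a multi-tiered application can be derived from the dependencies between its tiers. The following sketch uses Python's standard `graphlib` module (Python 3.9 or later) to compute an order in which to power on the secondary VMs; the tier names and the dependency map are hypothetical, and real failover software would apply its own orchestration.

```python
from graphlib import TopologicalSorter


def startup_order(depends_on):
    """Return tiers in an order where each tier follows its dependencies.

    `depends_on` maps a tier name to the set of tiers that must already be
    running before it starts. A circular dependency raises
    graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical three-tier application: web needs app, app needs database.
tiers = {
    "web": {"app"},
    "app": {"database"},
    "database": set(),
}
print(startup_order(tiers))  # ['database', 'app', 'web']
```

Keeping the dependency map next to the recovery plan makes the start-up sequence explicit, so it does not depend on what an operator remembers during a failover.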


  • RTO: Downtime measured in minutes
  • RPO: Variable data loss (depending on replication frequency)
  • Deployment topology: Active/passive stand-by

High availability/automatic failover

For applications where your business can tolerate only a few seconds or minutes of downtime and minimal data loss, you will need to consider a high-availability configuration. High-availability applications are designed to quickly and automatically recover from faults. For local hardware faults, Azure Stack infrastructure implements high availability in the physical network using two top-of-rack switches. For compute-level faults, Azure Stack uses multiple nodes in a scale unit. At the VM level, you can use scale sets in combination with fault domains to ensure node failures do not take down your application.

In combination with scale sets, your application will need to support high availability natively or support the use of clustering software. For example, Microsoft SQL Server supports high availability natively for databases using synchronous-commit mode. However, if you can only support asynchronous replication, then there will be some data loss. Applications can also be deployed into a failover cluster where the clustering software handles the automatic failover of the application.

Using this approach, the application is only active in one cloud, but the software is deployed to multiple clouds. The other clouds are in stand-by mode, ready to start the application when the failover is triggered.

  • RTO: Downtime measured in seconds
  • RPO: Minimal data loss
  • Deployment topology: Active/active stand-by

Fault tolerance

Azure Stack physical redundancy and infrastructure service availability only protect against hardware-level faults and failures, such as a disk, power supply, network port, or node failure. However, if your application must always be available and can never lose any data, you will need to implement fault tolerance natively in your application or use additional software to enable fault tolerance.

First, you need to ensure the application VMs are deployed using scale sets to protect against node-level failures. To protect against the cloud going offline, the same application must already be deployed to a different cloud, so it can continue servicing requests without interruption. This model is typically referred to as an active-active deployment.

Keep in mind that each Azure Stack cloud is independent of the others, so the clouds are always considered active from an infrastructure perspective. In this case, multiple active instances of the application are deployed to one or more active clouds.

  • RTO: No downtime
  • RPO: No data loss
  • Deployment topology: Active/active

No recovery

Some applications in your environment may not need protection against unplanned downtime or data loss. For example, VMs used for development and testing typically do not need to be recovered. It is your decision whether to go without protection for an application or a specific VM. Azure Stack does not offer backup or replication of VMs from the underlying infrastructure. As with Azure, you need to opt in to protection for each VM in each of your subscriptions.

  • RTO: Unrecoverable
  • RPO: Complete data loss
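The recovery approaches described in this article differ mainly in how much downtime they allow, so they can be summarized as a simple lookup. The sketch below is only illustrative: the function name and the one-minute and one-hour thresholds are assumptions chosen to match the "seconds", "minutes", and "hours" ranges above, not boundaries defined by Azure Stack.

```python
from datetime import timedelta


def suggest_approach(tolerable_downtime, needs_protection=True):
    """Map tolerable downtime to one of the approaches in this article.

    The thresholds are illustrative assumptions, not product limits.
    """
    if not needs_protection:
        return "No recovery"
    if tolerable_downtime <= timedelta(0):
        return "Fault tolerance"
    if tolerable_downtime < timedelta(minutes=1):
        return "High availability/automatic failover"
    if tolerable_downtime < timedelta(hours=1):
        return "Replication/manual failover"
    return "Backup-restore"


print(suggest_approach(timedelta(hours=4)))     # Backup-restore
print(suggest_approach(timedelta(minutes=15)))  # Replication/manual failover
print(suggest_approach(timedelta(seconds=30)))  # High availability/automatic failover
print(suggest_approach(timedelta(0)))           # Fault tolerance
```

In practice the choice also depends on the RPO, cost, and whether the application supports replication or clustering natively, so treat the result as a starting point for discussion and not as a decision.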

Important considerations for your Azure Stack deployment:

| | Recommendation | Comments |
|---|---|---|
| Backup/restore VMs to an external backup target already deployed in your datacenter | Recommended | Take advantage of existing backup infrastructure and operational skills. Make sure to size the backup infrastructure so it is ready to protect the additional VM instances. Make sure backup infrastructure is not in close proximity to your source. You can restore VMs to the source Azure Stack, to a secondary Azure Stack instance, or to Azure. |
| Backup/restore VMs to an external backup target dedicated to Azure Stack | Recommended | You can purchase new backup infrastructure or provision dedicated backup infrastructure for Azure Stack. Make sure backup infrastructure is not in close proximity to your source. You can restore VMs to the source Azure Stack, to a secondary Azure Stack instance, or to Azure. |
| Backup/restore VMs directly to global Azure or a trusted service provider | Recommended | As long as you can meet your data privacy and regulatory requirements, you can store your backups in global Azure or with a trusted service provider. Ideally the service provider is also running Azure Stack so you get consistency in operational experience when you restore. |
| Replicate/failover VMs to a separate Azure Stack instance | Recommended | In the failover case, you will need to have a second Azure Stack cloud fully operational so you can avoid extended application downtime. |
| Replicate/failover VMs directly to Azure or a trusted service provider | Recommended | As long as you can meet your data privacy and regulatory requirements, you can replicate your data to global Azure or to a trusted service provider. Ideally the service provider is also running Azure Stack so you get consistency in operational experience after failover. |
| Deploy backup target on the same Azure Stack cloud with your application data | Not recommended | Avoid storing backups within the same Azure Stack cloud. Unplanned downtime of the cloud can keep you from your primary data and backup data. If you choose to deploy a backup target as a virtual appliance (for the purposes of optimization for backup and restore), you must ensure all data is continuously copied to an external backup location. |
| Deploy physical backup appliance into the same rack where the Azure Stack solution is installed | Not supported | Currently, you cannot connect devices that are not part of the original solution to the top-of-rack switches. |

Next steps

This article provided general guidelines for protecting user VMs deployed on Azure Stack. For information about using Azure services to protect user VMs, refer to:

To learn more about the partner products that offer VM protection on Azure Stack, refer to "Protecting applications and data on Azure Stack."