Manage the availability of Windows virtual machines in Azure

Learn ways to set up and manage multiple virtual machines to ensure high availability for your Windows application in Azure. You can also manage the availability of Linux virtual machines.

For instructions on creating and using availability sets with the classic deployment model, see How to Configure an Availability Set.

Understand VM reboots: maintenance vs. downtime

Three scenarios can lead to a virtual machine in Azure being impacted: unplanned hardware maintenance, unexpected downtime, and planned maintenance.

  • An unplanned hardware maintenance event occurs when the Azure platform predicts that the hardware, or any platform component associated with a physical machine, is about to fail. When the platform predicts a failure, it issues an unplanned hardware maintenance event to reduce the impact on the virtual machines hosted on that hardware. Azure uses Live Migration technology to migrate the virtual machines from the failing hardware to a healthy physical machine. Live Migration is a VM-preserving operation that pauses the virtual machine only for a short time. Memory, open files, and network connections are maintained, but performance might be reduced before or after the event. In cases where Live Migration cannot be used, the VM experiences unexpected downtime, as described below.

  • Unexpected downtime occurs when the hardware or the physical infrastructure for the virtual machine fails unexpectedly. This can include local network failures, local disk failures, or other rack-level failures. When such a failure is detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same datacenter. During the healing procedure, virtual machines experience downtime (a reboot) and, in some cases, loss of the temporary drive. The attached OS and data disks are always preserved.

    Virtual machines can also experience downtime in the unlikely event of an outage or disaster that affects an entire datacenter, or even an entire region.

  • Planned maintenance events are periodic updates that Azure makes to the underlying platform to improve the overall reliability, performance, and security of the infrastructure that your virtual machines run on. Most of these updates are performed without any impact on your virtual machines or cloud services (see VM Preserving Maintenance). Although the Azure platform uses VM-preserving maintenance whenever possible, in rare cases an update requires a reboot of your virtual machine before it can be applied to the underlying infrastructure. In this case, you can initiate the maintenance for your VMs in a suitable time window by using the Maintenance-Redeploy operation. For more information, see Planned Maintenance for Virtual Machines.

To reduce the impact of downtime due to one or more of these events, we recommend the following high availability best practices for your virtual machines:

Configure multiple virtual machines in an availability set for redundancy

Availability sets are another datacenter configuration that provides VM redundancy and availability. This configuration within a datacenter ensures that during either a planned or unplanned maintenance event, at least one virtual machine is available and meets the 99.95% Azure SLA. For more information, see the SLA for Virtual Machines.


Avoid leaving a single-instance virtual machine in an availability set by itself. VMs in this configuration do not qualify for an SLA guarantee and face downtime during Azure planned maintenance events, except when the single VM uses Azure premium SSDs. For single VMs that use premium SSDs, the Azure SLA applies.

The underlying Azure platform assigns each virtual machine in your availability set an update domain and a fault domain. An update domain is a group of virtual machines and underlying physical hardware that can be rebooted at the same time. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can be increased to provide up to 20 update domains). When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh into the same update domain as the second, and so on. During planned maintenance, update domains may not be rebooted in sequential order, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
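The wrap-around placement described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article, not the platform's actual placement logic; the function name `assign_update_domains` and the VM names are hypothetical.

```python
def assign_update_domains(vm_names, update_domain_count=5):
    """Map each VM to an update domain index, wrapping around after the last domain.

    Illustrative only: models the "sixth VM shares a domain with the first" rule.
    """
    if update_domain_count < 1:
        raise ValueError("update_domain_count must be at least 1")
    # The VM at 0-based position i lands in domain i % count, so with five
    # domains the sixth VM (position 5) shares domain 0 with the first VM.
    return {name: position % update_domain_count
            for position, name in enumerate(vm_names)}


vm_names = [f"vm{n}" for n in range(1, 8)]  # hypothetical names vm1..vm7
placement = assign_update_domains(vm_names)
for name, domain in placement.items():
    print(f"{name} -> update domain {domain}")
```

With seven VMs and the default of five update domains, vm6 shares a domain with vm1 and vm7 shares a domain with vm2, so a reboot of either domain takes two VMs offline at once.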

Fault domains define the group of virtual machines that share a common power source and network switch. By default, the virtual machines configured within your availability set are separated across up to three fault domains for Resource Manager deployments (two fault domains for classic deployments). Placing your virtual machines into an availability set does not protect your application from operating system or application-specific failures, but it does limit the impact of potential physical hardware failures, network outages, or power interruptions.
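To see how fault domains limit the blast radius of a hardware failure, consider the following minimal sketch. The placement of VMs across fault domains is a made-up example, and `vms_affected_by_fault` is a hypothetical helper, not an Azure API.

```python
def vms_affected_by_fault(fault_domain_by_vm, failed_domain):
    """Return the VMs that share the failed fault domain (power source or switch)."""
    return sorted(vm for vm, domain in fault_domain_by_vm.items()
                  if domain == failed_domain)


# Hypothetical placement of four VMs across three fault domains (0, 1, 2).
fault_domain_by_vm = {"web1": 0, "web2": 1, "web3": 2, "web4": 0}

for failed in sorted(set(fault_domain_by_vm.values())):
    down = vms_affected_by_fault(fault_domain_by_vm, failed)
    up = sorted(set(fault_domain_by_vm) - set(down))
    print(f"fault domain {failed} fails: down={down}, still running={up}")
```

In every case at least two VMs keep running, which is the property the availability set is meant to provide for rack-level failures.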


Use managed disks for VMs in an availability set

If you are currently using VMs with unmanaged disks, we highly recommend that you convert the VMs in the availability set to use managed disks.

Managed disks provide better reliability for availability sets by ensuring that the disks of VMs in an availability set are sufficiently isolated from each other to avoid single points of failure. This is done by automatically placing the disks in different storage fault domains (storage clusters) and aligning them with the VM fault domain. If a storage fault domain fails due to a hardware or software failure, only the VM instances with disks in that storage fault domain fail.


The number of fault domains for managed availability sets varies by region. Each region listed here provides two. The following table shows the number per region.

Number of fault domains per region

Region          Max # of Fault Domains
China East      2
China North     2
China East 2    2
China North 2   2

If you plan to use VMs with unmanaged disks, follow the best practices below for the storage accounts in which the virtual hard disks (VHDs) of the VMs are stored as page blobs.

  1. Keep all disks (OS and data) associated with a VM in the same storage account.
  2. Review the limits on the number of unmanaged disks in a storage account before adding more VHDs to it.
  3. Use a separate storage account for each VM in an availability set. Do not share storage accounts among multiple VMs in the same availability set. It is acceptable for VMs in different availability sets to share storage accounts if the above best practices are followed.
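Rules 1 and 3 above lend themselves to an automated check. The following is a minimal sketch that works on a plain inventory list. The field names (`name`, `availability_set`, `disk_accounts`) and all resource names are hypothetical, and the sketch does not check the per-account disk limits from rule 2.

```python
from collections import defaultdict


def check_unmanaged_disk_layout(vms):
    """Return a list of human-readable violations of rules 1 and 3.

    Each VM is a dict with hypothetical keys: name, availability_set, and
    disk_accounts (the storage account of every OS and data disk).
    """
    problems = []
    users = defaultdict(set)  # (availability set, storage account) -> VM names
    for vm in vms:
        accounts = set(vm["disk_accounts"])
        # Rule 1: all disks of one VM belong in a single storage account.
        if len(accounts) > 1:
            problems.append(
                f"{vm['name']}: disks span storage accounts {sorted(accounts)}")
        for account in accounts:
            users[(vm["availability_set"], account)].add(vm["name"])
    # Rule 3: within one availability set, no account is shared between VMs.
    for (avset, account), names in sorted(users.items()):
        if len(names) > 1:
            problems.append(
                f"{avset}: storage account {account} is shared by {sorted(names)}")
    return problems


vms = [
    {"name": "web1", "availability_set": "web-avset", "disk_accounts": ["stweb1", "stweb1"]},
    {"name": "web2", "availability_set": "web-avset", "disk_accounts": ["stweb1"]},
    {"name": "sql1", "availability_set": "sql-avset", "disk_accounts": ["stsql1", "stsql1b"]},
    {"name": "sql2", "availability_set": "sql-avset", "disk_accounts": ["stweb1"]},
]
problems = check_unmanaged_disk_layout(vms)
for problem in problems:
    print(problem)
```

In this sample, sql2 reusing stweb1 is not reported, because sharing a storage account across different availability sets is acceptable under the practices above.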

Use scheduled events to proactively respond to VM-impacting events

When you subscribe to scheduled events, your VM is notified about upcoming maintenance events that can impact it. When scheduled events are enabled, your virtual machine is given a minimum amount of time before the maintenance activity is performed. For example, host OS updates that might impact your VM are queued up as events that specify the impact, as well as the time at which the maintenance will be performed if no action is taken. Scheduled events are also queued up when Azure detects an imminent hardware failure that might impact your VM, which allows you to decide when the healing should be performed. You can use the event to perform tasks before the maintenance, such as saving state or failing over to a secondary VM. After you complete your logic for gracefully handling the maintenance event, you can approve the outstanding scheduled event to allow the platform to proceed with maintenance.
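The poll, handle, and approve flow described above might look like the following Python sketch. It assumes the Azure Instance Metadata Service scheduled events endpoint and the API version shown in `SCHEDULED_EVENTS_URL`, and it assumes the response and approval JSON shapes (`Events`, `EventId`, `Resources`, `StartRequests`). Verify all of these against the current Scheduled Events documentation before relying on them. The sample document and helper names are hypothetical, and the two network functions are defined but not called here, because they only work from inside an Azure VM.

```python
import json
import urllib.request

# Assumption: endpoint and api-version of the Instance Metadata Service.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """GET the scheduled events document (only reachable from inside a VM)."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def events_for_vm(document, vm_name):
    """Select the events whose Resources list names this VM."""
    return [event for event in document.get("Events", [])
            if vm_name in event.get("Resources", [])]


def build_approval(event_ids):
    """Build the request body that approves (starts) the given events."""
    body = {"StartRequests": [{"EventId": event_id} for event_id in event_ids]}
    return json.dumps(body).encode("utf-8")


def approve_events(event_ids, url=SCHEDULED_EVENTS_URL, timeout=5):
    """POST the approval so the platform can proceed with maintenance."""
    request = urllib.request.Request(
        url,
        data=build_approval(event_ids),
        headers={"Metadata": "true", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical document in the assumed response shape, for illustration only.
sample_document = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "event-1", "EventType": "Reboot", "Resources": ["web1"],
         "EventStatus": "Scheduled", "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT"},
        {"EventId": "event-2", "EventType": "Freeze", "Resources": ["web2"],
         "EventStatus": "Scheduled", "NotBefore": "Mon, 19 Sep 2016 18:45:00 GMT"},
    ],
}

pending = events_for_vm(sample_document, "web1")
for event in pending:
    print(f"{event['EventType']} not before {event['NotBefore']}")
approval_body = build_approval([event["EventId"] for event in pending])
print(approval_body.decode("utf-8"))
```

On a real VM, you would call `fetch_scheduled_events()` on a timer, run your own save-state or failover logic for each pending event, and then call `approve_events(...)` with the handled event IDs.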

Configure each application tier into separate availability sets

If your virtual machines are all nearly identical and serve the same purpose for your application, we recommend that you configure an availability set for each tier of your application. If you place two different tiers in the same availability set, all virtual machines in the same application tier can be rebooted at once. By configuring at least two virtual machines in an availability set for each tier, you ensure that at least one virtual machine in each tier is available.

For example, you could put all the virtual machines in the front end of your application, running IIS, Apache, or Nginx, in a single availability set. Make sure that only front-end virtual machines are placed in that availability set. Similarly, make sure that only data-tier virtual machines, such as your replicated SQL Server or MySQL virtual machines, are placed in their own availability set.
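The per-tier layout can be planned from a simple inventory before any resources are created. The following sketch groups VMs by tier into one availability set each and flags any set with fewer than two VMs. The naming scheme `<tier>-avset`, the tier labels, and the VM names are all hypothetical.

```python
from collections import defaultdict


def plan_availability_sets(tier_by_vm):
    """Group VMs into one availability set per application tier."""
    plan = defaultdict(list)
    for vm, tier in sorted(tier_by_vm.items()):
        plan[f"{tier}-avset"].append(vm)  # hypothetical naming scheme
    return dict(plan)


def understaffed_sets(plan, minimum=2):
    """Return the availability sets that have fewer than `minimum` VMs."""
    return sorted(name for name, vms in plan.items() if len(vms) < minimum)


# Hypothetical inventory: two front-end VMs, two data VMs, one cache VM.
tier_by_vm = {
    "iis1": "frontend", "iis2": "frontend",
    "sql1": "data", "sql2": "data",
    "cache1": "cache",
}
plan = plan_availability_sets(tier_by_vm)
for name, vms in sorted(plan.items()):
    print(f"{name}: {vms}")
print("needs more VMs:", understaffed_sets(plan))
```

Here the cache tier is flagged because a single VM in an availability set cannot stay available while its update domain is rebooted.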


Combine a load balancer with availability sets

Combine the Azure Load Balancer with an availability set to get the most application resiliency. The Azure Load Balancer distributes traffic between multiple virtual machines. For our Standard tier virtual machines, the Azure Load Balancer is included. Not all virtual machine tiers include the Azure Load Balancer. For more information about load balancing your virtual machines, see Load Balancing virtual machines.

If the load balancer is not configured to balance traffic across multiple virtual machines, then any planned maintenance event affects the only traffic-serving virtual machine, causing an outage for your application tier. Placing multiple virtual machines of the same tier under the same load balancer and availability set enables traffic to be continuously served by at least one instance.
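The following sketch makes this guarantee concrete. It assumes that planned maintenance reboots one update domain at a time, as described earlier in this article, and computes how many VMs behind the load balancer remain in service in the worst case. The function name and the sample placements are hypothetical.

```python
def min_instances_in_service(update_domain_by_vm):
    """Worst-case number of VMs still serving traffic while one domain reboots."""
    if not update_domain_by_vm:
        return 0
    domains = list(update_domain_by_vm.values())
    total = len(domains)
    # Rebooting a domain removes every VM placed in it from the backend pool.
    return min(total - domains.count(domain) for domain in set(domains))


single_vm = {"web1": 0}
two_vms = {"web1": 0, "web2": 1}
three_vms = {"web1": 0, "web2": 1, "web3": 0}

print(min_instances_in_service(single_vm))
print(min_instances_in_service(two_vms))
print(min_instances_in_service(three_vms))
```

A result of zero means that the tier has an outage during maintenance. Any value of one or more means that the load balancer always has at least one instance to send traffic to.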

Next steps

To learn more about load balancing your virtual machines, see Load Balancing virtual machines.