Manage the availability of Linux virtual machines

Learn how to set up and manage multiple virtual machines to ensure high availability for your Linux application in Azure. You can also manage the availability of Windows virtual machines.

Understand VM reboots: maintenance vs. downtime

Three scenarios can lead to a virtual machine in Azure being impacted: unplanned hardware maintenance, unexpected downtime, and planned maintenance.

  • An unplanned hardware maintenance event occurs when the Azure platform predicts that the hardware, or any platform component associated with a physical machine, is about to fail. When the platform predicts a failure, it issues an unplanned hardware maintenance event to reduce the impact on the virtual machines hosted on that hardware. Azure uses Live Migration technology to migrate the virtual machines from the failing hardware to a healthy physical machine. Live Migration is a VM-preserving operation that pauses the virtual machine only for a short time. Memory, open files, and network connections are maintained, but performance might be reduced before or after the event. In cases where Live Migration cannot be used, the VM experiences unexpected downtime, as described below.

  • Unexpected downtime occurs when the hardware or the physical infrastructure for the virtual machine fails unexpectedly. Such failures can include local network failures, local disk failures, or other rack-level failures. When a failure is detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same datacenter. During the healing procedure, virtual machines experience downtime (a reboot) and, in some cases, loss of the temporary drive. The attached OS and data disks are always preserved.

    Virtual machines can also experience downtime in the unlikely event of an outage or disaster that affects an entire datacenter, or even an entire region. For these scenarios, Azure provides protection options, including paired regions.

  • Planned maintenance events are periodic updates made by Azure to the underlying platform to improve the overall reliability, performance, and security of the platform infrastructure that your virtual machines run on. Most of these updates are performed without any impact on your virtual machines or cloud services (see VM Preserving Maintenance). Although the Azure platform attempts to use VM-preserving maintenance whenever possible, in rare cases these updates require a reboot of your virtual machine to apply the required updates to the underlying infrastructure. In this case, you can perform Azure planned maintenance with the Maintenance-Redeploy operation by initiating the maintenance for your VMs in a suitable time window. For more information, see Planned Maintenance for Virtual Machines.

To reduce the impact of downtime caused by one or more of these events, we recommend the following high availability best practices for your virtual machines:

Configure multiple virtual machines in an availability set for redundancy

An availability set is a datacenter configuration that provides VM redundancy and availability. This configuration within a datacenter ensures that during either a planned or an unplanned maintenance event, at least one virtual machine is available and meets the 99.95% Azure SLA. For more information, see the SLA for Virtual Machines.

Important

A single-instance virtual machine in an availability set by itself should use Premium SSD for all operating system disks and data disks in order to qualify for the SLA for virtual machine connectivity of at least 99.9%.

A single-instance virtual machine with a Standard SSD has an SLA of at least 99.5%, while a single-instance virtual machine with a Standard HDD has an SLA of at least 95%. See the SLA for Virtual Machines.
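To make the SLA percentages above easier to compare, the following minimal sketch converts each one into the longest downtime it would tolerate. It assumes a 30-day window purely for illustration; the function name and tier labels are hypothetical, and the actual SLA terms define how availability is measured.

```python
def max_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the longest downtime, in minutes, that still meets the SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# SLA figures quoted in the text above; the labels are shorthand for this sketch.
SLA_TIERS = {
    "two or more VMs in an availability set": 99.95,
    "single VM, Premium SSD": 99.9,
    "single VM, Standard SSD": 99.5,
    "single VM, Standard HDD": 95.0,
}

for label, sla in SLA_TIERS.items():
    minutes = max_downtime_minutes(sla)
    print(f"{label}: {sla}% allows up to {minutes:.1f} minutes per 30 days")
```

Under the 30-day assumption, 99.95% corresponds to roughly 21.6 minutes, while 95% corresponds to 2,160 minutes (36 hours), which is why the disk type matters so much for a single-instance VM.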

Each virtual machine in your availability set is assigned an update domain and a fault domain by the underlying Azure platform. For a given availability set, five non-user-configurable update domains are assigned by default (Resource Manager deployments can be increased to provide up to 20 update domains). Update domains indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. When more than five virtual machines are configured within a single availability set, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh into the same update domain as the second virtual machine, and so on. The update domains might not be rebooted in sequential order during planned maintenance, but only one update domain is rebooted at a time. A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
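The wrap-around placement described above can be sketched as a simple modulo calculation. This is an illustrative model only: the function names and VM names are hypothetical, domains are numbered from 0 here, and the real placement is decided by the Azure platform.

```python
def assign_update_domains(vm_names, update_domain_count=5):
    """Map each VM, in deployment order, to an update domain index (0-based)."""
    return {name: i % update_domain_count for i, name in enumerate(vm_names)}


def vms_still_running(assignment, rebooting_domain):
    """Return the VMs that stay up while one update domain is rebooted."""
    return [vm for vm, domain in assignment.items() if domain != rebooting_domain]


# Seven hypothetical VMs deployed into one availability set with the default
# of five update domains.
vms = [f"vm{i}" for i in range(1, 8)]
assignment = assign_update_domains(vms)

for domain in range(5):
    running = vms_still_running(assignment, domain)
    print(f"rebooting update domain {domain}: {len(running)} VMs still running")
```

In this model the sixth VM shares an update domain with the first and the seventh with the second, and because only one update domain is rebooted at a time, at least five of the seven VMs remain running during each step.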

Fault domains define the group of virtual machines that share a common power source and network switch. By default, the virtual machines configured within your availability set are separated across up to three fault domains for Resource Manager deployments (two fault domains for Classic). Although placing your virtual machines into an availability set does not protect your application from operating system or application-specific failures, it does limit the impact of potential physical hardware failures, network outages, or power interruptions.

Conceptual diagram of the update domain and fault domain configuration

Use managed disks for VMs in an availability set

If you are currently using VMs with unmanaged disks, we highly recommend that you convert the VMs in the availability set to use managed disks.

Managed disks provide better reliability for availability sets by ensuring that the disks of VMs in an availability set are sufficiently isolated from each other to avoid single points of failure. They do this by automatically placing the disks in different storage fault domains (storage clusters) and aligning them with the VM fault domain. If a storage fault domain fails because of a hardware or software failure, only the VM instances with disks in that storage fault domain fail.

Figure: Managed disks FD

Important

The number of fault domains for managed availability sets is currently fixed by region in China: two per region. You can view the fault domain count for each region by running the following scripts.

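# Azure PowerShell: list the 'Aligned' availability set SKU for each region
Get-AzComputeResourceSku | Where-Object { $_.ResourceType -eq 'availabilitySets' -and $_.Name -eq 'Aligned' }
# Azure CLI: show each region with its maximum fault domain count as a table
az vm list-skus --resource-type availabilitySets --query '[?name==`Aligned`].{Location:locationInfo[0].location, MaximumFaultDomainCount:capabilities[0].value}' -o Table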

Note

Under certain circumstances, two VMs in the same availability set could share the same fault domain. You can confirm this by going to your availability set and checking the Fault Domain column. This can be caused by the following sequence while deploying the VMs:

  • Deploy the first VM.
  • Stop/deallocate the first VM.
  • Deploy the second VM.

Under these circumstances, the OS disk of the second VM might be created in the same fault domain as the first VM, so the second VM also lands in the same fault domain. To avoid this issue, we recommend that you do not stop/deallocate VMs between deployments.

If you plan to use VMs with unmanaged disks, follow the best practices below for storage accounts where the virtual hard disks (VHDs) of VMs are stored as page blobs.

  1. Keep all disks (OS and data) associated with a VM in the same storage account.
  2. Review the limits on the number of unmanaged disks in an Azure Storage account before adding more VHDs to a storage account.
  3. Use a separate storage account for each VM in an availability set. Do not share storage accounts among multiple VMs in the same availability set. It is acceptable for VMs in different availability sets to share storage accounts, as long as the preceding best practices are followed.

Figure: Managed disks FD
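Two of the storage account practices above (one account per VM, and no account shared inside an availability set) lend themselves to an automated check. The sketch below works on a hand-written inventory; the `Disk` record, the function name, and all resource names are hypothetical, and it does not query Azure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    vm: str
    availability_set: str
    storage_account: str


def find_layout_violations(disks):
    """Return human-readable violations of the unmanaged-disk practices."""
    accounts_by_vm = defaultdict(set)
    vms_by_set_and_account = defaultdict(set)
    for disk in disks:
        accounts_by_vm[disk.vm].add(disk.storage_account)
        vms_by_set_and_account[(disk.availability_set, disk.storage_account)].add(disk.vm)

    violations = []
    # Practice 1: all disks of a VM belong in one storage account.
    for vm, accounts in sorted(accounts_by_vm.items()):
        if len(accounts) > 1:
            violations.append(f"{vm}: disks spread across {sorted(accounts)}")
    # Practice 3: VMs in the same availability set must not share an account.
    for (aset, account), vms in sorted(vms_by_set_and_account.items()):
        if len(vms) > 1:
            violations.append(f"{account}: shared by {sorted(vms)} in {aset}")
    return violations


good_layout = [
    Disk("web1", "web-as", "webstore1"),
    Disk("web2", "web-as", "webstore2"),
    Disk("db1", "db-as", "webstore1"),  # sharing across availability sets is acceptable
]
bad_layout = [
    Disk("web1", "web-as", "webstore1"),
    Disk("web2", "web-as", "webstore1"),  # same set, same account
]

print(find_layout_violations(good_layout))
print(find_layout_violations(bad_layout))
```

The check deliberately allows `db1` to reuse `webstore1`, because the guidance only forbids sharing within a single availability set. The per-account disk limit in practice 2 is not modeled here.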

Use scheduled events to proactively respond to VM-impacting events

When you subscribe to scheduled events, your VM is notified about upcoming maintenance events that can impact it. When scheduled events are enabled, your virtual machine is given a minimum amount of time before the maintenance activity is performed. For example, host OS updates that might impact your VM are queued up as events that specify the impact, as well as the time at which the maintenance will be performed if no action is taken. Scheduled events are also queued up when Azure detects an imminent hardware failure that might impact your VM, which lets you decide when the healing should be performed. You can use the event to perform tasks before the maintenance, such as saving state or failing over to a secondary VM. After you complete your logic for gracefully handling the maintenance event, you can approve the outstanding scheduled event to allow the platform to proceed with maintenance.
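The poll, handle, and approve flow described above might look roughly like the following sketch. The endpoint URL, API version, `Metadata` header, and JSON field names (`Events`, `EventId`, `Resources`, `StartRequests`) are assumptions based on the Azure Instance Metadata Service scheduled events interface and are not stated in this article; verify them against the current service documentation. The function names and the sample document are hypothetical.

```python
import json
import urllib.request

# Assumption: Instance Metadata Service endpoint for scheduled events,
# reachable only from inside an Azure VM.
ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(timeout=5.0):
    """Query the metadata service for pending events (run on the VM itself)."""
    request = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def events_for_vm(document, vm_name):
    """Keep only the events whose affected resources include this VM."""
    return [e for e in document.get("Events", []) if vm_name in e.get("Resources", [])]


def build_approval(event_ids):
    """Build the request body that approves (starts) the given events."""
    body = {"StartRequests": [{"EventId": event_id} for event_id in event_ids]}
    return json.dumps(body).encode("utf-8")


def approve_events(event_ids, timeout=5.0):
    """Tell the platform it may proceed with the listed events."""
    request = urllib.request.Request(
        ENDPOINT,
        data=build_approval(event_ids),
        headers={"Metadata": "true", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical document, shaped like the assumed response, for offline testing.
sample_document = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "A1", "EventType": "Reboot", "EventStatus": "Scheduled",
         "Resources": ["web-vm-1"]},
    ],
}
pending = events_for_vm(sample_document, "web-vm-1")
print([event["EventId"] for event in pending])
```

On a real VM, the application would call `fetch_scheduled_events()` periodically, save state or fail over for each event returned by `events_for_vm`, and then call `approve_events` with the handled event IDs. Keeping the filtering and payload construction in pure functions makes that logic testable without network access.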

Combine a load balancer with availability sets

Combine the Azure Load Balancer with an availability set to get the most application resiliency. The Azure Load Balancer distributes traffic among multiple virtual machines. For Standard tier virtual machines, the Azure Load Balancer is included. Not all virtual machine tiers include the Azure Load Balancer. For more information about load balancing your virtual machines, see Load Balancing virtual machines.

If the load balancer is not configured to balance traffic across multiple virtual machines, any planned maintenance event affects the only traffic-serving virtual machine and causes an outage in your application tier. Placing multiple virtual machines of the same tier under the same load balancer and availability set ensures that traffic is continuously served by at least one instance.

Next steps

To learn more about load balancing your virtual machines, see Load Balancing virtual machines.