Understand a system reboot for an Azure VM

Azure virtual machines (VMs) might sometimes reboot for no apparent reason, with no evidence that you initiated the reboot operation. This article lists the actions and events that can cause VMs to reboot and provides insight into how to avoid unexpected reboots or reduce their impact.

Configure the VMs for high availability

The best way to protect an application that's running on Azure against VM reboots and downtime is to configure the VMs for high availability.

To provide this level of redundancy to your application, we recommend that you group two or more VMs in an availability set. This configuration ensures that during either a planned or unplanned maintenance event, at least one VM is available and meets the 99.95 percent Azure SLA.
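To make the 99.95 percent figure concrete, the following minimal sketch converts an availability percentage into a downtime budget. The 30-day period and the function name are assumptions made for illustration; the way the SLA is actually measured is defined by the SLA terms, not by this calculation.

```python
# Downtime budget implied by an availability percentage.
# Assumption: a 30-day period; the real measurement window is set by the SLA terms.

def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the maximum downtime, in minutes, that still meets the percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.95)
    print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions the budget is roughly 21.6 minutes per 30 days, which is why a single VM that reboots for maintenance can consume a large share of it.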

For more information about availability sets, see Manage the availability of VMs.

Resource Health information

Azure Resource Health is a service that exposes the health of individual Azure resources and provides actionable guidance for troubleshooting problems. In a cloud environment where it isn't possible to directly access servers or infrastructure elements, the goal of Resource Health is to reduce the time that you spend on troubleshooting. In particular, the aim is to reduce the time that you spend determining whether the root of the problem lies in the application or in an event inside the Azure platform. For more information, see Understand and use Resource Health.

Actions and events that can cause the VM to reboot

Planned maintenance

Azure periodically performs updates across China to improve the reliability, performance, and security of the host infrastructure that underlies VMs. Many of these updates, including memory-preserving updates, are performed without any impact on your VMs or cloud services.

However, some updates do require a reboot. In such cases, the VMs are shut down while we patch the infrastructure, and then the VMs are restarted.

To understand what Azure planned maintenance is and how it can affect the availability of your Linux VMs, see the articles listed here. The articles provide background about the Azure planned maintenance process and how to schedule planned maintenance to further reduce the impact.

Memory-preserving updates

For this class of updates in Azure, users experience no impact on their running VMs. Many of these updates are to components or services that can be updated without interfering with the running instance. Some are platform infrastructure updates on the host operating system that can be applied without a reboot of the VMs.

These memory-preserving updates are accomplished with technology that enables in-place live migration. While the update is applied, the VM is placed in a paused state. This state preserves the memory in RAM while the underlying host operating system receives the necessary updates and patches. The VM is resumed within 30 seconds of being paused. After the VM is resumed, its clock is automatically synchronized.

Because of the short pause period, deploying updates through this mechanism greatly reduces the impact on the VMs. However, not all updates can be deployed in this way.

Multi-instance updates (for VMs in an availability set) are applied one update domain at a time.
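The effect of updating one update domain at a time can be illustrated with a small model. This is a sketch under stated assumptions, not a description of how the platform places VMs: the round-robin assignment, the function names, and the VM names are all hypothetical.

```python
from collections import defaultdict


def assign_update_domains(vm_names, domain_count):
    """Spread VMs across update domains round-robin (an illustrative assumption)."""
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % domain_count].append(name)
    return dict(domains)


def rolling_update(domains):
    """Yield (domain, vms_being_updated, vms_still_available), one domain at a time."""
    all_vms = [vm for vms in domains.values() for vm in vms]
    for domain in sorted(domains):
        updating = domains[domain]
        available = [vm for vm in all_vms if vm not in updating]
        yield domain, updating, available


if __name__ == "__main__":
    domains = assign_update_domains(["vm0", "vm1", "vm2", "vm3"], domain_count=2)
    for domain, updating, available in rolling_update(domains):
        print(f"update domain {domain}: updating {updating}, available {available}")
```

As long as the VMs span at least two update domains, every step of the loop leaves at least one VM available, which is the property the availability set is meant to provide.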

Note

Linux machines that run old kernel versions can experience a kernel panic during this update method. To avoid this issue, update to kernel version 3.10.0-327.10.1 or later. For more information, see An Azure Linux VM on a 3.10-based kernel panics after a host node upgrade.
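A quick way to check whether a kernel release string meets the version named in the note is to compare its leading numeric components. The sketch below is a simplified comparison and an assumption of this rewrite, not a full distribution-aware version check; the function names are hypothetical.

```python
import re

# Minimum kernel version from the note above: 3.10.0-327.10.1
MINIMUM_KERNEL = (3, 10, 0, 327, 10, 1)


def kernel_version_tuple(release: str) -> tuple:
    """Extract the leading numeric components of a kernel release string.

    Parsing stops at the first non-numeric token, such as "el7" or "azure".
    """
    parts = []
    for token in re.split(r"[.-]", release):
        if not token.isdigit():
            break
        parts.append(int(token))
    return tuple(parts)


def meets_minimum_kernel(release: str, minimum: tuple = MINIMUM_KERNEL) -> bool:
    """Return True if the release is at or above the minimum (tuple comparison)."""
    return kernel_version_tuple(release) >= minimum


if __name__ == "__main__":
    import platform

    release = platform.release()  # on Linux, the running kernel release
    print(release, "meets minimum:", meets_minimum_kernel(release))
```

On a Linux VM, `platform.release()` returns the same string as `uname -r`, so the script can be run directly inside the guest.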

User-initiated reboot or shutdown actions

If you perform a reboot from the Azure portal, Azure PowerShell, command-line interface, or REST API, you can find the event in the Azure Activity Log.
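If you export Activity Log entries as JSON, a short filter can pick out the user-initiated power operations. The sketch below assumes each record is a dictionary with `operationName`, `caller`, and `eventTimestamp` fields, and that `operationName` is either a string or an object with a `value` key; the exact shape depends on how the log was exported, and the sample records are hypothetical.

```python
# Operation names for VM power actions (compared case-insensitively).
REBOOT_RELATED_OPERATIONS = {
    "microsoft.compute/virtualmachines/restart/action",
    "microsoft.compute/virtualmachines/poweroff/action",
    "microsoft.compute/virtualmachines/deallocate/action",
    "microsoft.compute/virtualmachines/start/action",
    "microsoft.compute/virtualmachines/redeploy/action",
}


def user_initiated_power_events(records):
    """Return (timestamp, caller, operation) for records that match a power action."""
    matches = []
    for record in records:
        operation = record.get("operationName", "")
        if isinstance(operation, dict):  # some exports nest the name under "value"
            operation = operation.get("value", "")
        operation = operation.lower()
        if operation in REBOOT_RELATED_OPERATIONS:
            matches.append((record.get("eventTimestamp"), record.get("caller"), operation))
    return matches


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        {"eventTimestamp": "2024-01-01T10:00:00Z", "caller": "admin@example.com",
         "operationName": {"value": "Microsoft.Compute/virtualMachines/restart/action"}},
        {"eventTimestamp": "2024-01-01T11:00:00Z", "caller": "admin@example.com",
         "operationName": "Microsoft.Compute/virtualMachines/write"},
    ]
    for event in user_initiated_power_events(sample):
        print(event)
```

An empty result for the time of the reboot suggests that the reboot was not triggered through the management plane, which points toward the other causes described in this article.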

If you perform the action from the VM's operating system, you can find the event in the system logs.

Other scenarios that usually cause the VM to reboot include several configuration-change actions. You'll ordinarily see a warning message indicating that a particular action will result in a reboot of the VM. Examples include any VM resize operation, changing the password of the administrative account, and setting a static IP address.

Azure Security Center and Windows Update

Azure Security Center monitors Windows and Linux VMs daily for missing operating-system updates. Security Center retrieves a list of available security and critical updates from Windows Update or Windows Server Update Services (WSUS), depending on which service is configured on a Windows VM. Security Center also checks for the latest updates for Linux systems. If your VM is missing a system update, Security Center recommends that you apply system updates. The application of these system updates is controlled through Security Center in the Azure portal. After you apply some updates, VM reboots might be required. For more information, see Apply system updates in Azure Security Center.

As with on-premises servers, Azure does not push updates from Windows Update to Windows VMs, because these machines are intended to be managed by their users. You are, however, encouraged to leave the automatic Windows Update setting enabled. Automatic installation of updates from Windows Update can also cause reboots to occur after the updates are applied. For more information, see Windows Update FAQ.

Other situations affecting the availability of your VM

There are other cases in which Azure might actively suspend the use of a VM. You'll receive email notifications before this action is taken, so you'll have a chance to resolve the underlying issues. Examples of issues that affect VM availability include security violations and the expiration of payment methods.

Host server faults

The VM is hosted on a physical server that runs inside an Azure datacenter. The physical server runs an agent called the Host Agent in addition to a few other Azure components. When these Azure software components on the physical server become unresponsive, the monitoring system triggers a reboot of the host server to attempt recovery. The VM is usually available again within five minutes and remains on the same host as before.

Server faults are usually caused by hardware failure, such as the failure of a hard disk or solid-state drive. Azure continuously monitors these occurrences, identifies the underlying bugs, and rolls out updates after the mitigation has been implemented and tested.

Because some host server faults can be specific to that server, a repeated VM reboot situation might be improved by manually redeploying the VM to another host server. This operation can be triggered by using the redeploy option on the details page of the VM, or by stopping and restarting the VM in the Azure portal.

Auto-recovery

If the host server cannot reboot for any reason, the Azure platform initiates an auto-recovery action to take the faulty host server out of rotation for further investigation.

All VMs on that host are automatically relocated to a different, healthy host server. This process is usually complete within 15 minutes. To learn more about the auto-recovery process, see Auto-recovery of VMs.

Unplanned maintenance

On rare occasions, the Azure operations team might need to perform maintenance activities to ensure the overall health of the Azure platform. This maintenance might affect VM availability, and it usually results in the same auto-recovery action as described earlier.

Unplanned maintenance includes the following:

  • Urgent node defragmentation
  • Urgent network switch updates

VM crashes

VMs might restart because of issues within the VM itself. The workload or role that's running on the VM might trigger a bug check within the guest operating system. For help determining the reason for the crash, view the system and application logs for Windows VMs, and the serial logs for Linux VMs.

VMs in Azure rely on virtual disks for operating system and data storage, and these disks are hosted on the Azure Storage infrastructure. Whenever the availability of the associated virtual disks, or the connectivity between the VM and those disks, is affected for more than 120 seconds, the Azure platform performs a forced shutdown of the VMs to avoid data corruption. The VMs are automatically powered back on after storage connectivity has been restored.

The duration of the shutdown can be as short as five minutes but can be significantly longer. The following is one of the specific cases that is associated with storage-related forced shutdowns:

Exceeding I/O limits

VMs might be temporarily shut down when I/O requests are consistently throttled because the volume of I/O operations per second (IOPS) exceeds the I/O limits for the disk. (Standard disk storage is limited to 500 IOPS.) To mitigate this issue, use disk striping or configure the storage space inside the guest VM, depending on the workload.
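As a rough sizing aid for disk striping, the following sketch estimates how many standard disks a striped volume needs to stay under the 500 IOPS per-disk limit mentioned above. It assumes that I/O spreads evenly across the stripe and ignores any VM-level limits; the function name and the example workload are hypothetical.

```python
import math

# Per-disk limit for standard disk storage, as stated in the text.
STANDARD_DISK_IOPS_LIMIT = 500


def disks_needed_for_iops(target_iops: int, per_disk_limit: int = STANDARD_DISK_IOPS_LIMIT) -> int:
    """Return the minimum number of striped disks whose combined limit covers the target.

    Assumes I/O is distributed evenly across all disks in the stripe.
    """
    if target_iops <= 0 or per_disk_limit <= 0:
        raise ValueError("target_iops and per_disk_limit must be positive")
    return math.ceil(target_iops / per_disk_limit)


if __name__ == "__main__":
    workload_iops = 1800  # hypothetical sustained workload
    print(f"{workload_iops} IOPS needs at least {disks_needed_for_iops(workload_iops)} disks")
```

For a workload that sustains 1,800 IOPS, the estimate is four disks, because three disks would cap out at 1,500 IOPS under these assumptions.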

Other incidents

In rare circumstances, a widespread issue can affect multiple servers in an Azure datacenter. If such an issue occurs, the Azure team sends email notifications to the affected subscriptions. You can check the Azure Service Health dashboard and the Azure portal for the status of ongoing outages and past incidents.