Troubleshooting VM provisioning with cloud-init

If you have been creating generalized custom images that use cloud-init for provisioning, but have found that VMs are not created correctly, you will need to troubleshoot your custom images.

Some examples of provisioning issues:

  • The VM gets stuck at 'creating' for 40 minutes, and the VM creation is marked as failed
  • CustomData does not get processed
  • The ephemeral disk fails to mount
  • Users do not get created, or there are user access issues
  • Networking is not set up correctly
  • Swap file or partition failures

This article steps you through how to troubleshoot cloud-init. For more in-depth details, see cloud-init deep dive.

Step 1: Test the deployment without customData

Cloud-init can accept customData that is passed to it when the VM is created. First, make sure this customData is not what is causing the deployment issues: try provisioning the VM without passing in any configuration. If the VM still fails to provision, continue with the steps below. If the VM provisions but the configuration you are passing is not being applied, go to Step 4.

Step 2: Review image requirements

The primary cause of VM provisioning failure is that the OS image doesn't satisfy the prerequisites for running on Azure. Make sure your images are properly prepared before attempting to provision them in Azure.

The following articles illustrate the steps to prepare various Linux distributions that are supported in Azure:

For the supported Azure cloud-init images, the Linux distributions already have all the required packages and configurations in place to correctly provision the image in Azure. If your VM fails to create from your own curated image, try a supported Azure Marketplace image that is already configured for cloud-init, with your optional customData. If the customData works correctly with an Azure Marketplace image, then there is probably an issue with your curated image.

Step 3: Collect and review VM logs

When the VM fails to provision, Azure shows the 'creating' status for 20 minutes, then reboots the VM and waits another 20 minutes before finally marking the VM deployment as failed with an OSProvisioningTimedOut error.

To understand why provisioning failed, you need the logs from the VM, so do not stop it: keep the failed VM in a running state while you collect them. Collect the following logs:

/var/log/cloud-init*
/var/log/waagent*
/var/log/syslog*
/var/log/rsyslog*
/var/log/messages*
/var/log/kern*
/var/log/dmesg*
/var/log/boot*
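
As a rough illustration, the paths above can be bundled into a single archive for offline review. The following Python sketch is not part of any Azure or cloud-init tooling. The function name collect_logs, the archive path, and the root parameter (useful when the OS disk is mounted somewhere other than /) are all assumptions made for this example. Reading files under /var/log typically requires root privileges.

```python
import glob
import tarfile
from pathlib import Path

# Glob patterns copied from the list of log paths above.
LOG_PATTERNS = (
    "/var/log/cloud-init*",
    "/var/log/waagent*",
    "/var/log/syslog*",
    "/var/log/rsyslog*",
    "/var/log/messages*",
    "/var/log/kern*",
    "/var/log/dmesg*",
    "/var/log/boot*",
)


def collect_logs(archive_path, patterns=LOG_PATTERNS, root="/"):
    """Add every path matching the patterns to a gzip tarball.

    `root` is a hypothetical convenience: it lets the patterns be resolved
    against a mounted copy of the OS disk instead of the live filesystem.
    Returns the list of paths that were added.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for pattern in patterns:
            full_pattern = str(Path(root) / pattern.lstrip("/"))
            for path in sorted(glob.glob(full_pattern)):
                tar.add(path)  # directories are added recursively
                added.append(path)
    return added
```

A call such as collect_logs("/tmp/provisioning-logs.tar.gz") would then produce one file that can be copied off the VM.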

To start troubleshooting, begin with the cloud-init logs to understand where the failure occurred, then use the other logs to dig deeper and gain additional insight.

  • /var/log/cloud-init.log
  • /var/log/cloud-init-output.log
  • Serial/boot logs

In all logs, start by searching for "Failed", "WARNING", "WARN", "err", "error", and "ERROR". Configuring your search tool to ignore case is recommended.
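
A case-insensitive search for these keywords could look like the following Python sketch. The helper name find_suspect_lines is an assumption made for illustration. Note that the short keyword "err" also matches words such as "Stderr", so expect some noise in the results.

```python
import re

# Keywords from the text above; one case-insensitive pattern covers all spellings.
KEYWORDS = ("failed", "warning", "warn", "err", "error")
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines containing any keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]
```

Because the function accepts any iterable of lines, an open file object for /var/log/cloud-init.log can be passed to it directly.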

Tip

If you are troubleshooting a custom image, consider adding a user while building the image. If provisioning fails to set up the admin user, you can still log in to the OS.

Analyzing the logs

Here are more details about what to look for in each cloud-init log.

/var/log/cloud-init.log

By default, all cloud-init events with a priority of debug or higher are written to /var/log/cloud-init.log. This provides a verbose log of every event that occurred during cloud-init initialization.

For example:

2019-10-10 04:51:25,321 - util.py[DEBUG]: Failed mount of '/dev/sr0' as 'auto': Unexpected error while running command.
Command: ['mount', '-o', 'ro,sync', '-t', 'auto', u'/dev/sr0', '/run/cloud-init/tmp/tmpLIrklc']
Exit code: 32
Reason: -
Stdout:
Stderr: mount: unknown filesystem type 'udf'
2020-01-31 00:21:53,352 - DataSourceAzure.py[WARNING]: /dev/sr0 was not mountable
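
The log lines above follow a regular layout: timestamp, source module, level in square brackets, then the message. Assuming that layout holds for your log, a small Python sketch can split lines into fields and keep only entries at or above a chosen level. The names parse_line and at_least are hypothetical. In the excerpt above the mount failure itself is logged at DEBUG, so a level filter complements a keyword search rather than replacing it.

```python
import re

# Layout inferred from the excerpt above, for example:
# 2019-10-10 04:51:25,321 - util.py[DEBUG]: message
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<module>[^\[]+)\[(?P<level>[A-Z]+)\]: (?P<message>.*)$"
)

# Numeric ranks mirror Python's standard logging levels.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


def parse_line(line):
    """Return a dict of fields, or None for continuation lines."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def at_least(lines, minimum="WARNING"):
    """Return parsed entries whose level is `minimum` or higher."""
    threshold = LEVELS[minimum]
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and LEVELS.get(e["level"], 0) >= threshold]
```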

Once you have found an error or warning, read backwards in the cloud-init log to understand what cloud-init was attempting before it hit the error or warning. In many cases cloud-init will have run OS commands or performed provisioning operations just before the error, which can explain why the error appeared in the log. The following example shows that cloud-init attempted to mount a device right before it hit an error.

2019-10-10 04:51:24,010 - util.py[DEBUG]: Running command ['mount', '-o', 'ro,sync', '-t', 'auto', u'/dev/sr0', '/run/cloud-init/tmp/tmpXXXXX'] with allowed return codes [0] (shell=False, capture=True)
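
Reading backwards from an error can be approximated with a short Python sketch that keeps a rolling window of recent lines and returns it when a marker string first appears. The function name context_before and the default window of 20 lines are assumptions chosen for illustration.

```python
from collections import deque


def context_before(lines, marker, count=20):
    """Return up to `count` lines preceding the first line that contains
    `marker`, followed by that line. Returns [] if the marker never appears."""
    recent = deque(maxlen=count)  # rolling window of the most recent lines
    for line in lines:
        text = line.rstrip("\n")
        if marker in text:
            return list(recent) + [text]
        recent.append(text)
    return []
```

For instance, passing an open /var/log/cloud-init.log file with the marker "[WARNING]" would return the commands cloud-init ran immediately before its first warning.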

The logging for /var/log/cloud-init.log can also be reconfigured in /etc/cloud/cloud.cfg.d/05_logging.cfg. For more details about cloud-init logging, refer to the cloud-init documentation.

/var/log/cloud-init-output.log

This log captures the stdout and stderr output from the stages of cloud-init. It normally includes routing table information, networking information, ssh host key verification information, and the stdout and stderr for each stage of cloud-init, along with the timestamp for each stage. If desired, stderr and stdout logging can be reconfigured in /etc/cloud/cloud.cfg.d/05_logging.cfg.

Serial/boot logs

Cloud-init has multiple dependencies, which are documented in the required prerequisites for images on Azure, such as networking, storage, the ability to mount an ISO, and the ability to mount and format the temporary disk. Any of these may throw errors and cause cloud-init to fail. For example, if the VM cannot get a DHCP lease, cloud-init will fail.

If you still cannot isolate why cloud-init failed to provision, you need to understand what happens in each cloud-init stage and when modules run. See Diving deeper into cloud-init for more details.

Step 4: Investigate why the configuration isn't being applied

Not every failure in cloud-init results in a fatal provisioning failure. For example, if you are using the runcmd module in a cloud-init config, a non-zero exit code from the command it runs will not cause VM provisioning to fail. This is because runcmd runs after the core provisioning functionality, which happens in the first 3 stages of cloud-init. To troubleshoot why the configuration did not apply, review the logs from Step 3 and check the relevant cloud-init modules manually. For example:

  • runcmd - do the scripts run without errors? Run the commands manually from the terminal to ensure they run as expected.
  • Installing packages - does the VM have access to package repositories?
  • You should also check the customData configuration that was provided to the VM, which is located in /var/lib/cloud/instances/<unique-instance-identifier>/user-data.txt.
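
One quick check on the stored customData is whether its first line carries a header that cloud-init recognizes. The following Python sketch locates user-data.txt files under the path given above and classifies their content. The function names are hypothetical, and only two common headers are handled here ("#cloud-config" and a "#!" script line); cloud-init accepts other user-data formats that this sketch would report as unrecognized.

```python
import glob


def find_user_data(base="/var/lib/cloud/instances"):
    """Return the user-data.txt path for every instance directory under base."""
    return sorted(glob.glob(f"{base}/*/user-data.txt"))


def classify_user_data(text):
    """Classify user data by its first line.

    Only two headers are checked in this sketch; other formats that
    cloud-init supports fall through to "unrecognized".
    """
    if not text.strip():
        return "empty"
    first_line = text.splitlines()[0].rstrip()
    if first_line == "#cloud-config":
        return "cloud-config"
    if first_line.startswith("#!"):
        return "script"
    return "unrecognized"
```

A result of "empty" suggests the customData never reached the VM, while "unrecognized" for content intended as cloud-config often points to a missing or misplaced "#cloud-config" first line.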

Next steps

If you still cannot isolate why cloud-init did not run the configuration, you need to look more closely at what happens in each cloud-init stage and when modules run. See Diving deeper into cloud-init configuration for more information.