Troubleshoot environment image builds

Learn how to troubleshoot issues with Docker environment image builds and package installations.

Prerequisites

Docker image build failures

For most image build failures, you'll find the root cause in the image build log. The image build log (20_image_build_log.txt) is available in the Azure Machine Learning portal and in your Azure Container Registry task run logs.
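As a rough illustration of reading the build log, the following sketch scans log text for the error signatures quoted in this article and reports which section each one points to. The category labels, the function name, and the assumption that the log has been downloaded to the current directory are illustrative and not part of any Azure tooling.

```python
from pathlib import Path

# Substrings copied from the error messages quoted in this article, mapped to
# the section that discusses them. The category labels are illustrative.
KNOWN_SIGNATURES = {
    "ECONNRESET": "timeout",
    "Read timed out": "timeout",
    "ResolvePackageNotFound": "package not found",
    "No matching distribution found": "package not found",
    "Cannot uninstall": "pip package update",
    "DO NOT MATCH THE HASHES": "installer issues",
}


def classify_build_log(log_text):
    """Return (line_number, category, line) for each line with a known signature."""
    findings = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for signature, category in KNOWN_SIGNATURES.items():
            if signature in line:
                findings.append((number, category, line.strip()))
                break  # one category per line is enough
    return findings


if __name__ == "__main__":
    # Assumption: the log was downloaded to the current working directory.
    log_path = Path("20_image_build_log.txt")
    if log_path.exists():
        text = log_path.read_text(errors="replace")
        for number, category, line in classify_build_log(text):
            print(f"{number}: [{category}] {line}")
```

A match only narrows down where to look. The surrounding log lines usually name the exact package and version involved.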

It's usually easier to reproduce errors locally. Check the kind of error and try one of the following steps:

  • Install a conda dependency locally: conda install suspicious-dependency==X.Y.Z
  • Install a pip dependency locally: pip install suspicious-dependency==X.Y.Z
  • Try to materialize the entire environment: conda env create -f conda-specification.yml

Important

Make sure that the platform and interpreter on your local machine match the ones on the remote compute cluster.
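One way to compare the two sides is to run the same short script locally and on the remote compute and then compare the output. The following is a minimal sketch that uses only the Python standard library. The function name and the choice of fields are assumptions about what is most likely to affect package resolution.

```python
import platform
import sys


def describe_interpreter():
    """Collect the platform and interpreter details that affect package resolution."""
    return {
        "system": platform.system(),            # for example Linux, Windows, Darwin
        "machine": platform.machine(),          # CPU architecture
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "pointer_bits": 64 if sys.maxsize > 2**32 else 32,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If any field differs, a package that installs locally might have no matching distribution on the remote side, or the reverse.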

Timeout

The following network issues can cause timeout errors:

  • Low internet bandwidth
  • Server issues
  • Large dependencies that can't be downloaded with the given conda or pip timeout settings

Messages similar to the following examples indicate the issue:

('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
ReadTimeoutError("HTTPSConnectionPool(host='****', port=443): Read timed out. (read timeout=15)",)

If you get one of these error messages, try one of the following possible solutions:

  • Try a different source for the dependency, such as a mirror, Azure Blob Storage, or another Python feed.
  • Update conda or pip. If you're using a custom Dockerfile, update the timeout settings.
  • Some pip versions have known issues. Consider adding a specific version of pip to the environment dependencies.
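When reproducing a timeout locally, it can help to retry the install with a longer timeout, more retries, or a different index. The sketch below builds such a pip command by using pip's --timeout, --retries, and --index-url options. The function name and the default values are illustrative assumptions, and the package specifier is the placeholder used earlier in this article.

```python
import subprocess
import sys


def build_pip_install_command(requirement, timeout_seconds=120, retries=10, index_url=None):
    """Build a pip install command with a longer timeout and more retries."""
    command = [
        sys.executable, "-m", "pip", "install",
        "--timeout", str(timeout_seconds),  # pip's default is 15 seconds
        "--retries", str(retries),
    ]
    if index_url:
        # Optional alternative package source, such as a mirror or private feed.
        command += ["--index-url", index_url]
    command.append(requirement)
    return command


if __name__ == "__main__":
    # Placeholder specifier: replace it with the dependency that timed out.
    command = build_pip_install_command("suspicious-dependency==X.Y.Z")
    print(" ".join(command))
    # To actually run the install, uncomment the next line:
    # subprocess.run(command, check=True)
```

The same options can be added to a pip install line in a custom Dockerfile.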

Package not found

The following errors are the most common causes of image build failures:

  • Conda package couldn't be found:

    ResolvePackageNotFound: 
    - not-existing-conda-package
    
  • Specified pip package or version couldn't be found:

    ERROR: Could not find a version that satisfies the requirement invalid-pip-package (from versions: none)
    ERROR: No matching distribution found for invalid-pip-package
    
  • Bad nested pip dependency:

    ERROR: No matching distribution found for bad-package==0.0 (from good-package==1.0)
    

Check that the package exists on the specified sources. The pip search command no longer works against PyPI, which has disabled the search API that the command relied on. To verify pip dependencies, use pip index versions instead (available in pip 21.2 and later, and still marked experimental):

  • pip index versions azureml-core

For conda dependencies, use conda search:

  • conda search conda-forge::numpy

For more options, try:

  • pip index -h
  • conda search -h

Installer notes

Make sure that the required distribution exists for the specified platform and Python interpreter version.

For pip dependencies, go to https://pypi.org/project/[PROJECT NAME]/[VERSION]/#files to see if the required version is available. For an example, see https://pypi.org/project/azureml-core/1.11.0/#files.

For conda dependencies, check the package on the channel repository. For channels maintained by Anaconda, Inc., check the Anaconda Packages page.
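The same check can be scripted. The following sketch builds the files URL described above and, as an alternative, lists the distribution files for a version from the PyPI JSON API at https://pypi.org/pypi/[PROJECT NAME]/json. The function names are illustrative, and the network call is separated out so that the rest can be used offline.

```python
import json
import urllib.request


def pypi_files_url(project, version):
    """Build the PyPI page URL that lists the files for one release."""
    return f"https://pypi.org/project/{project}/{version}/#files"


def fetch_pypi_metadata(project):
    """Download project metadata from the PyPI JSON API (requires network access)."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def release_filenames(metadata, version):
    """List the distribution file names for one version from PyPI JSON metadata."""
    files = metadata.get("releases", {}).get(version, [])
    return [entry["filename"] for entry in files]


if __name__ == "__main__":
    print(pypi_files_url("azureml-core", "1.11.0"))
    for filename in release_filenames(fetch_pypi_metadata("azureml-core"), "1.11.0"):
        print(filename)
```

Wheel file names encode the supported Python version and platform, so an empty list or a list without a matching wheel suggests that the required distribution isn't available for your target.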

Pip package update

During an installation or an update of a pip package, the resolver might need to update an already-installed package to satisfy the new requirements. Doing so means uninstalling the existing version first, and that uninstallation can fail for various reasons related to the pip version or the way the dependency was installed. The most common scenario is that a dependency installed by conda can't be uninstalled by pip. For this scenario, consider uninstalling the dependency by using conda remove mypackage. The failure typically produces output like the following:

  Attempting uninstall: mypackage
    Found existing installation: mypackage X.Y.Z
ERROR: Cannot uninstall 'mypackage'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Installer issues

Certain installer versions have issues in their package resolvers that can lead to a build failure.

If you're using a custom base image or Dockerfile, we recommend using conda version 4.5.4 or later.

Installing pip dependencies requires pip itself. If no pip version is specified in the environment, the latest version is used. We recommend using a known version of pip to avoid transient issues or breaking changes that the latest version of the tool might cause.

Consider pinning the pip version in your environment if you see the following message:

Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.

Pip subprocess error:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, update the hashes as well. Otherwise, examine the package contents carefully; someone may have tampered with them.

Pip installation can get stuck in an infinite loop if there are unresolvable conflicts in the dependencies. If you're working locally, downgrade pip to a version earlier than 20.3. In a conda environment created from a YAML file, you'll see this issue only if conda-forge is the highest-priority channel. To mitigate the issue, explicitly specify pip<20.3 as a conda dependency in the conda specification file. Alternatively, exclude the affected version with pip!=20.3, or pin an exact earlier version such as pip=20.2.4.
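To show where the pin goes, the following sketch renders a small conda specification file in which pip is listed as an explicit conda dependency ahead of the nested pip section. The environment name, Python version, and package names in the usage example are hypothetical, and the helper function is only an illustration of the file layout.

```python
def render_conda_spec(name, python_version, conda_packages, pip_packages, pip_pin="pip<20.3"):
    """Render a conda specification with pip pinned as a conda dependency."""
    lines = [
        f"name: {name}",
        "dependencies:",
        f"  - python={python_version}",
        f"  - {pip_pin}",  # explicit pip pin, listed as a conda dependency
    ]
    lines += [f"  - {package}" for package in conda_packages]
    if pip_packages:
        lines.append("  - pip:")  # packages that pip installs
        lines += [f"    - {package}" for package in pip_packages]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical environment name, Python version, and packages.
    print(render_conda_spec("example-env", "3.8", ["numpy"], ["azureml-core"]))
```

Listing pip explicitly also avoids the conda warning about pip-installed dependencies without a pip entry.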

Service-side failures

See the following scenarios to troubleshoot possible service-side failures.

You're unable to pull an image from a container registry, or the address couldn't be resolved for a container registry

Possible issues:

  • The path name to the container registry might not be resolving correctly. Check that image names use double slashes and that the slash direction is correct for Linux versus Windows hosts.
  • If a container registry behind a virtual network is using a private endpoint in an unsupported region, configure the container registry by using the service endpoint (public access) from the portal and retry.
  • After you put the container registry behind a virtual network, run the Azure Resource Manager template so the workspace can communicate with the container registry instance.

You get a 401 error from a workspace container registry

Resynchronize storage keys by using ws.sync_keys().
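A minimal sketch of that call with the Azure Machine Learning Python SDK (azureml-core) follows. It assumes that a workspace config.json file is available locally. The wrapper function and its parameter are illustrative.

```python
def resync_workspace_keys(config_path=None):
    """Load a workspace from a local config.json and resynchronize its keys."""
    # Imported lazily so this module loads even where azureml-core isn't installed.
    from azureml.core import Workspace

    # Assumption: config.json is discoverable from config_path or the current directory.
    ws = Workspace.from_config(path=config_path)
    ws.sync_keys()
    return ws


if __name__ == "__main__":
    resync_workspace_keys()
```

The key update might take some time to propagate, so retry the failing operation after a short wait.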

The environment keeps throwing a "Waiting for other conda operations to finish…" error

When an image build is ongoing, conda is locked by the SDK client. If the process crashed or was canceled incorrectly by the user, conda stays in the locked state. To resolve this issue, manually delete the lock file.

Your custom Docker image isn't in the registry

Check that the correct tag is used and that user_managed_dependencies = True. Setting Environment.python.user_managed_dependencies = True disables conda and uses the user's installed packages.
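The following sketch shows how these settings might be applied with the Azure Machine Learning Python SDK (azureml-core). The wrapper function, the environment name, and the image reference in the usage example are hypothetical.

```python
def build_user_managed_environment(name, image):
    """Create an environment that uses a custom image and skips conda."""
    # Imported lazily so this module loads even where azureml-core isn't installed.
    from azureml.core import Environment

    env = Environment(name=name)
    # Full image reference, including the tag that exists in the registry.
    env.docker.base_image = image
    # Use the packages already installed in the image instead of building with conda.
    env.python.user_managed_dependencies = True
    return env


if __name__ == "__main__":
    # Hypothetical environment name and image reference.
    env = build_user_managed_environment("custom-image-env", "myregistry.azurecr.io/my-image:1.0")
    print(env.docker.base_image, env.python.user_managed_dependencies)
```

If the tag in the image reference doesn't exist in the registry, the image pull fails regardless of the dependency setting.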

You get one of the following common virtual network issues

  • Check that the storage account, compute cluster, and container registry are all in the same subnet of the virtual network.
  • When your container registry is behind a virtual network, it can't be used directly to build images. You'll need to use the compute cluster to build images.
  • Storage might need to be placed behind a virtual network if you:
    • Use inferencing or private wheel.
    • See 403 (not authorized) service errors.
    • Can't get image details from Azure Container Registry.

The image build fails when you're trying to access network protected storage

  • Azure Container Registry tasks don't work behind a virtual network. If your container registry is behind a virtual network, use the compute cluster to build the image.
  • Storage should be behind a virtual network in order to pull dependencies from it.

You can't run experiments when storage has network security enabled

If you're using default Docker images and enabling user-managed dependencies, use the MicrosoftContainerRegistry and AzureFrontDoor.FirstParty service tags to allowlist Azure Container Registry and its dependencies.

For more information, see Enabling virtual networks.

You need to create an ICM

When you're creating an ICM or assigning one to Metastore, include the CSS support ticket so that we can better understand the issue.

Next steps