Troubleshoot Docker deployment of models with Azure Kubernetes Service and Azure Container Instances

Learn how to troubleshoot and solve, or work around, common Docker deployment errors with Azure Container Instances (ACI) and Azure Kubernetes Service (AKS) using Azure Machine Learning.

Prerequisites

Steps for Docker deployment of machine learning models

When deploying a model in Azure Machine Learning, the system performs a number of tasks.

The recommended approach for model deployment is via the Model.deploy() API using an Environment object as an input parameter. In this case, the service creates a base Docker image during the deployment stage and mounts the required models, all in one call. The basic deployment tasks are:

  1. Register the model in the workspace model registry.

  2. Define the inference configuration:

    1. Create an Environment object based on the dependencies you specify in the environment YAML file, or use one of our curated environments.
    2. Create an inference configuration (InferenceConfig object) based on the environment and the scoring script.
  3. Deploy the model to the Azure Container Instances (ACI) service or to Azure Kubernetes Service (AKS).

Learn more about this process in the Model Management introduction.

Before you begin

If you run into any issue, the first thing to do is to break down the deployment task (described previously) into individual steps to isolate the problem.

Assuming you are using the new/recommended deployment method via the Model.deploy() API with an Environment object as an input parameter, your code can be broken down into three major steps:

  1. Register the model. Here is some sample code:

    from azureml.core.model import Model
    
    
    # register a model out of a run record
    model = best_run.register_model(model_name='my_best_model', model_path='outputs/my_model.pkl')
    
    # or, you can register a file or a folder of files as a model
    model = Model.register(model_path='my_model.pkl', model_name='my_best_model', workspace=ws)
    
  2. Define the inference configuration for deployment:

    from azureml.core.model import InferenceConfig
    from azureml.core.environment import Environment
    
    
    # create inference configuration based on the requirements defined in the YAML
    myenv = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")
    inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
    
  3. Deploy the model using the inference configuration created in the previous step:

    from azureml.core.webservice import AciWebservice
    
    
    # deploy the model
    aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
    aci_service = Model.deploy(workspace=ws,
                               name='my-service',
                               models=[model],
                               inference_config=inference_config,
                               deployment_config=aci_config)
    aci_service.wait_for_deployment(show_output=True)
    

Once you have broken down the deployment process into individual tasks, you can look at some of the most common errors.

Debug locally

If you encounter problems deploying a model to ACI or AKS, try deploying it as a local web service. Using a local web service makes it easier to troubleshoot problems. The Docker image containing the model is downloaded and started on your local system.

You can find a sample local deployment notebook in the MachineLearningNotebooks repo to explore a runnable example.

Warning

Local web service deployments are not supported for production scenarios.

To deploy locally, modify your code to use LocalWebservice.deploy_configuration() to create a deployment configuration. Then use Model.deploy() to deploy the service. The following example deploys a model (contained in the model variable) as a local web service:

from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import LocalWebservice

# Create inference configuration based on the environment definition and the entry script
myenv = Environment.from_conda_specification(name="env", file_path="myenv.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)

# Create a local deployment, using port 8890 for the web service endpoint
deployment_config = LocalWebservice.deploy_configuration(port=8890)
# Deploy the service
service = Model.deploy(
    ws, "mymodel", [model], inference_config, deployment_config)
# Wait for the deployment to complete
service.wait_for_deployment(True)
# Display the port that the web service is available on
print(service.port)

If you are defining your own conda specification YAML, you must list azureml-defaults with version >= 1.0.45 as a pip dependency. This package contains the functionality needed to host the model as a web service.
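
For example, a minimal conda specification YAML might look like the following sketch; everything other than the azureml-defaults entry is illustrative and should be replaced with your own dependencies:

name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.7
  - scikit-learn
  - pip
  - pip:
    # required to host the model as a web service
    - azureml-defaults>=1.0.45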

At this point, you can work with the service as normal. For example, the following code demonstrates sending data to the service:

import json

test_sample = json.dumps({'data': [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
]})

test_sample = bytes(test_sample, encoding='utf8')

prediction = service.run(input_data=test_sample)
print(prediction)

For more information on customizing your Python environment, see Create and manage environments for training and deployment.

Update the service

During local testing, you may need to update the score.py file to add logging or attempt to resolve any problems that you've discovered. To reload changes to the score.py file, use reload(). For example, the following code reloads the script for the service and then sends data to it; the data is scored using the updated score.py file:

Important

The reload method is only available for local deployments. For information on updating a deployment to another compute target, see how to update your webservice.

service.reload()
print(service.run(input_data=test_sample))

Note

The script is reloaded from the location specified by the InferenceConfig object used by the service.

To change the model, Conda dependencies, or deployment configuration, use update(). The following example updates the model used by the service:

service.update([different_model], inference_config, deployment_config)

Delete the service

To delete the service, use delete().
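
For example:

service.delete()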

Inspect the Docker log

You can print out detailed Docker engine log messages from the service object. You can view the logs for ACI, AKS, and local deployments. The following example demonstrates how to print the logs.

# if you already have the service object handy
print(service.get_logs())

# if you only know the name of the service (note there might be multiple services with the same name but different version number)
print(ws.webservices['mysvc'].get_logs())

If you see the line Booting worker with pid: <pid> occurring multiple times in the logs, it means there isn't enough memory to start the worker. You can address the error by increasing the value of memory_gb in deployment_config.
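
For example, the following sketch requests more memory for an ACI deployment; the values are illustrative, so size them to your model:

from azureml.core.webservice import AciWebservice

# request more memory so the worker process can start
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4)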

Container cannot be scheduled

When deploying a service to an Azure Kubernetes Service compute target, Azure Machine Learning will attempt to schedule the service with the requested amount of resources. If no nodes in the cluster have the appropriate amount of resources available after 5 minutes, the deployment will fail with the message Couldn't Schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00. You can address this error by adding more nodes, changing the SKU of your nodes, or changing the resource requirements of your service.

The error message will typically indicate which resource you need more of. For instance, an error message indicating 0/3 nodes are available: 3 Insufficient nvidia.com/gpu means that the service requires GPUs and the three nodes in the cluster have no GPUs available. You could address this by adding more nodes if you are using a GPU SKU, by switching to a GPU-enabled SKU if you are not, or by changing your environment to not require GPUs.
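
As an example of changing the service's resource requirements, the following sketch lowers the CPU and memory the service requests from the AKS cluster; the values are illustrative:

from azureml.core.webservice import AksWebservice

# request fewer resources so the service fits on an existing node
deployment_config = AksWebservice.deploy_configuration(cpu_cores=0.5, memory_gb=1)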

Service launch fails

After the image is successfully built, the system attempts to start a container using your deployment configuration. As part of the container start-up process, the init() function in your scoring script is invoked by the system. If there are uncaught exceptions in the init() function, you might see a CrashLoopBackOff error in the error message.

Use the information in the Inspect the Docker log section to check the logs.
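
One way to make init() failures easier to diagnose is to log the exception before re-raising it, so the traceback shows up in the output of get_logs(). The following sketch assumes a scikit-learn model loaded with joblib; the model name is illustrative:

import logging
import joblib
from azureml.core.model import Model

def init():
    global model
    try:
        # locate the registered model file inside the container
        model_path = Model.get_model_path(model_name='my-best-model')
        model = joblib.load(model_path)
    except Exception:
        # the traceback is written to the Docker log before the container exits
        logging.exception('init() failed')
        raise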

Function fails: get_model_path()

Often, in the init() function in the scoring script, the Model.get_model_path() function is called to locate a model file or a folder of model files in the container. If the model file or folder cannot be found, the function fails. The easiest way to debug this error is to run the following Python code in the container shell:

from azureml.core.model import Model
import logging
logging.basicConfig(level=logging.DEBUG)
print(Model.get_model_path(model_name='my-best-model'))

This example prints out the local path (relative to /var/azureml-app) in the container where your scoring script expects to find the model file or folder. Then you can verify whether the file or folder is indeed where it is expected to be.

Setting the logging level to DEBUG may cause additional information to be logged, which may be useful in identifying the failure.

Function fails: run(input_data)

If the service is successfully deployed, but it crashes when you post data to the scoring endpoint, you can add an error-catching statement in your run(input_data) function so that it returns a detailed error message instead. For example:

import json
import numpy as np

def run(input_data):
    try:
        data = json.loads(input_data)['data']
        data = np.array(data)
        result = model.predict(data)  # model is loaded by init()
        return json.dumps({"result": result.tolist()})
    except Exception as e:
        result = str(e)
        # return error message back to the client
        return json.dumps({"error": result})

Note: Returning error messages from the run(input_data) call should be done for debugging purposes only. For security reasons, you should not return error messages this way in a production environment.

HTTP status code 502

A 502 status code indicates that the service has thrown an exception or crashed in the run() method of the score.py file. Use the information in this article to debug the file.

HTTP status code 503

Azure Kubernetes Service deployments support autoscaling, which allows replicas to be added to support additional load. However, the autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients may receive an HTTP status code 503.

There are two things that can help prevent 503 status codes:

  • Change the utilization level at which autoscaling creates new replicas.

    By default, the autoscaling target utilization is set to 70%, which means that the service can handle spikes in requests per second (RPS) of up to 30%. You can adjust the utilization target by setting autoscale_target_utilization to a lower value.

    Important

    This change does not cause replicas to be created faster; instead, they are created at a lower utilization threshold. Rather than waiting until the service is 70% utilized, changing the value to 30% causes replicas to be created when 30% utilization occurs.

    If the web service is already using the current max replicas and you are still seeing 503 status codes, increase the autoscale_max_replicas value to increase the maximum number of replicas.

  • Change the minimum number of replicas. Increasing the minimum replicas provides a larger pool to handle incoming spikes.

    To increase the minimum number of replicas, set autoscale_min_replicas to a higher value. You can calculate the required replicas by using the following code, replacing the values with values specific to your project:

    from math import ceil
    # target requests per second
    targetRps = 20
    # time to process the request (in seconds)
    reqTime = 10
    # Maximum requests per container
    maxReqPerContainer = 1
    # target_utilization. 70% in this example
    targetUtilization = .7
    
    concurrentRequests = targetRps * reqTime / targetUtilization
    
    # Number of container replicas
    replicas = ceil(concurrentRequests / maxReqPerContainer)
    

    Note

    If you receive request spikes larger than the new minimum replicas can handle, you may receive 503s again. For example, as traffic to your service increases, you may need to increase the minimum replicas.

For more information on setting autoscale_target_utilization, autoscale_max_replicas, and autoscale_min_replicas, see the AksWebservice module reference.
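
The following sketch shows these settings together on an AKS deployment configuration; the numbers are illustrative:

from azureml.core.webservice import AksWebservice

deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    autoscale_enabled=True,
    autoscale_target_utilization=30,  # create replicas at 30% utilization rather than the 70% default
    autoscale_min_replicas=3,         # larger baseline pool for incoming spikes
    autoscale_max_replicas=10)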

HTTP status code 504

A 504 status code indicates that the request has timed out. The default timeout is 1 minute.

You can increase the timeout, or try to speed up the service by modifying score.py to remove unnecessary calls. If these actions do not correct the problem, use the information in this article to debug the score.py file. The code may be in a non-responsive state or an infinite loop.
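
If the service genuinely needs more time per request, you can raise the timeout on an AKS deployment through scoring_timeout_ms; the value below is illustrative:

from azureml.core.webservice import AksWebservice

# allow two minutes per scoring request instead of the one-minute default
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                       scoring_timeout_ms=120000)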

Advanced debugging

In some cases, you may need to interactively debug the Python code contained in your model deployment; for example, if the entry script is failing and the reason cannot be determined by additional logging. By using Visual Studio Code and debugpy, you can attach to the code running inside the Docker container. For more information, visit the interactive debugging in VS Code guide.

Model deployment user forum

Next steps

Learn more about deployment: