Troubleshooting machine learning pipelines

In this article, you learn how to troubleshoot when you get errors running a machine learning pipeline in the Azure Machine Learning SDK and Azure Machine Learning designer.

Troubleshooting tips

The following table contains common problems during pipeline development, with potential solutions.

Problem: Unable to pass data to PipelineData directory
Possible solution: Ensure you have created a directory in the script that corresponds to where your pipeline expects the step output data. In most cases, an input argument will define the output directory, and then you create the directory explicitly. Use os.makedirs(args.output_dir, exist_ok=True) to create the output directory. See the tutorial for a scoring script example that shows this design pattern.

Problem: Dependency bugs
Possible solution: If you see dependency errors in your remote pipeline that did not occur when testing locally, confirm that your remote environment dependencies and versions match those in your test environment. See Environment building, caching, and reuse.

Problem: Ambiguous errors with compute targets
Possible solution: Try deleting and re-creating compute targets. Re-creating compute targets is quick and can solve some transient issues.

Problem: Pipeline not reusing steps
Possible solution: Step reuse is enabled by default, but ensure you haven't disabled it in a pipeline step. If reuse is disabled, the allow_reuse parameter in the step will be set to False.

Problem: Pipeline is rerunning unnecessarily
Possible solution: To ensure that steps only rerun when their underlying data or scripts change, decouple the source-code directories for each step. If you use the same source directory for multiple steps, you may experience unnecessary reruns. Use the source_directory parameter on a pipeline step object to point to an isolated directory for that step, and ensure you aren't using the same source_directory path for multiple steps.

Problem: Step slowing down over training epochs or other looping behavior
Possible solution: Try switching any file writes, including logging, from as_mount() to as_upload(). Mount mode uses a remote virtualized filesystem and uploads the entire file each time it is appended to.
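The output-directory pattern from the first row can be sketched as a minimal step script. This is an illustrative sketch: the --output_dir argument name and the scores.txt file are assumptions for the example, not fixed SDK names.

```python
import argparse
import os

def main(argv=None):
    # The pipeline passes the PipelineData location to the step as a
    # command-line argument; the script must create that directory
    # itself before writing into it.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True,
                        help="directory where this step writes its output")
    args = parser.parse_args(argv)

    # Create the output directory; exist_ok=True makes reruns safe.
    os.makedirs(args.output_dir, exist_ok=True)

    out_path = os.path.join(args.output_dir, "scores.txt")
    with open(out_path, "w") as f:
        f.write("0.95\n")
    return out_path
```

When the step runs remotely, the pipeline supplies the actual PipelineData path for --output_dir; locally you can pass any scratch directory.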

Authentication errors

If you perform a management operation on a compute target from a remote job, you will receive one of the following errors:

{"code":"Unauthorized","statusCode":401,"message":"Unauthorized","details":[{"code":"InvalidOrExpiredToken","message":"The request token was either invalid or expired. Please try again with a valid token."}]}
{"error":{"code":"AuthenticationFailed","message":"Authentication failed."}}

For example, you will receive an error if you try to create or attach a compute target from an ML pipeline that is submitted for remote execution.

Debugging techniques

There are three major techniques for debugging pipelines:

  • Debug individual pipeline steps on your local computer
  • Use logging and Application Insights to isolate and diagnose the source of the problem
  • Attach a remote debugger to a pipeline running in Azure

Debug scripts locally

One of the most common failures in a pipeline is that the domain script does not run as intended, or contains runtime errors in the remote compute context that are difficult to debug.

Pipelines themselves cannot be run locally, but running the scripts in isolation on your local machine lets you debug faster because you don't have to wait for the compute and environment build process. Some development work is required to do this:

  • If your data is in a cloud datastore, you need to download the data and make it available to your script. Using a small sample of your data is a good way to cut down on runtime and quickly get feedback on script behavior.
  • If you are simulating an intermediate pipeline step, you may need to manually build the object types that the script expects from the prior step.
  • You also need to define your own environment and replicate the dependencies defined in your remote compute environment.
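For the second bullet, one way to stand in for an upstream step's output locally is to write a small sample file in the shape the next script expects. The features.csv file name and its columns below are hypothetical, chosen only to illustrate the pattern.

```python
import csv
import os
import tempfile

def make_fake_upstream_output(n_rows: int = 5) -> str:
    """Write a small CSV sample that stands in for the previous
    step's output, so the next step's script can run locally
    without downloading the full dataset."""
    out_dir = tempfile.mkdtemp(prefix="fake_step_output_")
    path = os.path.join(out_dir, "features.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature", "label"])
        for i in range(n_rows):
            writer.writerow([i * 0.1, i % 2])
    return path
```

You can then point the downstream script's input argument at this directory and iterate quickly on its logic.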

Once you have a script set up to run in your local environment, it is much easier to do debugging tasks like:

  • Attaching a custom debug configuration
  • Pausing execution and inspecting object state
  • Catching type or logical errors that won't be exposed until runtime


Once you can verify that your script is running as expected, a good next step is running the script in a single-step pipeline before attempting to run it in a pipeline with multiple steps.

Configure, write to, and review pipeline logs

Testing scripts locally is a great way to debug major code fragments and complex logic before you start building a pipeline, but at some point you will likely need to debug scripts during the actual pipeline run itself, especially when diagnosing behavior that occurs during the interaction between pipeline steps. We recommend liberal use of print() statements in your step scripts so that you can see object state and expected values during remote execution, similar to how you would debug JavaScript code.
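As a sketch of that advice, the prints can be made consistent with a small helper. This is not an SDK utility, just one possible pattern for keeping state dumps readable in the driver logs.

```python
def print_state(name, obj):
    """Print a one-line summary of an object's type, length, and a
    value preview, so driver logs show what each step actually
    received during remote execution."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    preview = repr(obj)
    if len(preview) > 60:
        preview = preview[:57] + "..."
    line = f"[state] {name}: type={type(obj).__name__} len={size} value={preview}"
    print(line)
    return line
```

Calling print_state("rows", some_list) at the top of a step script makes it easy to spot an unexpected shape or type in the logs.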

Logging options and behavior

The following table provides information on different debug options for pipelines. It isn't an exhaustive list, as other options exist besides the Azure Machine Learning, Python, and OpenCensus ones shown here.

Library: Azure Machine Learning SDK (azureml.core.Run class)
Type: Metric
Example: run.log(name, val)
Destination: Azure Machine Learning portal UI
Resources: How to track experiments

Library: Python printing/logging
Type: Log
Example: print(val)
Destination: Driver logs, Azure Machine Learning designer
Resources: How to track experiments; Python logging

Library: OpenCensus Python
Type: Log
Example: logger.addHandler(AzureLogHandler())
Destination: Application Insights - traces
Resources: Debug pipelines in Application Insights; OpenCensus Azure Monitor Exporters; Python logging cookbook

Logging options example

import logging

from azureml.core.run import Run
from opencensus.ext.azure.log_exporter import AzureLogHandler

run = Run.get_context()

# Azure ML scalar value logging
run.log("scalar_value", 0.95)

# Python print statement
print("I am a python print statement, I will be sent to the driver logs.")

# Initialize Python logger; lower the level so debug/info records are emitted
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Plain Python logging statements
logger.debug("I am a plain debug statement, I will be sent to the driver logs.")
logger.info("I am a plain info statement, I will be sent to the driver logs.")

# Attach the OpenCensus handler so records are exported to Application Insights
handler = AzureLogHandler(connection_string='<connection string>')
logger.addHandler(handler)

# Python logging with OpenCensus AzureLogHandler
logger.warning("I am an OpenCensus warning statement, find me in Application Insights!")
logger.error(
    "I am an OpenCensus error statement with custom dimensions",
    extra={"custom_dimensions": {"step_id": run.id}},
)

Azure Machine Learning designer

For pipelines created in the designer, you can find the 70_driver_log file on either the authoring page or the pipeline run detail page.

Enable logging for real-time endpoints

To troubleshoot and debug real-time endpoints in the designer, you must enable Application Insights logging using the SDK. Logging lets you troubleshoot and debug model deployment and usage issues. For more information, see Logging for deployed models.

Get logs from the authoring page

When you submit a pipeline run and stay on the authoring page, you can find the log files generated for each module as each module finishes running.

  1. Select a module that has finished running in the authoring canvas.

  2. In the right pane of the module, go to the Outputs + logs tab.

  3. Expand the right pane, and select 70_driver_log.txt to view the file in the browser. You can also download the logs locally.


Get logs from pipeline runs

You can also find the log files for a specific run on the pipeline run detail page, located in either the Pipelines or Experiments section of the studio.

  1. Select a pipeline run created in the designer.

  2. Select a module in the preview pane.

  3. In the right pane of the module, go to the Outputs + logs tab.

  4. Expand the right pane to view the 70_driver_log.txt file in the browser, or select the file to download the logs locally.


To update a pipeline from the pipeline run detail page, you must clone the pipeline run to a new pipeline draft. A pipeline run is a snapshot of the pipeline. It's similar to a log file, and cannot be altered.

Application Insights

For more information on using the OpenCensus Python library in this manner, see Debug and troubleshoot machine learning pipelines in Application Insights.

Interactive debugging with Visual Studio Code

In some cases, you may need to interactively debug the Python code used in your ML pipeline. By using Visual Studio Code (VS Code) and debugpy, you can attach to the code as it runs in the training environment. For more information, see the interactive debugging in VS Code guide.
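A common pattern with debugpy is to gate the listener behind an environment variable so the same step script runs normally outside a debug session. The AML_DEBUG variable name below is an assumed convention for this sketch; debugpy.listen and debugpy.wait_for_client are the actual debugpy calls.

```python
import os

def maybe_attach_debugger(port: int = 5678) -> bool:
    """Start a debugpy listener and block until VS Code attaches,
    but only when the (hypothetical) AML_DEBUG variable is set to 1.
    Returns True if a debug session was started."""
    if os.environ.get("AML_DEBUG") != "1":
        return False
    import debugpy  # imported lazily so normal runs don't require it
    debugpy.listen(("0.0.0.0", port))
    print(f"Waiting for VS Code to attach on port {port}...")
    debugpy.wait_for_client()
    return True
```

Call maybe_attach_debugger() at the top of the step script; when AML_DEBUG is unset the step proceeds without pausing.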

Next steps