Debug and troubleshoot machine learning pipelines

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, you learn how to debug and troubleshoot machine learning pipelines in the Azure Machine Learning SDK and Azure Machine Learning designer (preview). Information is provided on how to:

  • Debug using the Azure Machine Learning SDK
  • Debug using the Azure Machine Learning designer
  • Debug using Application Insights
  • Debug interactively using Visual Studio Code (VS Code) and the Python Tools for Visual Studio (PTVSD)

Azure Machine Learning SDK

The following sections provide an overview of common pitfalls when building pipelines, and different strategies for debugging code that runs in a pipeline. Use the following tips when you're having trouble getting a pipeline to run as expected.

Testing scripts locally

One of the most common failures in a pipeline is that an attached script (data cleansing script, scoring script, etc.) does not run as intended, or contains runtime errors in the remote compute context that are difficult to debug in your workspace in the Azure Machine Learning studio.

Pipelines themselves cannot be run locally, but running the scripts in isolation on your local machine lets you debug faster because you don't have to wait for the compute and environment build process. Some development work is required to do this:

  • If your data is in a cloud datastore, you need to download the data and make it available to your script. Using a small sample of your data is a good way to cut down on runtime and quickly get feedback on script behavior.
  • If you are attempting to simulate an intermediate pipeline step, you may need to manually build the object types that the script expects from the prior step.
  • You also need to define your own environment, and replicate the dependencies defined in your remote compute environment.

Once you have a script set up to run in your local environment, it is much easier to do debugging tasks like:

  • Attaching a custom debug configuration
  • Pausing execution and inspecting object state
  • Catching type or logical errors that won't be exposed until runtime
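As a rough illustration of the local workflow described above, the sketch below builds a tiny stand-in for the prior step's output and runs the step logic against it. The function `clean_rows`, the column names, and the sample data are all hypothetical; they are not part of the Azure Machine Learning SDK or of any script referenced in this article.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(input_path, output_path):
    """Hypothetical step logic: drop rows whose 'value' field is empty."""
    with open(input_path, newline="") as src:
        rows = [row for row in csv.DictReader(src) if row["value"].strip()]
    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def run_local_check():
    """Create a small sample file in a temp directory and run the step logic on it."""
    with tempfile.TemporaryDirectory() as workdir:
        # Stand-in for data that the previous pipeline step would have produced.
        sample = Path(workdir) / "sample.csv"
        sample.write_text("id,value\n1,3.5\n2,\n3,7.1\n")
        cleaned = Path(workdir) / "cleaned.csv"
        kept = clean_rows(sample, cleaned)
        print(f"kept {kept} of 3 rows")
        return kept


kept_rows = run_local_check()
```

Keeping the step logic in a function that takes plain paths makes it straightforward to set breakpoints in it locally and to call it unchanged from the pipeline script later.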

Tip

Once you can verify that your script is running as expected, a good next step is running the script in a single-step pipeline before attempting to run it in a pipeline with multiple steps.

Debugging scripts from remote context

Testing scripts locally is a great way to debug major code fragments and complex logic before you start building a pipeline, but at some point you will likely need to debug scripts during the actual pipeline run, especially when diagnosing behavior that occurs during the interaction between pipeline steps. We recommend liberal use of print() statements in your step scripts so that you can see object state and expected values during remote execution, similar to how you would debug JavaScript code.

The log file 70_driver_log.txt contains:

  • All printed statements during your script's execution
  • The stack trace for the script
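One way to make print-based debugging easier to scan in a log file is to route it through a small helper. The sketch below is only an illustration: the helper name `describe` and the `[debug]` prefix are assumptions, not an SDK convention.

```python
def describe(name, obj):
    """Print a one-line summary of an object: its type and, if available, its length."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    line = f"[debug] {name}: type={type(obj).__name__}, len={size}"
    # flush=True pushes the line out immediately instead of waiting for the buffer to fill.
    print(line, flush=True)
    return line


# Hypothetical data handed over from a previous step.
records = [{"id": 1}, {"id": 2}]
summary = describe("records", records)
```

A consistent prefix makes it simple to search the log for your own statements among the other output.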

To find this and other log files in the portal, first click on the pipeline run in your workspace.

[Image: Pipeline run list page]

Navigate to the pipeline run detail page.

[Image: Pipeline run detail page]

Click on the module for the specific step. Navigate to the Logs tab. Other logs include information about your environment image build process and step preparation scripts.

[Image: Logs tab on the pipeline run detail page]

Tip

Runs for published pipelines can be found in the Endpoints tab in your workspace. Runs for non-published pipelines can be found in Experiments or Pipelines.

Troubleshooting tips

The following table contains common problems during pipeline development, with potential solutions.

Problem | Possible solution
Unable to pass data to PipelineData directory | Ensure you have created a directory in the script that corresponds to where your pipeline expects the step output data. In most cases, an input argument defines the output directory, and you then create the directory explicitly. Use os.makedirs(args.output_dir, exist_ok=True) to create the output directory. See the tutorial for a scoring script example that shows this design pattern.
Dependency bugs | If you have developed and tested scripts locally but find dependency issues when running on a remote compute in the pipeline, ensure your compute environment dependencies and versions match your test environment. (See Environment building, caching, and reuse.)
Ambiguous errors with compute targets | Deleting and re-creating compute targets can solve certain issues with compute targets.
Pipeline not reusing steps | Step reuse is enabled by default, but ensure you haven't disabled it in a pipeline step. If reuse is disabled, the allow_reuse parameter in the step will be set to False.
Pipeline is rerunning unnecessarily | To ensure that steps only rerun when their underlying data or scripts change, decouple your directories for each step. If you use the same source directory for multiple steps, you may experience unnecessary reruns. Use the source_directory parameter on a pipeline step object to point to your isolated directory for that step, and ensure you aren't using the same source_directory path for multiple steps.
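The output directory pattern from the first row of the table can be sketched as follows. The argument name `--output_dir` matches the `args.output_dir` used above; the function name, the file name `result.txt`, and the temporary directory used for the demo call are illustrative assumptions.

```python
import argparse
import os
import tempfile


def write_step_output(argv):
    """Parse the output directory argument, create the directory, and write one file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    args = parser.parse_args(argv)

    # Create the directory explicitly; exist_ok=True avoids an error if it already exists.
    os.makedirs(args.output_dir, exist_ok=True)

    out_path = os.path.join(args.output_dir, "result.txt")
    with open(out_path, "w") as f:
        f.write("done\n")
    return out_path


# Local demo call; in a pipeline the step's arguments would supply the path instead.
demo_root = tempfile.mkdtemp()
result_path = write_step_output(["--output_dir", os.path.join(demo_root, "step_output")])
```

In a real step script you would typically pass `None` (or nothing) so that argparse reads the arguments the pipeline supplies on the command line.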

Logging options and behavior

The table below provides information about different debug options for pipelines. It isn't an exhaustive list, as other options exist besides the Azure Machine Learning, Python, and OpenCensus ones shown here.

Library | Type | Example | Destination | Resources
Azure Machine Learning SDK | Metric | run.log(name, val) | Azure Machine Learning Portal UI | How to track experiments; azureml.core.Run class
Python printing/logging | Log | print(val); logging.info(message) | Driver logs, Azure Machine Learning designer | How to track experiments; Python logging
OpenCensus Python | Log | logger.addHandler(AzureLogHandler()); logging.log(level, message) | Application Insights - traces | Debug pipelines in Application Insights; OpenCensus Azure Monitor Exporters; Python logging cookbook

Logging options example

import argparse
import logging

from azureml.core.run import Run
from opencensus.ext.azure.log_exporter import AzureLogHandler

# Assumption: the log level arrives as a --log_level script argument.
parser = argparse.ArgumentParser()
parser.add_argument("--log_level", default="INFO")
args, _ = parser.parse_known_args()

run = Run.get_context()

# Azure ML scalar value logging
run.log("scalar_value", 0.95)

# Python print statement
print("I am a python print statement, I will be sent to the driver logs.")

# Initialize Python logger
logger = logging.getLogger(__name__)
logger.setLevel(args.log_level)

# Plain Python logging statements
logger.debug("I am a plain debug statement, I will be sent to the driver logs.")
logger.info("I am a plain info statement, I will be sent to the driver logs.")

# Attach the OpenCensus handler; later statements also go to Application Insights
handler = AzureLogHandler(connection_string='<connection string>')
logger.addHandler(handler)

# Python logging with OpenCensus AzureLogHandler
logger.warning("I am an OpenCensus warning statement, find me in Application Insights!")
# Custom dimensions are passed through `extra` under the 'custom_dimensions' key
logger.error("I am an OpenCensus error statement with custom dimensions",
             extra={'custom_dimensions': {'step_id': run.id}})
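The effect of the log level in the example above can be tried locally without any Azure packages. The following is a minimal sketch using only the standard library; the logger name `train_step`, the message format, and the in-memory buffer are assumptions chosen for the demonstration. In a step script you would more likely pass `sys.stdout` as the stream.

```python
import io
import logging


def build_logger(name, level, stream):
    """Return a logger that writes records at or above `level` to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Keep the demo output isolated from any handlers on the root logger.
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
step_logger = build_logger("train_step", "INFO", buffer)
step_logger.debug("hidden at INFO level")   # filtered out by the logger level
step_logger.info("loaded %d rows", 128)     # written to the stream
captured = buffer.getvalue()
```

Only the info record reaches the stream, which shows how a level passed in as a script argument controls how verbose the resulting log is.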

Azure Machine Learning designer (preview)

This section provides an overview of how to troubleshoot pipelines in the designer. For pipelines created in the designer, you can find the 70_driver_log file in either the authoring page or the pipeline run detail page.

Enable logging for real-time endpoints

To troubleshoot and debug real-time endpoints in the designer, you must enable Application Insights logging using the SDK. Logging lets you troubleshoot and debug model deployment and usage issues. For more information, see Logging for deployed models.

Get logs from the authoring page

When you submit a pipeline run and stay in the authoring page, you can find the log files generated for each module as each module finishes running.

  1. Select a module that has finished running in the authoring canvas.

  2. In the right pane of the module, go to the Outputs + logs tab.

  3. Expand the right pane, and select 70_driver_log.txt to view the file in the browser. You can also download logs locally.

    [Image: Expanded output pane in the designer]

Get logs from pipeline runs

You can also find the log files for specific runs in the pipeline run detail page, which can be found in either the Pipelines or Experiments section of the studio.

  1. Select a pipeline run created in the designer.

    [Image: Pipeline run page]

  2. Select a module in the preview pane.

  3. In the right pane of the module, go to the Outputs + logs tab.

  4. Expand the right pane to view the 70_driver_log.txt file in the browser, or select the file to download the logs locally.

Important

To update a pipeline from the pipeline run details page, you must clone the pipeline run to a new pipeline draft. A pipeline run is a snapshot of the pipeline. It's similar to a log file, and cannot be altered.

Application Insights

For more information on using the OpenCensus Python library in this manner, see this guide: Debug and troubleshoot machine learning pipelines in Application Insights.

Visual Studio Code

In some cases, you may need to interactively debug the Python code used in your ML pipeline. By using Visual Studio Code (VS Code) and debugpy, you can attach to the code as it runs in the training environment. For more information, visit the interactive debugging in VS Code guide.
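As a rough sketch of what attaching with debugpy can look like inside a step script, the snippet below waits for a debugger only when a flag is passed. The argument names `--remote_debug` and `--debug_port`, the function name, and the port 5678 are assumptions made for illustration; the linked guide describes the supported setup, including how to reach the compute from VS Code.

```python
import argparse


def wait_for_debugger_if_requested(argv):
    """Block until a debugger attaches, but only when --remote_debug is passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--remote_debug", action="store_true")
    parser.add_argument("--debug_port", type=int, default=5678)
    args, _ = parser.parse_known_args(argv)
    if not args.remote_debug:
        return False

    # Imported lazily so that normal runs don't require the debugpy package.
    import debugpy

    debugpy.listen(("0.0.0.0", args.debug_port))
    print(f"Waiting for debugger attach on port {args.debug_port}", flush=True)
    debugpy.wait_for_client()
    return True


# Without the flag, the script continues immediately.
attached = wait_for_debugger_if_requested([])
```

Guarding the wait behind a flag lets the same script run unattended in regular pipeline runs and pause for a debugger only when you ask for it.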

Next steps