Troubleshooting machine learning pipelines

In this article, you learn how to troubleshoot when you get errors running a machine learning pipeline in the Azure Machine Learning SDK and Azure Machine Learning designer.

Troubleshooting tips

The following table contains common problems that occur during pipeline development, with potential solutions.

  • Problem: Unable to pass data to the PipelineData directory. Possible solution: Ensure you have created a directory in the script that corresponds to where your pipeline expects the step output data. In most cases, an input argument defines the output directory, and you then create the directory explicitly with os.makedirs(args.output_dir, exist_ok=True). See the tutorial for a scoring script example that shows this design pattern, and the sketch after this table.
  • Problem: Dependency bugs. Possible solution: If you see dependency errors in your remote pipeline that did not occur when testing locally, confirm that your remote environment's dependencies and versions match those in your test environment. (See Environment building, caching, and reuse.)
  • Problem: Ambiguous errors with compute targets. Possible solution: Try deleting and re-creating the compute target. Re-creating a compute target is quick and can solve some transient issues.
  • Problem: Pipeline not reusing steps. Possible solution: Step reuse is enabled by default, but make sure you haven't disabled it in a pipeline step. If reuse is disabled, the allow_reuse parameter on the step is set to False.
  • Problem: Pipeline is rerunning unnecessarily. Possible solution: To ensure that steps rerun only when their underlying data or scripts change, decouple the source-code directories for each step. If you use the same source directory for multiple steps, you may experience unnecessary reruns. Use the source_directory parameter on a pipeline step object to point to an isolated directory for that step, and make sure you aren't using the same source_directory path for multiple steps.
  • Problem: Step slowing down over training epochs or other looping behavior. Possible solution: Try switching any file writes, including logging, from as_mount() to as_upload(). The mount mode uses a remote virtualized file system and uploads the entire file each time it is appended to.
  • Problem: Compute target takes a long time to start. Possible solution: Docker images for compute targets are loaded from Azure Container Registry (ACR). By default, Azure Machine Learning creates an ACR that uses the Basic service tier. Changing the workspace ACR to the Standard or Premium tier may reduce the time it takes to build and load images. For more information, see Azure Container Registry service tiers.
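
The following is a minimal sketch of the output-directory pattern mentioned in the first row above; the --output_dir argument name and the output file name are illustrative, not part of any required API.

# Minimal sketch of a step script that creates its own output directory.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", type=str, help="Directory where the pipeline expects this step's output")
args = parser.parse_args()

# Create the directory explicitly before writing anything into it
os.makedirs(args.output_dir, exist_ok=True)

with open(os.path.join(args.output_dir, "results.txt"), "w") as f:
    f.write("step output goes here")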

Authentication errors

If you perform a management operation on a compute target from a remote job, you will receive one of the following errors:

{"code":"Unauthorized","statusCode":401,"message":"Unauthorized","details":[{"code":"InvalidOrExpiredToken","message":"The request token was either invalid or expired. Please try again with a valid token."}]}
{"error":{"code":"AuthenticationFailed","message":"Authentication failed."}}

For example, you will receive an error if you try to create or attach a compute target from an ML pipeline that is submitted for remote execution.
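
Such management operations are typically performed in the local control script before the pipeline is submitted, rather than inside a remotely executed step. A minimal sketch under that assumption follows; the cluster name and VM size are illustrative.

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

cluster_name = "cpu-cluster"  # illustrative name
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
else:
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2", max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)

# Pass compute_target into your pipeline step configuration; do not create or
# attach compute from inside a remotely executed step.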

ParallelRunStep 进行故障排除Troubleshooting ParallelRunStep

The script for a ParallelRunStep must contain two functions:

  • init(): Use this function for any costly or common preparation needed for later inference. For example, use it to load the model into a global object. This function is called only once, at the beginning of the process.
  • run(mini_batch): This function runs for each mini_batch instance.
    • mini_batch: ParallelRunStep invokes the run method and passes either a list or a pandas DataFrame as an argument to the method. Each entry in mini_batch is a file path if the input is a FileDataset, or a pandas DataFrame if the input is a TabularDataset.
    • response: The run() method should return a pandas DataFrame or an array. For the append_row output_action, these returned elements are appended to the common output file. For summary_only, the contents of the elements are ignored. For all output actions, each returned output element indicates one successful run of an input element in the input mini-batch. Make sure that enough data is included in the run result to map the input to the run output. Run output is written to the output file and is not guaranteed to be in order, so include a key in the output that lets you map it back to the input.
%%writefile digit_identification.py
# Snippets from a sample script.
# Refer to the accompanying digit_identification.py
# (https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run)
# for the implementation script.

import os
import numpy as np
import tensorflow as tf
from PIL import Image
from azureml.core import Model


def init():
    global g_tf_sess

    # Pull down the model from the workspace
    model_path = Model.get_model_path("mnist")

    # Construct a graph to execute
    tf.reset_default_graph()
    saver = tf.train.import_meta_graph(os.path.join(model_path, 'mnist-tf.model.meta'))
    g_tf_sess = tf.Session()
    saver.restore(g_tf_sess, os.path.join(model_path, 'mnist-tf.model'))


def run(mini_batch):
    print(f'run method start: {__file__}, run({mini_batch})')
    resultList = []
    in_tensor = g_tf_sess.graph.get_tensor_by_name("network/X:0")
    output = g_tf_sess.graph.get_tensor_by_name("network/output/MatMul:0")

    for image in mini_batch:
        # Prepare each image
        data = Image.open(image)
        np_im = np.array(data).reshape((1, 784))
        # Perform inference
        inference_result = output.eval(feed_dict={in_tensor: np_im}, session=g_tf_sess)
        # Find the best probability, and add it to the result list
        best_result = np.argmax(inference_result)
        resultList.append("{}: {}".format(os.path.basename(image), best_result))

    return resultList

If you have another file or folder in the same directory as your inference script, you can reference it by resolving the directory that contains the script:

script_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(script_dir, "<file_name>")

Parameters for ParallelRunConfig

ParallelRunConfig is the major configuration for a ParallelRunStep instance within an Azure Machine Learning pipeline. You use it to wrap your script and configure the necessary parameters, including all of the following entries:

  • entry_script: A user script, as a local file path, that will be run in parallel on multiple nodes. If source_directory is present, use a relative path. Otherwise, use any path that's accessible on the machine.
  • mini_batch_size: The size of the mini-batch passed to a single run() call. (Optional; the default value is 10 files for FileDataset and 1MB for TabularDataset.)
    • For FileDataset, it's the number of files, with a minimum value of 1. You can combine multiple files into one mini-batch.
    • For TabularDataset, it's the size of the data. Example values are 1024, 1024KB, 10MB, and 1GB. The recommended value is 1MB. A mini-batch from a TabularDataset never crosses file boundaries. For example, suppose you have .csv files of various sizes, where the smallest file is 100 KB and the largest is 10 MB. If you set mini_batch_size = 1MB, files smaller than 1 MB are treated as one mini-batch, and files larger than 1 MB are split into multiple mini-batches.
  • error_threshold: The number of record failures for TabularDataset and file failures for FileDataset that should be ignored during processing. If the error count for the entire input goes above this value, the job is aborted. The error threshold is for the entire input, not for individual mini-batches sent to the run() method. The range is [-1, int.max]. A value of -1 indicates that all failures should be ignored during processing.
  • output_action: One of the following values indicating how the output is organized:
    • summary_only: The user script stores the output. ParallelRunStep uses the output only for the error threshold calculation.
    • append_row: For all inputs, only one file is created in the output folder, and all outputs are appended to it, separated by line.
  • append_row_file_name: The output file name for the append_row output_action (optional; the default value is parallel_run_step.txt).
  • source_directory: Path to the folder that contains all files to execute on the compute target (optional).
  • compute_target: Only AmlCompute is supported.
  • node_count: The number of compute nodes used to run the user script.
  • process_count_per_node: The number of processes per node. Best practice is to set this to the number of GPUs or CPUs one node has (optional; the default value is 1).
  • environment: The Python environment definition. You can configure it to use an existing Python environment or to set up a temporary environment. The definition is also responsible for setting the required application dependencies (optional).
  • logging_level: Log verbosity. Values in increasing verbosity are WARNING, INFO, and DEBUG (optional; the default value is INFO).
  • run_invocation_timeout: The run() method invocation timeout in seconds (optional; the default value is 60).
  • run_max_try: The maximum number of tries of run() for a mini-batch. A run() fails if an exception is thrown, or if nothing is returned when run_invocation_timeout is reached (optional; the default value is 3).

You can specify mini_batch_size, node_count, process_count_per_node, logging_level, run_invocation_timeout, and run_max_try as PipelineParameter, so that when you resubmit a pipeline run, you can fine-tune the parameter values. In the example below, PipelineParameter is used for mini_batch_size and process_count_per_node, and you can change these values when you resubmit a run later.
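
A minimal sketch of such a configuration follows; scripts_folder, batch_env, and compute_target are assumed to have been defined earlier in your script, and the parameter values are illustrative.

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,          # assumed: folder containing digit_identification.py
    entry_script="digit_identification.py",
    mini_batch_size=PipelineParameter(name="batch_size_param", default_value="5"),
    error_threshold=10,
    output_action="append_row",
    append_row_file_name="parallel_run_step.txt",
    environment=batch_env,                    # assumed: an azureml.core.Environment
    compute_target=compute_target,            # assumed: an existing AmlCompute target
    process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2,
    run_invocation_timeout=600
)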

Parameters for creating the ParallelRunStep

Create the ParallelRunStep by using the script, environment configuration, and parameters. Specify the compute target that you already attached to your workspace as the target of execution for your inference script. Use ParallelRunStep to create the batch inference pipeline step, which takes all of the following parameters:

  • name: The name of the step, with the following naming restrictions: unique, 3-32 characters, and regex ^[a-z]([-a-z0-9]*[a-z0-9])?$.
  • parallel_run_config: A ParallelRunConfig object, as defined earlier.
  • inputs: One or more single-typed Azure Machine Learning datasets to be partitioned for parallel processing.
  • side_inputs: One or more reference data or datasets used as side inputs, without needing to be partitioned.
  • output: An OutputFileDatasetConfig object that corresponds to the output directory.
  • arguments: A list of arguments passed to the user script. Use unknown_args to retrieve them in your entry script (optional).
  • allow_reuse: Whether the step should reuse previous results when run with the same settings/inputs. If this parameter is False, a new run is always generated for this step during pipeline execution. (Optional; the default value is True.)
from azureml.pipeline.steps import ParallelRunStep

parallelrun_step = ParallelRunStep(
    name="predict-digits-mnist",
    parallel_run_config=parallel_run_config,
    inputs=[input_mnist_ds_consumption],
    output=output_dir,
    allow_reuse=True
)

Debugging techniques

There are three major techniques for debugging pipelines:

  • Debug individual pipeline steps on your local computer
  • Use logging and Application Insights to isolate and diagnose the source of the problem
  • Attach a remote debugger to a pipeline running in Azure

Debug scripts locally

One of the most common failures in a pipeline is that the domain script does not run as intended, or contains runtime errors in the remote compute context that are difficult to debug.

Pipelines themselves cannot be run locally, but running the scripts in isolation on your local machine lets you debug faster because you don't have to wait for the compute and environment build process. Some development work is required to do this:

  • If your data is in a cloud datastore, you need to download the data and make it available to your script. Using a small sample of your data is a good way to cut down on runtime and quickly get feedback on script behavior (see the sketch after this list).
  • If you are attempting to simulate an intermediate pipeline step, you may need to manually build the object types that the particular script expects from the prior step.
  • You also need to define your own environment and replicate the dependencies defined in your remote compute environment.
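
For example, a small sample of a registered tabular dataset can be pulled down for local debugging. The following is a minimal sketch; the workspace configuration, dataset name, and sample size are illustrative.

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# "digits-dataset" is an illustrative name for a registered TabularDataset
dataset = Dataset.get_by_name(ws, name="digits-dataset")

# Take a small sample so local debugging iterations stay fast
sample_df = dataset.take(100).to_pandas_dataframe()
sample_df.to_csv("local_debug_sample.csv", index=False)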

Once you have a script set up to run in your local environment, it is much easier to do debugging tasks like:

  • Attaching a custom debug configuration
  • Pausing execution and inspecting object state
  • Catching type or logic errors that won't be exposed until runtime

Tip

Once you can verify that your script is running as expected, a good next step is to run the script in a single-step pipeline before attempting to run it in a pipeline with multiple steps.
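
A minimal sketch of a single-step pipeline built around an existing script follows; the script name, source directory, and experiment name are illustrative, and compute_target is assumed to be an existing compute target in your workspace.

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

single_step = PythonScriptStep(
    name="debug-my-script",
    script_name="my_script.py",        # illustrative script name
    source_directory="./scripts",      # illustrative source directory
    compute_target=compute_target,     # assumed: an existing compute target
    allow_reuse=False                  # force the step to run on every submission
)

pipeline = Pipeline(workspace=ws, steps=[single_step])
run = Experiment(ws, "debug-single-step").submit(pipeline)
run.wait_for_completion(show_output=True)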

Configure, write to, and review pipeline logs

Testing scripts locally is a great way to debug major code fragments and complex logic before you start building a pipeline, but at some point you will likely need to debug scripts during the actual pipeline run itself, especially when diagnosing behavior that occurs during the interaction between pipeline steps. We recommend liberal use of print() statements in your step scripts so that you can see object state and expected values during remote execution, similar to how you would debug JavaScript code.

Logging options and behavior

The following table provides information about different debug options for pipelines. It isn't an exhaustive list, because other options exist besides the Azure Machine Learning, Python, and OpenCensus ones shown here.

  • Library: Azure Machine Learning SDK. Type: Metric. Example: run.log(name, val). Destination: Azure Machine Learning portal UI. Resources: How to track experiments; azureml.core.Run class.
  • Library: Python printing/logging. Type: Log. Example: print(val), logging.info(message). Destination: Driver logs, Azure Machine Learning designer. Resources: How to track experiments; Python logging.
  • Library: OpenCensus Python. Type: Log. Example: logger.addHandler(AzureLogHandler()), logging.log(message). Destination: Application Insights - traces. Resources: Debug pipelines in Application Insights; OpenCensus Azure Monitor Exporters; Python logging cookbook.

Logging options example

import logging

from azureml.core.run import Run
from opencensus.ext.azure.log_exporter import AzureLogHandler

run = Run.get_context()

# Azure ML Scalar value logging
run.log("scalar_value", 0.95)

# Python print statement
print("I am a python print statement, I will be sent to the driver logs.")

# Initialize the Python logger
# (args.log_level is assumed to be parsed from the script's arguments, for example via argparse)
logger = logging.getLogger(__name__)
logger.setLevel(args.log_level)

# Plain python logging statements
logger.debug("I am a plain debug statement, I will be sent to the driver logs.")
logger.info("I am a plain info statement, I will be sent to the driver logs.")

handler = AzureLogHandler(connection_string='<connection string>')
logger.addHandler(handler)

# Python logging with OpenCensus AzureLogHandler
logger.warning("I am an OpenCensus warning statement, find me in Application Insights!")
logger.error("I am an OpenCensus error statement with custom dimensions", {'step_id': run.id})

Azure Machine Learning designer

For pipelines created in the designer, you can find the 70_driver_log file either on the authoring page or on the pipeline run detail page.

Enable logging for real-time endpoints

To troubleshoot and debug real-time endpoints in the designer, you must enable Application Insights logging by using the SDK. Logging lets you troubleshoot and debug model deployment and usage issues. For more information, see Logging for deployed models.
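
A minimal sketch of enabling Application Insights on an already deployed service follows; the service name is illustrative.

from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()

# "my-designer-endpoint" is an illustrative name for an existing real-time endpoint
service = Webservice(ws, name="my-designer-endpoint")
service.update(enable_app_insights=True)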

Get logs from the authoring page

When you submit a pipeline run and stay on the authoring page, you can find the log files generated for each module as each module finishes running.

  1. Select a module that has finished running in the authoring canvas.

  2. In the right pane of the module, go to the Outputs + logs tab.

  3. Expand the right pane, and select 70_driver_log.txt to view the file in the browser. You can also download the logs locally.

    (Screenshot: expanded output pane in the designer)

Get logs from pipeline runs

You can also find the log files for specific runs on the pipeline run detail page, which is located in either the Pipelines or Experiments section of the studio.

  1. Select a pipeline run created in the designer.

    (Screenshot: pipeline run page)

  2. Select a module in the preview pane.

  3. In the right pane of the module, go to the Outputs + logs tab.

  4. Expand the right pane to view the 70_driver_log.txt file in the browser, or select the file to download the logs locally.

Important

To update a pipeline from the pipeline run details page, you must clone the pipeline run to a new pipeline draft. A pipeline run is a snapshot of the pipeline. It's similar to a log file, and it can't be altered.

Application Insights

For more information on using the OpenCensus Python library in this manner, see this guide: Debug and troubleshoot machine learning pipelines in Application Insights.

Interactive debugging with Visual Studio Code

In some cases, you may need to interactively debug the Python code used in your ML pipeline. By using Visual Studio Code (VS Code) and debugpy, you can attach to the code as it runs in the training environment. For more information, visit the interactive debugging in VS Code guide.
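
A minimal sketch of the attach point inside a step script follows; it assumes debugpy is available in the step's environment, and the port number is illustrative.

import debugpy

# Listen on a port that your VS Code attach configuration targets
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for the VS Code debugger to attach...")
debugpy.wait_for_client()  # execution pauses here until the debugger attaches

debugpy.breakpoint()  # optional explicit breakpoint once attached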

Next steps