Debug and troubleshoot ParallelRunStep

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, you learn how to debug and troubleshoot the ParallelRunStep class from the Azure Machine Learning SDK.

Testing scripts locally

For machine learning pipelines, see the Testing scripts locally section. A ParallelRunStep runs as a step in an ML pipeline, so the same guidance applies to both.

Debugging scripts from remote context

The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the machine learning pipelines section on debugging scripts from a remote context. The information in that section also applies to a ParallelRunStep.

For example, the log file 70_driver_log.txt contains information from the controller that launches the ParallelRunStep code.

Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:

  • ~/logs/overview.txt: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

  • ~/logs/sys/master.txt: This file provides the principal node (also known as the orchestrator) view of the running job, including task creation, progress monitoring, and the run result.

Logs generated from the entry script, using the EntryScript helper and print statements, are found in the following files:

  • ~/logs/user/<ip_address>/<node_name>.log.txt: These files are the logs written from entry_script using the EntryScript helper. They also contain print statements (stdout) from entry_script.

For a concise understanding of errors in your script, see:

  • ~/logs/user/error.txt: This file attempts to summarize the errors in your script.

For more information on errors in your script, see:

  • ~/logs/user/error/: Contains all errors thrown and full stack traces, organized by node.

When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs are in the sys/node folder, grouped by worker node:

  • ~/logs/sys/node/<node_name>.txt: This file provides detailed information about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:

    • The IP address and the PID of the worker process.
    • The total number of items, the count of successfully processed items, and the count of failed items.
    • The start time, duration, process time, and run method time.

You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at ~/logs/sys/perf/overview.csv. Information about each process is available under ~/logs/sys/processes.csv.
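
To analyze these CSVs, a minimal sketch using pandas (assuming you've downloaded the run's logs folder locally; the exact column names vary by SDK version):

import pandas as pd

# Load the per-node resource-usage summary from the downloaded logs.
perf = pd.read_csv("logs/sys/perf/overview.csv")
print(perf.head())      # peek at the first rows
print(perf.describe())  # quick summary statistics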

How do I log from my user script from a remote context?

ParallelRunStep may run multiple processes on one node based on process_count_per_node. To organize the logs from each process on a node and combine the print and log statements, we recommend using the ParallelRunStep logger as shown below. You get a logger from EntryScript, and the logs show up in the logs/user folder in the portal.

A sample entry script using the logger:

from azureml_user.parallel_run import EntryScript

def init():
    """Initialize the node."""
    entry_script = EntryScript()
    logger = entry_script.logger
    logger.debug("This will show up in files under logs/user on the Azure portal.")


def run(mini_batch):
    """Accept a mini-batch and return it unchanged."""
    # EntryScript is a singleton, so this returns the same instance as in init().
    entry_script = EntryScript()
    logger = entry_script.logger
    logger.debug(f"{__file__}: {mini_batch}.")
    ...

    return mini_batch
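
If run() can raise, consider logging context before re-raising so the stack trace lands under logs/user/error/ alongside useful detail. A minimal sketch (process() is a hypothetical placeholder for your scoring logic):

from azureml_user.parallel_run import EntryScript

def run(mini_batch):
    logger = EntryScript().logger
    results = []
    for item in mini_batch:
        try:
            results.append(process(item))  # process() is a placeholder for your scoring logic
        except Exception:
            logger.exception(f"Failed to process {item}")
            raise  # the full stack trace ends up under logs/user/error/
    return results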

How can I pass a side input, such as a file or files containing a lookup table, to all my workers?

You can pass reference data to your script using the side_inputs parameter of ParallelRunStep. All datasets provided as side_inputs are mounted on each worker node, and you can get the mounted location by passing an argument.

Construct a Dataset containing the reference data and register it with your workspace, then pass it to the side_inputs parameter of your ParallelRunStep. Additionally, you can add its path in the arguments section to easily access its mounted path.
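
If the reference data isn't registered yet, a minimal sketch of creating and registering it (the workspace ws, the default datastore, and the ./labels source folder are assumptions):

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload the lookup files and build a FileDataset from them.
datastore.upload(src_dir="./labels", target_path="labels")
label_ds = Dataset.File.from_files(path=(datastore, "labels/*"))
label_ds = label_ds.register(workspace=ws, name="labels", create_new_version=True)

With label_ds registered, wire it into the step: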

label_config = label_ds.as_named_input("labels_input")
batch_score_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[input_images.as_named_input("input_images")],
    output=output_dir,
    arguments=["--labels_dir", label_config],
    side_inputs=[label_config],
    parallel_run_config=parallel_run_config,
)

After that, you can access it in your inference script (for example, in your init() method) as follows:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--labels_dir', dest="labels_dir", required=True)
args, _ = parser.parse_known_args()

labels_path = args.labels_dir
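
From there, you might load the lookup table once per process in init(); a short sketch (the labels.csv file name inside the mounted folder is an assumption):

import os
import pandas as pd

# Hypothetical: read the mounted lookup table; labels.csv is an assumed file name.
labels = pd.read_csv(os.path.join(labels_path, "labels.csv"))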

Next steps

  • See the SDK reference for help with the azureml-pipeline-steps package. View the reference documentation for the ParallelRunStep class.

  • Follow the advanced tutorial on using pipelines with ParallelRunStep. The tutorial shows how to pass another file as a side input.