Start, monitor, and cancel training runs in Python

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

The Azure Machine Learning SDK for Python and the Machine Learning CLI provide various methods to monitor, organize, and manage your runs for training and experimentation.

This article shows examples of the following tasks:

  • Monitor run performance.
  • Cancel or fail runs.
  • Create child runs.
  • Tag and find runs.

Prerequisites

You'll need the following items:

Start a run and its logging process

Using the SDK

Set up your experiment by importing the Workspace, Experiment, Run, and ScriptRunConfig classes from the azureml.core package.

import azureml.core
from azureml.core import Workspace, Experiment, Run
from azureml.core import ScriptRunConfig

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="explore-runs")

Start a run and its logging process with the start_logging() method.

notebook_run = exp.start_logging()
notebook_run.log(name="message", value="Hello from run!")

Using the CLI

To start a run of your experiment, use the following steps:

  1. From a shell or command prompt, use the Azure CLI to authenticate to your Azure subscription:

    az login
    
  2. Attach a workspace configuration to the folder that contains your training script. Replace myworkspace with your Azure Machine Learning workspace. Replace myresourcegroup with the Azure resource group that contains your workspace:

    az ml folder attach -w myworkspace -g myresourcegroup
    

    This command creates a .azureml subdirectory that contains example runconfig and conda environment files. It also contains a config.json file that is used to communicate with your Azure Machine Learning workspace.

    For more information, see az ml folder attach.

  3. To start the run, use the following command. Specify the name of the runconfig file with the -c parameter. The name is the part of the file name before .runconfig when you look at the file system.

    az ml run submit-script -c sklearn -e testexperiment train.py
    

    Tip

    The az ml folder attach command created a .azureml subdirectory, which contains two example runconfig files.

    If you have a Python script that creates a run configuration object programmatically, you can use RunConfig.save() to save it as a runconfig file.

    For more information, see az ml run submit-script.

Using Azure Machine Learning studio

To submit a pipeline run in the designer (preview), use the following steps:

  1. Set a default compute target for your pipeline.

  2. Select Run at the top of the pipeline canvas.

  3. Select an Experiment to group your pipeline runs.

Monitor the status of a run

Using the SDK

Get the status of a run with the get_status() method.

print(notebook_run.get_status())

To get the run ID, execution time, and additional details about the run, use the get_details() method.

print(notebook_run.get_details())
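
The dictionary returned by get_details() can be long, so it may help to pull out only the fields you care about. The following is a minimal sketch, not part of the SDK: the helper name summarize_details is hypothetical, and the key names (runId, status, startTimeUtc, endTimeUtc) and the timestamp format are assumptions about the payload, which is why the code reads them defensively.

```python
from datetime import datetime

# Assumed timestamp layout; adjust if your payload differs.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize_details(details):
    """Extract the run ID, status, and duration from a details dictionary."""
    start = details.get("startTimeUtc")
    end = details.get("endTimeUtc")
    duration = None
    if start and end:
        try:
            delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
            duration = delta.total_seconds()
        except ValueError:
            # Leave the duration unset if the timestamps use another format.
            duration = None
    return {
        "run_id": details.get("runId"),
        "status": details.get("status"),
        "duration_seconds": duration,
    }


# Hypothetical payload for illustration only. With a real run you would call
# summarize_details(notebook_run.get_details()) instead.
sample = {
    "runId": "example-run-id",
    "status": "Completed",
    "startTimeUtc": "2020-01-01T10:00:00.000000Z",
    "endTimeUtc": "2020-01-01T10:02:30.500000Z",
}
summary = summarize_details(sample)
print(summary)
```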

When your run finishes successfully, use the complete() method to mark it as completed.

notebook_run.complete()
print(notebook_run.get_status())
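
If you want to react to a status change instead of printing the status once, you can poll get_status() in a loop. The sketch below assumes that "Completed", "Failed", and "Canceled" are the terminal status strings. The helper wait_until_finished and the FakeRun stand-in are hypothetical names used only so the example can be tried without a workspace; with a real run you would pass the run object itself.

```python
import time

# Assumed set of statuses after which a run no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_until_finished(run, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll run.get_status() until it is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = run.get_status()
    while status not in TERMINAL_STATUSES and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        status = run.get_status()
    return status


class FakeRun:
    """Offline stand-in that replays a scripted list of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_status(self):
        # Return the next scripted status, then keep repeating the last one.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


final_status = wait_until_finished(
    FakeRun(["Running", "Running", "Completed"]), poll_seconds=0.01
)
print(final_status)
```

The function returns the last status it saw, so a caller can tell a timeout (a non-terminal status) apart from a finished run.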

If you use Python's with...as design pattern, the run automatically marks itself as completed when it goes out of scope. You don't need to manually mark the run as completed.

with exp.start_logging() as notebook_run:
    notebook_run.log(name="message", value="Hello from run!")
    print(notebook_run.get_status())

print(notebook_run.get_status())

Using the CLI

  1. To view a list of runs for your experiment, use the following command. Replace experiment with the name of your experiment:

    az ml run list --experiment-name experiment
    

    This command returns a JSON document that lists information about runs for this experiment.

    For more information, see az ml experiment list.

  2. To view information on a specific run, use the following command. Replace runid with the ID of the run:

    az ml run show -r runid
    

    This command returns a JSON document that lists information about the run.

    For more information, see az ml run show.

Using Azure Machine Learning studio

To view the number of active runs for your experiment in the studio, use the following steps:

  1. Navigate to the Experiments section.

  2. Select an experiment.

    On the experiment page, you can see the number of active compute targets and the duration for each run.

  3. Select a specific run number.

  4. In the Logs tab, you can find diagnostic and error logs for your pipeline run.

Cancel or fail runs

If you notice a mistake or if your run is taking too long to finish, you can cancel the run.

Using the SDK

To cancel a run using the SDK, use the cancel() method:

run_config = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')
local_script_run = exp.submit(run_config)
print(local_script_run.get_status())

local_script_run.cancel()
print(local_script_run.get_status())
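
The "taking too long" case can be automated by giving a run a time budget and calling cancel() once the budget is spent. This is a minimal sketch under stated assumptions: cancel_if_over_budget and SlowFakeRun are hypothetical names, the terminal status strings are assumed, and the stand-in class exists only so the example runs offline. With a real submission you would pass local_script_run.

```python
import time


def cancel_if_over_budget(run, budget_seconds, poll_seconds=5.0):
    """Cancel the run if it is still active after budget_seconds.

    Returns True if cancel() was requested, False if the run finished first.
    """
    finished = {"Completed", "Failed", "Canceled"}  # assumed terminal statuses
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        if run.get_status() in finished:
            return False
        time.sleep(poll_seconds)
    # Check once more so a run that finished during the last sleep is kept.
    if run.get_status() in finished:
        return False
    run.cancel()
    return True


class SlowFakeRun:
    """Offline stand-in for a run that never finishes on its own."""

    def __init__(self):
        self.status = "Running"

    def get_status(self):
        return self.status

    def cancel(self):
        self.status = "Canceled"


slow_run = SlowFakeRun()
was_canceled = cancel_if_over_budget(slow_run, budget_seconds=0.05, poll_seconds=0.01)
print(was_canceled, slow_run.get_status())
```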

If your run finishes but contains an error (for example, the incorrect training script was used), you can use the fail() method to mark it as failed.

local_script_run = exp.submit(run_config)
local_script_run.fail()
print(local_script_run.get_status())

Using the CLI

To cancel a run using the CLI, use the following command. Replace runid with the ID of the run, workspace_name with the name of your workspace, and experiment_name with the name of your experiment:

az ml run cancel -r runid -w workspace_name -e experiment_name

For more information, see az ml run cancel.

Using Azure Machine Learning studio

To cancel a run in the studio, use the following steps:

  1. Go to the running pipeline in either the Experiments or Pipelines section.

  2. Select the pipeline run number you want to cancel.

  3. In the toolbar, select Cancel.

Create child runs

Create child runs to group together related runs, such as for different hyperparameter-tuning iterations.

Note

Child runs can only be created using the SDK.

This code example uses the hello_with_children.py script to create a batch of five child runs from within a submitted run by using the child_run() method:

!more hello_with_children.py
run_config = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_script_run = exp.submit(run_config)
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)

Note

As they move out of scope, child runs are automatically marked as completed.

To create many child runs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.
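
A batch creation might look like the sketch below. It assumes that create_children() accepts a count argument and returns a list of child runs, and that children created this way are completed explicitly because they aren't managed by a with block; check both assumptions against the SDK reference before relying on them. The helper log_values_in_children and the FakeParent and FakeChild classes are hypothetical and exist only so the example runs without a workspace.

```python
def log_values_in_children(parent_run, values):
    """Create one child run per value with a single create_children() call."""
    # Assumption: create_children(count=...) returns a list of child runs.
    children = parent_run.create_children(count=len(values))
    for child, value in zip(children, values):
        child.log(name="value", value=value)
        # Assumption: these children must be completed explicitly.
        child.complete()
    return children


class FakeChild:
    """Offline stand-in for a child run."""

    def __init__(self):
        self.logged = {}
        self.status = "Running"

    def log(self, name, value):
        self.logged[name] = value

    def complete(self):
        self.status = "Completed"


class FakeParent:
    """Offline stand-in for a parent run."""

    def create_children(self, count):
        return [FakeChild() for _ in range(count)]


children = log_values_in_children(FakeParent(), [0.1, 0.01, 0.001])
print([(child.logged["value"], child.status) for child in children])
```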

Submit child runs

Child runs can also be submitted from a parent run. This allows you to create hierarchies of parent and child runs, each running on different compute targets, connected by a common parent run ID.

Use the submit_child() method to submit a child run from within a parent run. To do this in the parent run script, get the run context and submit the child run using the submit_child() method of the context instance.

## In parent run script
parent_run = Run.get_context()
child_run_config = ScriptRunConfig(source_directory='.', script='child_script.py')
parent_run.submit_child(child_run_config)

Within a child run, you can view the parent run ID:

## In child run script
child_run = Run.get_context()
print(child_run.parent.id)

Query child runs

To query the child runs of a specific parent, use the get_children() method. The recursive=True argument allows you to query a nested tree of children and grandchildren.

print(parent_run.get_children())
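
Because get_children() yields run objects rather than a printable summary, a small helper can turn the nested tree into a flat list of IDs and statuses. In this sketch, list_descendants is a hypothetical helper and FakeRun is an offline stand-in that mimics only the id attribute, get_status(), and get_children(recursive=...); the order in which a real workspace returns children is not something this example guarantees.

```python
def list_descendants(parent_run):
    """Return (run ID, status) pairs for every child and grandchild."""
    return [
        (child.id, child.get_status())
        for child in parent_run.get_children(recursive=True)
    ]


class FakeRun:
    """Offline stand-in for a run with nested children."""

    def __init__(self, run_id, status, children=()):
        self.id = run_id
        self._status = status
        self._children = list(children)

    def get_status(self):
        return self._status

    def get_children(self, recursive=False):
        for child in self._children:
            yield child
            if recursive:
                yield from child.get_children(recursive=True)


tree = FakeRun(
    "parent",
    "Completed",
    [
        FakeRun("child-1", "Completed", [FakeRun("grandchild-1", "Failed")]),
        FakeRun("child-2", "Canceled"),
    ],
)
descendants = list_descendants(tree)
print(descendants)
```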

Tag and find runs

In Azure Machine Learning, you can use properties and tags to help organize and query your runs for important information.

Add properties and tags

Using the SDK

To add searchable metadata to your runs, use the add_properties() method. For example, the following code adds the "author" property to the run:

local_script_run.add_properties({"author":"azureml-user"})
print(local_script_run.get_properties())

Properties are immutable, so they create a permanent record for auditing purposes. The following code example results in an error, because we already added "azureml-user" as the "author" property value in the preceding code:

try:
    local_script_run.add_properties({"author":"different-user"})
except Exception as e:
    print(e)
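
One way to avoid that error is to check get_properties() first and add only the keys that aren't set yet. The helper add_missing_properties below is a hypothetical sketch, not an SDK function, and FakeRun is an offline stand-in that imitates the immutability described above by raising on a duplicate key.

```python
def add_missing_properties(run, properties):
    """Add only properties whose keys are not already set on the run.

    Returns the sorted list of keys that were skipped.
    """
    existing = run.get_properties()
    new_items = {key: value for key, value in properties.items() if key not in existing}
    if new_items:
        run.add_properties(new_items)
    return sorted(set(properties) - set(new_items))


class FakeRun:
    """Offline stand-in whose properties cannot be overwritten."""

    def __init__(self):
        self._properties = {}

    def get_properties(self):
        return dict(self._properties)

    def add_properties(self, properties):
        for key in properties:
            if key in self._properties:
                raise ValueError(f"Property '{key}' is already set")
        self._properties.update(properties)


run = FakeRun()
run.add_properties({"author": "azureml-user"})
skipped = add_missing_properties(run, {"author": "different-user", "reviewer": "someone"})
print(skipped, run.get_properties())
```

The existing "author" value is left untouched and reported as skipped, so the caller can decide whether a mismatch matters.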

Unlike properties, tags are mutable. To add searchable and meaningful information for consumers of your experiment, use the tag() method.

local_script_run.tag("quality", "great run")
print(local_script_run.get_tags())

local_script_run.tag("quality", "fantastic run")
print(local_script_run.get_tags())

You can also add simple string tags. When these tags appear in the tag dictionary as keys, they have a value of None.

local_script_run.tag("worth another look")
print(local_script_run.get_tags())

Using the CLI

Note

Using the CLI, you can only add or update tags.

To add or update a tag, use the following command:

az ml run update -r runid --add-tag quality='fantastic run'

For more information, see az ml run update.

Query properties and tags

You can query runs within an experiment to return a list of runs that match specific properties and tags.

Using the SDK

list(exp.get_runs(properties={"author":"azureml-user"},tags={"quality":"fantastic run"}))
list(exp.get_runs(properties={"author":"azureml-user"},tags="worth another look"))

Using the CLI

The Azure CLI supports JMESPath queries, which can be used to filter runs based on properties and tags. To use a JMESPath query with the Azure CLI, specify it with the --query parameter. The following examples show basic queries using properties and tags:

# list runs where the author property = 'azureml-user'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user']"
# list runs where the tag contains a key that starts with 'worth another look'
az ml run list --experiment-name experiment --query "[?tags.keys(@)[?starts_with(@, 'worth another look')]]"
# list runs where the author property = 'azureml-user' and the 'quality' tag equals 'fantastic run'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user' && tags.quality=='fantastic run']"

For more information on querying Azure CLI results, see Query Azure CLI command output.
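
If you call the CLI from a Python script, building the argument list in code keeps the JMESPath expression out of the shell's quoting rules. The helper build_run_list_command below is a hypothetical sketch: it only assembles the arguments for az ml run list with the --query parameter described above and doesn't escape single quotes inside the values. Passing the resulting list to subprocess.run is left to the caller.

```python
def build_run_list_command(experiment_name, author=None, quality=None):
    """Build the argument list for 'az ml run list' with an optional filter."""
    conditions = []
    if author is not None:
        conditions.append(f"properties.author=='{author}'")
    if quality is not None:
        conditions.append(f"tags.quality=='{quality}'")
    command = ["az", "ml", "run", "list", "--experiment-name", experiment_name]
    if conditions:
        # Combine the conditions into a single JMESPath filter expression.
        command += ["--query", "[?" + " && ".join(conditions) + "]"]
    return command


command = build_run_list_command(
    "experiment", author="azureml-user", quality="fantastic run"
)
print(command)
```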

Using Azure Machine Learning studio

  1. Navigate to the Pipelines section.

  2. Use the search bar to filter pipelines using tags, descriptions, experiment names, and submitter name.

Example notebooks

The following notebooks demonstrate the concepts in this article:

Next steps