Start, monitor, and cancel training runs in Python

The Azure Machine Learning SDK for Python, Machine Learning CLI, and Azure Machine Learning studio provide various methods to monitor, organize, and manage your runs for training and experimentation.

This article shows examples of the following tasks:

  • Monitor run performance.
  • Monitor the run status by email notification.
  • Tag and find runs.
  • Add a run description.
  • Search runs.
  • Cancel or fail runs.
  • Create child runs.

Tip

If you're looking for information on monitoring the Azure Machine Learning service and associated Azure services, see How to monitor Azure Machine Learning. If you're looking for information on monitoring models deployed as web services or IoT Edge modules, see Collect model data and Monitor with Application Insights.

Prerequisites

You'll need the following items:

Monitor run performance

  • Start a run and its logging process

    1. Set up your experiment by importing the Workspace, Experiment, Run, and ScriptRunConfig classes from the azureml.core package.

      import azureml.core
      from azureml.core import Workspace, Experiment, Run
      from azureml.core import ScriptRunConfig
      
      ws = Workspace.from_config()
      exp = Experiment(workspace=ws, name="explore-runs")
      
    2. Start a run and its logging process with the start_logging() method.

      notebook_run = exp.start_logging()
      notebook_run.log(name="message", value="Hello from run!")
      
  • Monitor the status of a run

    • Get the status of a run with the get_status() method.

      print(notebook_run.get_status())
      
    • To get the run ID, execution time, and additional details about the run, use the get_details() method.

      print(notebook_run.get_details())
      
    • When your run finishes successfully, use the complete() method to mark it as completed.

      notebook_run.complete()
      print(notebook_run.get_status())
      
    • If you use Python's with...as statement, the run automatically marks itself as completed when the with block exits. You don't need to mark the run as completed manually.

      with exp.start_logging() as notebook_run:
          notebook_run.log(name="message", value="Hello from run!")
          print(notebook_run.get_status())
      
      print(notebook_run.get_status())
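For code that can't be wrapped in a with...as block, a small helper can give a similar guarantee that the run always reaches a final state. The following is a minimal sketch, not part of the SDK: run_with_status and its work argument are hypothetical names, and the only run methods it relies on are complete() and fail().

```python
def run_with_status(run, work):
    """Call work(run), then mark the run as completed or failed.

    `run` is any object exposing complete() and fail(), such as the
    object returned by exp.start_logging(). `work` is a hypothetical
    callable that takes the run and does the actual logging or training.
    """
    try:
        result = work(run)
    except Exception as error:
        # Record why the run failed, then let the caller see the exception.
        run.fail(error_details=str(error))
        raise
    # No exception was raised, so mark the run as completed.
    run.complete()
    return result
```

A call might look like run_with_status(exp.start_logging(), train), where train is your own function that accepts the run.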
      

Run description

You can add a run description to give a run more context and information. You can also search on these descriptions from the runs list and add the run description as a column in the runs list.

Navigate to the Run Details page for your run and select the edit or pencil icon to add, edit, or delete descriptions for your run. To persist the changes to the runs list, save the changes to your existing Custom View or a new Custom View. Run descriptions support Markdown format, which allows images to be embedded and deep linking, as shown below.

Screenshot: Create a run description

Tag and find runs

In Azure Machine Learning, you can use properties and tags to help organize and query your runs for important information.

  • Add properties and tags

    To add searchable metadata to your runs, use the add_properties() method. For example, the following code adds the "author" property to local_run, a run object such as the one returned by exp.submit():

    local_run.add_properties({"author":"azureml-user"})
    print(local_run.get_properties())
    

    Properties are immutable, so they create a permanent record for auditing purposes. The following code example results in an error, because the preceding code already set the "author" property to "azureml-user":

    try:
        local_run.add_properties({"author":"different-user"})
    except Exception as e:
        print(e)
    

    Unlike properties, tags are mutable. To add searchable and meaningful information for consumers of your experiment, use the tag() method.

    local_run.tag("quality", "great run")
    print(local_run.get_tags())
    
    local_run.tag("quality", "fantastic run")
    print(local_run.get_tags())
    

    You can also add simple string tags. When these tags appear in the tag dictionary as keys, they have a value of None.

    local_run.tag("worth another look")
    print(local_run.get_tags())
    
  • Query properties and tags

    You can query runs within an experiment to return a list of runs that match specific properties and tags.

    list(exp.get_runs(properties={"author":"azureml-user"},tags={"quality":"fantastic run"}))
    list(exp.get_runs(properties={"author":"azureml-user"},tags="worth another look"))
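Beyond exact-match queries, tags can also be used to group runs on the client side. The sketch below is illustrative only: group_run_ids_by_tag is a hypothetical helper, and it relies on nothing beyond get_runs() on the experiment plus get_tags() and the id attribute on each run.

```python
from collections import defaultdict


def group_run_ids_by_tag(experiment, tag_key):
    """Return a dict that maps each value of `tag_key` to a list of run IDs.

    Runs that don't carry the tag are skipped. `experiment` is expected to
    behave like the exp object used above (it must provide get_runs()).
    """
    groups = defaultdict(list)
    for run in experiment.get_runs():
        tags = run.get_tags()
        if tag_key in tags:
            groups[tags[tag_key]].append(run.id)
    return dict(groups)
```

For example, group_run_ids_by_tag(exp, "quality") would collect the IDs of runs tagged "great run" and "fantastic run" under separate keys. Because every run in the experiment is fetched, this approach suits small experiments better than large ones.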
    

Cancel or fail runs

If you notice a mistake or if your run is taking too long to finish, you can cancel the run.

To cancel a run using the SDK, use the cancel() method:

src = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')
local_run = exp.submit(src)
print(local_run.get_status())

local_run.cancel()
print(local_run.get_status())

If your run finishes, but it contains an error (for example, the incorrect training script was used), you can use the fail() method to mark it as failed.

local_run = exp.submit(src)
local_run.fail()
print(local_run.get_status())
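The "taking too long" case can be automated by polling the status and canceling once a deadline passes. This is a minimal sketch under stated assumptions: cancel_if_too_slow is a hypothetical helper, and the set of final status strings ("Completed", "Failed", "Canceled") is an assumption about what get_status() returns, so adjust it to what you observe in your workspace.

```python
import time

# Assumed final states; a run in any of these is no longer executing.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def cancel_if_too_slow(run, timeout_seconds, poll_seconds=30):
    """Poll run.get_status() and call run.cancel() once the timeout expires.

    Returns the last status that was read from the run.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            # The run is still active past the deadline, so request cancellation.
            run.cancel()
            return run.get_status()
        time.sleep(poll_seconds)
```

With the run submitted above, cancel_if_too_slow(local_run, timeout_seconds=600) would give the run ten minutes before canceling it.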

Create child runs

Create child runs to group together related runs, such as for different hyperparameter-tuning iterations.

Note

Child runs can only be created using the SDK.

This code example uses the hello_with_children.py script to create a batch of five child runs from within a submitted run by using the child_run() method:

# In a Jupyter notebook, `!more hello_with_children.py` prints the script's contents.
src = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

# Submit the script and stream its output until the run finishes.
local_run = exp.submit(src)
local_run.wait_for_completion(show_output=True)
print(local_run.get_status())

# Create five child runs interactively; each one logs its own index.
with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)

Note

As they move out of scope, child runs are automatically marked as completed.

To create many child runs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.
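As an illustration of the batch approach, the sketch below creates one child run per value with a single create_children() call and logs one value to each. The helper name log_in_children and its parameters are hypothetical. It also assumes that runs created this way are not closed automatically, which is why each child is completed explicitly.

```python
def log_in_children(parent_run, metric_name, values):
    """Create one child run per value in a single batch and log that value.

    `parent_run` is expected to provide create_children(count=...), as the
    SDK's Run class does. Returns the list of child runs.
    """
    values = list(values)
    if not values:
        return []
    # One batched call instead of one creation call per child.
    children = parent_run.create_children(count=len(values))
    for child, value in zip(children, values):
        child.log(name=metric_name, value=value)
        # These children are not managed by a with block, so close them here.
        child.complete()
    return children
```

Inside a submitted script, this could be called as log_in_children(Run.get_context(), "Hello from child run", range(5)).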

Submit child runs

Child runs can also be submitted from a parent run. This allows you to create hierarchies of parent and child runs. You cannot create a parentless child run: even if the parent run does nothing but launch child runs, it's still necessary to create the hierarchy. The statuses of all runs are independent: a parent can be in the "Completed" successful state even if one or more child runs were canceled or failed.

You may want your child runs to use a different run configuration than the parent run. For instance, you might use a less powerful, CPU-based configuration for the parent, while using GPU-based configurations for the children. Another common need is to pass each child different arguments and data. To customize a child run, create a ScriptRunConfig object for the child run. The following code:

  • Retrieves a compute resource named "gpu-cluster" from the workspace ws
  • Iterates over different argument values to be passed to the children's ScriptRunConfig objects
  • Creates and submits a new child run, using the custom compute resource and argument
  • Blocks until all of the child runs complete

# parent.py
# This script controls the launching of child scripts
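from azureml.core import Run, ScriptRunConfig

run = Run.get_context()

# Get the workspace from the current run so that ws is defined inside this script
ws = run.experiment.workspace
compute_target = ws.compute_targets["gpu-cluster"]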

child_args = ['Apple', 'Banana', 'Orange']
for arg in child_args: 
    run.log('Status', f'Launching {arg}')
    child_config = ScriptRunConfig(source_directory=".", script='child.py', arguments=['--fruit', arg], compute_target=compute_target)
    # Starts the run asynchronously
    run.submit_child(child_config)

# Experiment will "complete" successfully at this point. 
# Instead of returning immediately, block until child runs complete

for child in run.get_children():
    child.wait_for_completion()

To create many child runs with identical configurations, arguments, and inputs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.

Within a child run, you can view the parent run ID:

# In child run script
from azureml.core import Run

child_run = Run.get_context()
print(child_run.parent.id)

Query child runs

To query the child runs of a specific parent, use the get_children() method. The recursive=True argument allows you to query a nested tree of children and grandchildren.

# Wrap the result in list() so the child runs themselves are printed
print(list(parent_run.get_children()))
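Because a parent can be "Completed" while some children failed or were canceled, it can help to tally the statuses of all descendants. The following is a rough sketch: summarize_children is a hypothetical helper that uses only get_children(recursive=...) and get_status().

```python
from collections import Counter


def summarize_children(parent_run, recursive=True):
    """Count the descendants of `parent_run` by status string.

    With recursive=True, grandchildren and deeper descendants are included.
    """
    return Counter(
        child.get_status()
        for child in parent_run.get_children(recursive=recursive)
    )
```

Calling summarize_children(parent_run) returns a Counter keyed by status, which makes it easy to spot a failed or canceled child under a parent that reports success.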

Log to parent or root run

You can use the Run.parent field to access the run that launched the current child run. A common use case is consolidating log results in a single place. Note that child runs execute asynchronously, and there is no guarantee of ordering or synchronization beyond the ability of the parent to wait for its child runs to complete.

# In a child (or even grandchild) run
from azureml.core import Run

def root_run(run: Run) -> Run:
    # Walk up the parent chain until reaching a run that has no parent
    if run.parent is None:
        return run
    return root_run(run.parent)

current_child_run = Run.get_context()
root_run(current_child_run).log("MyMetric", f"Data from child run {current_child_run.id}")

Example notebooks

The following notebooks demonstrate the concepts in this article:

Next steps