Enable logging in ML training runs

The Azure Machine Learning Python SDK lets you log real-time information using both the default Python logging package and SDK-specific functionality. You can log locally and send logs to your workspace in the portal.

Logs can help you diagnose errors and warnings, or track training details such as parameters and model performance metrics. In this article, you learn how to enable logging in the following scenarios:

  • Interactive training sessions
  • Submitting training jobs using ScriptRunConfig
  • Python native logging settings
  • Logging from additional sources


This article shows you how to monitor the model training process. If you're interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Data types

You can log multiple data types, including scalar values, lists, tables, images, directories, and more. For more information, and Python code examples for different data types, see the Run class reference page.

Logging run metrics

Use the following methods in the logging APIs to influence the metrics visualizations. Note the service limits for these logged metrics.

| Logged value | Example code | Format in portal |
|---|---|---|
| Log an array of numeric values | run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]) | Single-variable line chart |
| Log a single numeric value with the same metric name used repeatedly (for example, from within a for loop) | for i in tqdm(range(-10, 10)): run.log(name='Sigmoid', value=1 / (1 + np.exp(-i))); angle = i / 2.0 | Single-variable line chart |
| Log a row with 2 numerical columns repeatedly | run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle)); sines['angle'].append(angle); sines['sine'].append(np.sin(angle)) | Two-variable line chart |
| Log a table with 2 numerical columns | run.log_table(name='Sine Wave', value=sines) | Two-variable line chart |
| Log an image | run.log_image(name='food', path='./breadpudding.jpg', plot=None, description='dessert') | Use this method to log an image file or a matplotlib plot to the run. These images are visible and comparable in the run record. |
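
The table rows above share state (the loop variable, angle, and the sines dictionary), so it can help to see them as one loop. The following is a minimal sketch, not the SDK's own sample: RecordingRun is a hypothetical stand-in that only mirrors the log, log_row, and log_table calls so the snippet runs without a workspace, and math replaces NumPy to keep it dependency-free. In a real script you would pass the object returned by Run.get_context() instead.

```python
import math


class RecordingRun:
    """Hypothetical stand-in for azureml.core.Run; it only records calls locally."""

    def __init__(self):
        self.records = []

    def log(self, name, value):
        self.records.append(("log", name, value))

    def log_row(self, name, **columns):
        self.records.append(("log_row", name, columns))

    def log_table(self, name, value):
        self.records.append(("log_table", name, value))


def log_waves(run):
    """Log a scalar series, a row series, and a final table from one loop."""
    sines = {"angle": [], "sine": []}
    for i in range(-10, 10):
        # Same metric name on every iteration: one single-variable series.
        run.log(name="Sigmoid", value=1 / (1 + math.exp(-i)))
        angle = i / 2.0
        # One row with two numeric columns per iteration: a two-variable series.
        run.log_row(name="Cosine Wave", angle=angle, cos=math.cos(angle))
        # Accumulate columns locally, then log them once as a table.
        sines["angle"].append(angle)
        sines["sine"].append(math.sin(angle))
    run.log_table(name="Sine Wave", value=sines)
    return sines


run = RecordingRun()
sines = log_waves(run)
```

The sketch issues 20 log calls and 20 log_row calls, followed by a single log_table call once the columns are complete.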

Logging with MLflow

Use MLFlowLogger to log metrics.

from azureml.core import Run
# connect to the workspace from within your running code
run = Run.get_context()
ws = run.experiment.workspace

# workspace has associated ml-flow-tracking-uri
mlflow_url = ws.get_mlflow_tracking_uri()

# Example: PyTorch Lightning
from pytorch_lightning.loggers import MLFlowLogger

mlf_logger = MLFlowLogger(experiment_name=run.experiment.name, tracking_uri=mlflow_url)
mlf_logger._run_id = run.id

Interactive logging session

Interactive logging sessions are typically used in notebook environments. The method Experiment.start_logging() starts an interactive logging session. Any metrics logged during the session are added to the run record in the experiment. The method run.complete() ends the session and marks the run as completed.
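
As a rough illustration of that lifecycle, the sketch below wraps the start_logging(), log(), and complete() calls in one helper. The helper name log_interactively and the metrics dictionary are assumptions made for this example; the experiment argument is expected to be an azureml.core.Experiment that you have already created. Calling run.fail() on error is a choice made here so a failed session is not left open, not something the article prescribes.

```python
def log_interactively(experiment, metrics):
    """Start an interactive run, log each metric, and close the run.

    `experiment` is assumed to be an azureml.core.Experiment instance;
    `metrics` is a plain dict of metric name to numeric value.
    """
    run = experiment.start_logging()
    try:
        for name, value in metrics.items():
            run.log(name, value)
    except Exception:
        # Mark the run as failed rather than leaving it open.
        run.fail()
        raise
    else:
        # End the session and mark the run as completed.
        run.complete()
    return run


# Example call from a notebook (the experiment object is created elsewhere):
# run = log_interactively(experiment, {"accuracy": 0.91, "loss": 0.27})
```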

ScriptRun logs

In this section, you learn how to add logging code inside runs that are configured with ScriptRunConfig. You can use the ScriptRunConfig class to encapsulate scripts and environments for repeatable runs. You can also use this option to show a visual Jupyter Notebooks widget for monitoring.

This example performs a parameter sweep over alpha values and captures the results using the run.log() method.
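
The training script in step 1 calls mylib.get_alphas(), a helper the article does not show. Based only on the script's comment ("list of numbers from 0.0 to 1.0 with a 0.05 interval"), a hypothetical version might look like the sketch below. Whether the endpoint 1.0 is included is an assumption; this version excludes it and returns 20 values.

```python
def get_alphas(start=0.0, stop=1.0, step=0.05):
    """Return evenly spaced alpha values in [start, stop).

    Hypothetical stand-in for mylib.get_alphas(); the real helper is not
    shown in the article. Values are rounded to 2 decimals to avoid
    floating-point noise such as 0.15000000000000002.
    """
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count)]


alphas = get_alphas()
```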

  1. Create a training script that includes the logging logic, train.py.

    # Copyright (c) Microsoft. All rights reserved.
    # Licensed under the MIT license.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from azureml.core.run import Run
    import os
    import numpy as np
    import mylib
    # sklearn.externals.joblib is removed in 0.23
    try:
       from sklearn.externals import joblib
    except ImportError:
       import joblib
    os.makedirs('./outputs', exist_ok=True)
    X, y = load_diabetes(return_X_y=True)
    run = Run.get_context()
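    # NOTE: the split arguments were truncated in the source; the test_size
    # and random_state values below are assumptions.
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=0)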
    data = {"train": {"X": X_train, "y": y_train},
          "test": {"X": X_test, "y": y_test}}
    # list of numbers from 0.0 to 1.0 with a 0.05 interval
    alphas = mylib.get_alphas()
    for alpha in alphas:
       # Use Ridge algorithm to create a regression model
       reg = Ridge(alpha=alpha)
       reg.fit(data["train"]["X"], data["train"]["y"])
       preds = reg.predict(data["test"]["X"])
       mse = mean_squared_error(preds, data["test"]["y"])
       run.log('alpha', alpha)
       run.log('mse', mse)
       model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
       # save model in the outputs folder so it automatically gets uploaded
       joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))
       print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))
  2. Submit the train.py script to run in a user-managed environment. The entire script folder is submitted for training.

    from azureml.core import ScriptRunConfig
    src = ScriptRunConfig(source_directory='./', script='train.py')
    src.run_config.environment = user_managed_env
    run = exp.submit(src)

    The show_output parameter turns on verbose logging, which lets you see details from the training process as well as information about any remote resources or compute targets. Use the following code to turn on verbose logging when you submit the experiment.

run = exp.submit(src, show_output=True)

You can also use the same parameter in the wait_for_completion function on the resulting run.
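
For instance, a small helper could submit the configuration, stream the logs while waiting, and then read back whatever the script logged with run.log(). This is a sketch under assumptions: the helper name submit_and_collect is invented for this example, and experiment and config stand for an existing azureml.core.Experiment and ScriptRunConfig.

```python
def submit_and_collect(experiment, config):
    """Submit a configured run, stream its logs, and return its logged metrics.

    `experiment` is assumed to be an azureml.core.Experiment and `config`
    a ScriptRunConfig; both are created elsewhere.
    """
    run = experiment.submit(config)
    # show_output=True streams the run's log output while waiting.
    run.wait_for_completion(show_output=True)
    # Metrics recorded by the script with run.log(), keyed by metric name.
    return run.get_metrics()


# Example call, reusing the names from step 2:
# metrics = submit_and_collect(exp, src)
```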


Native Python logging

Some logs in the SDK may contain an error that instructs you to set the logging level to DEBUG. To set the logging level, add the following code to your script.

import logging
logging.basicConfig(level=logging.DEBUG)
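
If you prefer not to change the root logger, you can raise the level on a named logger instead. The sketch below uses only the standard library; the helper name make_debug_logger, the logger name "train", and the sample metric values are assumptions for illustration. It writes to an in-memory buffer so the result is easy to inspect; in a training script you would typically pass sys.stderr or a file handle.

```python
import io
import logging


def make_debug_logger(name, stream):
    """Return a logger that emits DEBUG and above to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)
    # Avoid duplicate lines if the root logger is also configured.
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_debug_logger("train", buffer)
log.debug("alpha=%.2f mse=%.2f", 0.05, 3424.9)
log.info("training finished")
captured = buffer.getvalue()
```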

Additional logging sources

Azure Machine Learning can also log information from other sources during training, such as automated machine learning runs, or Docker containers that run the jobs. These logs aren't documented, but if you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

For information on logging metrics in Azure Machine Learning designer, see How to log metrics in the designer.

Example notebooks

The following notebooks demonstrate concepts in this article:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps

See these articles to learn more on how to use Azure Machine Learning: