Log & view metrics and log files

Log real-time information using both the default Python logging package and Azure Machine Learning Python SDK-specific functionality. You can log locally and send logs to your workspace in the portal.

Logs can help you diagnose errors and warnings, or track values such as parameters and model performance metrics. In this article, you learn how to enable logging in the following scenarios:

  • Log run metrics
  • Interactive training sessions
  • Submitting training jobs using ScriptRunConfig
  • Python native logging settings
  • Logging from additional sources

Tip

This article shows you how to monitor the model training process. If you're interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Data types

You can log multiple data types, including scalar values, lists, tables, images, directories, and more. For more information and Python code examples for different data types, see the Run class reference page.

Logging run metrics

Use the following methods in the logging APIs to influence the metrics visualizations. Note the service limits for these logged metrics.

Logged value | Example code | Format in portal
Log an array of numeric values | run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]) | Single-variable line chart
Log a single numeric value with the same metric name used repeatedly (for example, from within a for loop) | for i in tqdm(range(-10, 10)): run.log(name='Sigmoid', value=1 / (1 + np.exp(-i))); angle = i / 2.0 | Single-variable line chart
Log a row with 2 numerical columns repeatedly | run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle)); sines['angle'].append(angle); sines['sine'].append(np.sin(angle)) | Two-variable line chart
Log a table with 2 numerical columns | run.log_table(name='Sine Wave', value=sines) | Two-variable line chart
Log an image | run.log_image(name='food', path='./breadpudding.jpg', plot=None, description='desert') | Use this method to log an image file or a matplotlib plot to the run. These images are visible and comparable in the run record.
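
The example snippets in the table come from a single loop and depend on each other: `angle` and `sines` are built up across rows. The sketch below shows one way they fit together. So that it runs without a workspace, it uses `RecordingRun`, a hypothetical stand-in that only mimics the `log`, `log_list`, `log_row`, and `log_table` method names; in a real training script you would obtain `run` from `Run.get_context()` instead. It also uses the standard `math` module in place of `np` and omits the `tqdm` progress bar.

```python
import math
from collections import defaultdict


class RecordingRun:
    """Hypothetical stand-in for azureml.core.Run; it only records calls in memory."""

    def __init__(self):
        self.scalars = defaultdict(list)   # metric name -> logged values
        self.rows = defaultdict(list)      # metric name -> logged rows
        self.tables = {}                   # metric name -> table dictionary

    def log(self, name, value):
        self.scalars[name].append(value)

    def log_list(self, name, value):
        self.scalars[name].extend(value)

    def log_row(self, name, **columns):
        self.rows[name].append(columns)

    def log_table(self, name, value):
        self.tables[name] = value


run = RecordingRun()  # in a real training script: run = Run.get_context()

# An array of numbers logged in one call.
run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

sines = {'angle': [], 'sine': []}
for i in range(-10, 10):
    # Same metric name logged repeatedly: one value per iteration.
    run.log(name='Sigmoid', value=1 / (1 + math.exp(-i)))

    angle = i / 2.0
    # One row with two numeric columns per iteration.
    run.log_row(name='Cosine Wave', angle=angle, cos=math.cos(angle))

    # Accumulate columns so the whole table can be logged once after the loop.
    sines['angle'].append(angle)
    sines['sine'].append(math.sin(angle))

run.log_table(name='Sine Wave', value=sines)
```

Logging inside the loop (`log`, `log_row`) produces one point per iteration, while `log_table` sends the accumulated columns in a single call after the loop ends.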

Logging with MLflow

Use MLFlowLogger to log metrics.

from azureml.core import Run
# connect to the workspace from within your running code
run = Run.get_context()
ws = run.experiment.workspace

# workspace has associated ml-flow-tracking-uri
mlflow_url = ws.get_mlflow_tracking_uri()

#Example: PyTorch Lightning
from pytorch_lightning.loggers import MLFlowLogger

mlf_logger = MLFlowLogger(experiment_name=run.experiment.name, tracking_uri=mlflow_url)
mlf_logger._run_id = run.id

View run metrics

Via the SDK

You can view the metrics of a trained model using run.get_metrics(). See the example below.

from azureml.core import Run
run = Run.get_context()
run.log('metric-name', metric_value)

metrics = run.get_metrics()
# metrics is of type Dict[str, List[float]] mapping metric names
# to a list of the values for that metric in the given run.

metrics.get('metric-name')
# list of metrics in the order they were recorded
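
Once the metrics dictionary is in hand, a common next step is to reduce each series to a few aggregates. The sketch below assumes the dictionary has the shape described in the comments above (metric name mapped to a list of values). The `summarize_metrics` helper and the sample values are illustrative and not part of the SDK. As a precaution, the helper also accepts a bare number in case a metric was logged only once.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_metrics(
    metrics: Dict[str, Union[Number, List[Number]]]
) -> Dict[str, Dict[str, Number]]:
    """Reduce each metric's logged values to last, min, and max."""
    summary = {}
    for name, values in metrics.items():
        # Treat a bare number as a one-element series.
        series = values if isinstance(values, list) else [values]
        if not series:
            continue  # nothing was logged under this name
        summary[name] = {
            "last": series[-1],
            "min": min(series),
            "max": max(series),
        }
    return summary


# Sample data standing in for run.get_metrics(); the values are made up.
metrics = {
    "alpha": [0.0, 0.05, 0.1],
    "mse": [3424.9, 3408.2, 3372.6],
    "epochs": 3,
}

summary = summarize_metrics(metrics)
for name, stats in summary.items():
    print(name, stats)
```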

View run metrics in the AML studio UI

You can browse completed run records, including logged metrics, in the Azure Machine Learning studio.

Navigate to the Experiments tab. To view all your runs in your workspace across experiments, select the All runs tab. You can drill down on runs for specific experiments by applying the Experiment filter in the top menu bar.

For the individual experiment view, select the All experiments tab. On the experiment run dashboard, you can see tracked metrics and logs for each run.

You can also edit the run list table to select multiple runs and display either the last, minimum, or maximum logged value for your runs. Customize your charts to compare the logged metric values and aggregates across multiple runs. You can plot multiple metrics on the y-axis of your chart and customize your x-axis to plot your logged metrics.

View and download log files for a run

Log files are an essential resource for debugging Azure ML workloads. After submitting a training job, drill down to a specific run to view its logs and outputs:

  1. Navigate to the Experiments tab.
  2. Select the runID for a specific run.
  3. Select Outputs and logs at the top of the page.
  4. Select Download all to download all your logs into a zip folder.
  5. You can also download individual log files by choosing the log file and selecting Download.

Screenshot of the Outputs and logs section of a run.

The tables below show the contents of the log files in the folders you'll see in this section.

Note

You won't necessarily see every file for every run. For example, 20_image_build_log*.txt only appears when a new image is built (for example, when you change your environment).

azureml-logs folder

File | Description
20_image_build_log.txt | Docker image building log for the training environment; optional, one per run. Only applicable when updating your environment; otherwise, AML reuses the cached image. If successful, contains image registry details for the corresponding image.
55_azureml-execution-<node_id>.txt | stdout/stderr log of the host tool, one per node. Pulls the image to the compute target. This log only appears once you have secured compute resources.
65_job_prep-<node_id>.txt | stdout/stderr log of the job preparation script, one per node. Downloads your code to the compute target and datastores (if requested).
70_driver_log(_x).txt | stdout/stderr log from the AML control script and the customer training script, one per process. This file is where your code's logs (for example, print statements) show up. In most cases, you monitor the logs here.
70_mpi_log.txt | MPI framework log; optional, one per run. Only for MPI runs.
75_job_post-<node_id>.txt | stdout/stderr log of the job release script, one per node. Sends logs and releases the compute resources back to Azure.
process_info.json | Shows which process is running on which node.
process_status.json | Shows process status, such as whether a process is not started, running, or completed.
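
After downloading and unzipping the logs, the driver log is usually the first file to inspect. The sketch below searches an extracted azureml-logs folder for `70_driver_log*.txt` files and reports lines that contain common failure markers. The function name, the marker strings, and the folder path in the usage comment are assumptions made for illustration; adjust them to your own layout.

```python
from pathlib import Path
from typing import List, Tuple

# Substrings treated as signs of a failure; an illustrative choice.
FAILURE_MARKERS = ("Traceback", "Error", "Exception")


def find_driver_log_errors(logs_dir: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, text) for suspicious lines in driver logs."""
    hits = []
    # Matches 70_driver_log.txt as well as per-process files such as 70_driver_log_0.txt.
    for log_path in sorted(Path(logs_dir).glob("70_driver_log*.txt")):
        with open(log_path, encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if any(marker in line for marker in FAILURE_MARKERS):
                    hits.append((log_path.name, line_number, line.rstrip()))
    return hits


# Usage (assumes the zip was extracted so that ./azureml-logs exists):
# for name, number, text in find_driver_log_errors("./azureml-logs"):
#     print(f"{name}:{number}: {text}")
```

Other files in the folder, such as the execution or job preparation logs, are skipped on purpose because the driver log is where the training script's own output appears.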

logs > azureml folder

File | Description
110_azureml.log |
job_prep_azureml.log | System log for job preparation
job_release_azureml.log | System log for job release

logs > azureml > sidecar > node_id folder

When sidecar is enabled, the job prep and job release scripts run within the sidecar container. There is one folder for each node.

File | Description
start_cms.txt | Log of the process that starts when the sidecar container starts
prep_cmd.txt | Log for ContextManagers entered when job_prep.py is run (some of this content is streamed to azureml-logs/65-job_prep)
release_cmd.txt | Log for ContextManagers exited when job_release.py is run

Other folders

For jobs training on multi-compute clusters, logs are present for each node IP. The structure for each node is the same as for single-node jobs. There is one more logs folder for overall execution, stderr, and stdout logs.

Azure Machine Learning logs information from various sources during training, such as AutoML or the Docker container that runs the training job. Many of these logs are not documented. If you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

Interactive logging session

Interactive logging sessions are typically used in notebook environments. The method Experiment.start_logging() starts an interactive logging session. Any metrics logged during the session are added to the run record in the experiment. The method run.complete() ends the session and marks the run as completed.
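
A minimal sketch of this start, log, and complete pattern follows. It assumes `workspace` is a Workspace object you've already loaded; the function name, the experiment name, the metric name, and the `train_step` callable are placeholders. The `azureml.core` import sits inside the function so that the definition itself loads without the SDK installed. Marking the run as failed with `run.fail()` when an exception occurs is a design choice of this sketch, not something this article prescribes.

```python
def log_interactive_session(workspace, train_step,
                            experiment_name="interactive-demo", steps=3):
    """Start an interactive run, log one metric per step, then close the run."""
    from azureml.core import Experiment  # requires the Azure Machine Learning SDK

    experiment = Experiment(workspace=workspace, name=experiment_name)
    run = experiment.start_logging()  # begins the interactive logging session
    try:
        for step in range(steps):
            # train_step is a placeholder callable that returns a numeric loss.
            run.log("loss", train_step(step))
    except Exception:
        run.fail()       # mark the run as failed rather than completed
        raise
    else:
        run.complete()   # ends the session and marks the run as completed
    return run
```

In a notebook you would typically call `experiment.start_logging()` in one cell, log from later cells, and call `run.complete()` at the end; wrapping the steps in a function is only a convenience here.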

ScriptRun logs

In this section, you learn how to add logging code inside runs created with ScriptRunConfig. You can use the ScriptRunConfig class to encapsulate scripts and environments for repeatable runs. You can also use this option to show a visual Jupyter Notebooks widget for monitoring.

This example performs a parameter sweep over alpha values and captures the results using the run.log() method.

  1. Create a training script that includes the logging logic, train.py.

    # Copyright (c) Microsoft. All rights reserved.
    # Licensed under the MIT license.
    
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from azureml.core.run import Run
    import os
    import numpy as np
    import mylib
    # sklearn.externals.joblib is removed in 0.23
    try:
       from sklearn.externals import joblib
    except ImportError:
       import joblib
    
    os.makedirs('./outputs', exist_ok=True)
    
    X, y = load_diabetes(return_X_y=True)
    
    run = Run.get_context()
    
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                       test_size=0.2,
                                                       random_state=0)
    data = {"train": {"X": X_train, "y": y_train},
          "test": {"X": X_test, "y": y_test}}
    
    # list of numbers from 0.0 to 1.0 with a 0.05 interval
    alphas = mylib.get_alphas()
    
    for alpha in alphas:
       # Use Ridge algorithm to create a regression model
       reg = Ridge(alpha=alpha)
       reg.fit(data["train"]["X"], data["train"]["y"])
    
       preds = reg.predict(data["test"]["X"])
       mse = mean_squared_error(data["test"]["y"], preds)
       run.log('alpha', alpha)
       run.log('mse', mse)
    
       model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
       # save the model in the outputs folder so it is uploaded automatically
       joblib.dump(value=reg, filename=os.path.join('./outputs/',
                                                    model_file_name))
    
       print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))
    
  2. Submit the train.py script to run in a user-managed environment. The entire script folder is submitted for training.

    from azureml.core import ScriptRunConfig
    
    src = ScriptRunConfig(source_directory='./', script='train.py', environment=user_managed_env)
    
    run = exp.submit(src)
    

    The show_output parameter turns on verbose logging, which lets you see details from the training process as well as information about any remote resources or compute targets. Use the following code to turn on verbose logging when you submit the experiment.

run = exp.submit(src, show_output=True)

You can also use the same parameter in the wait_for_completion function on the resulting run.

run.wait_for_completion(show_output=True)
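
The train.py script above imports a helper module, `mylib`, that this article doesn't show. Based only on the comment in the script (numbers from 0.0 to 1.0 with a 0.05 interval), a `mylib.py` placed next to train.py might look like the sketch below. Whether the real helper includes the 1.0 endpoint is unknown; this version excludes it, and the `start`, `stop`, and `step` parameters are assumptions.

```python
# mylib.py (hypothetical contents; the article does not show the real module)


def get_alphas(start=0.0, stop=1.0, step=0.05):
    """Return evenly spaced alpha values in [start, stop), rounded to avoid float drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count)]


if __name__ == "__main__":
    print(get_alphas())
```

Because the entire script folder is submitted for training, a helper file stored alongside train.py is uploaded with it and can be imported on the compute target.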

Native Python logging

Some logs in the SDK may contain an error that instructs you to set the logging level to DEBUG. To set the logging level, add the following code to your script.

import logging
logging.basicConfig(level=logging.DEBUG)

Other logging sources

Azure Machine Learning can also log information from other sources during training, such as automated machine learning runs or the Docker containers that run the jobs. These logs aren't documented, but if you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

For information on logging metrics in Azure Machine Learning designer, see How to log metrics in the designer.

Example notebooks

The following notebooks demonstrate concepts in this article:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps

See these articles to learn more about how to use Azure Machine Learning: