Monitor and view ML run logs and metrics

Learn how to monitor Azure Machine Learning runs and view their logs.

When you run an experiment, logs and metrics are streamed for you. You can also add your own logs and metrics. To learn how, see Enable logging in Azure ML training runs.

Logs help you diagnose errors and warnings for a run. Performance metrics, such as parameters and model accuracy, help you track and monitor your runs.

In this article, you learn how to view logs using the following methods:

  • Monitor runs in the studio
  • Monitor runs using the Jupyter Notebook widget
  • Monitor automated machine learning runs
  • View output logs upon completion
  • View output logs in the studio

For general information on how to manage your experiments, see Start, monitor, and cancel training runs.

Monitor runs using the Jupyter Notebook widget

When you use the ScriptRunConfig method to submit runs, you can watch the progress of the run using the Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

View the Jupyter widget while waiting for the run to complete.

from azureml.widgets import RunDetails
RunDetails(run).show()  # run: the Run object returned by the submission

Screenshot of the Jupyter Notebook widget

You can also get a link to the same display in your workspace.
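As a minimal sketch of retrieving that link from code, the helper below assumes `run` is an `azureml.core.Run` object, such as the one returned when you submit an experiment. The `id` property and `get_portal_url()` come from the SDK's `Run` class; the helper name `describe_run_link` is purely illustrative.

```python
def describe_run_link(run):
    """Return '<run id>: <studio URL>' for a submitted run.

    `run` is assumed to be an azureml.core.Run (for example, the object
    returned by experiment.submit(...)). The helper name is illustrative.
    """
    return f"{run.id}: {run.get_portal_url()}"
```

In a notebook, `print(describe_run_link(run))` would print the run ID followed by a link that opens the run in the studio.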


Monitor automated machine learning runs

For automated machine learning runs, to access the charts from a previous run, replace <<experiment_name>> with the appropriate experiment name:

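from azureml.core import Experiment
from azureml.core.run import Run
from azureml.widgets import RunDetails

# workspace is assumed to be an existing Workspace object
experiment = Experiment(workspace, <<experiment_name>>)
run_id = 'autoML_my_runID'  # replace with run_ID
run = Run(experiment, run_id)
RunDetails(run).show()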

Screenshot of the Jupyter Notebook widget for automated machine learning

Show output upon completion

When you use ScriptRunConfig, you can call run.wait_for_completion(show_output = True) to block until model training is complete. The show_output flag gives you verbose output. For more information, see the ScriptRunConfig section of How to enable logging.
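The call above can be wrapped as follows. This is only a sketch: it assumes `run` is an `azureml.core.Run`, and the helper name `wait_and_get_status` is hypothetical. `wait_for_completion` and `get_status` are methods of the SDK's `Run` class.

```python
def wait_and_get_status(run, show_output=True):
    """Block until the run finishes, then return its final status string.

    `run` is assumed to be an azureml.core.Run. With show_output=True the
    run's log output is streamed while waiting.
    """
    run.wait_for_completion(show_output=show_output)
    return run.get_status()
```

A typical use would be `status = wait_and_get_status(run)` right after submitting the run, followed by a check on the returned status before reading metrics.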

Query run metrics

You can view the metrics of a trained model using run.get_metrics(). For example, you could use this method with the example above to determine the best model by looking for the model with the lowest mean square error (mse) value.
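One way to pick the run with the lowest mse is sketched below. It works on plain dictionaries, so it assumes you have already collected the output of `run.get_metrics()` for each run. The helper name `best_run_by_metric` is hypothetical, and the metric key `mse` is an assumption: the actual name depends on what your training script logged.

```python
def best_run_by_metric(metrics_by_run, metric="mse"):
    """Return (run_id, value) for the run with the lowest `metric`, or None.

    metrics_by_run maps a run ID to the dict returned by run.get_metrics().
    A metric logged more than once comes back as a list; this sketch uses
    the last logged value, which is a design choice, not SDK behavior.
    Runs that did not log the metric are skipped.
    """
    best = None
    for run_id, metrics in metrics_by_run.items():
        if metric not in metrics:
            continue
        value = metrics[metric]
        if isinstance(value, list):
            if not value:
                continue
            value = value[-1]
        if best is None or value < best[1]:
            best = (run_id, value)
    return best
```

The input could be built with something like `{r.id: r.get_metrics() for r in experiment.get_runs()}`, assuming `experiment` is an `azureml.core.Experiment`.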

View run records in the studio

You can browse completed run records, including logged metrics, in the Azure Machine Learning studio.

Navigate to the Experiments tab. To view all your runs in your Workspace across Experiments, select the All runs tab. You can drill down on runs for specific Experiments by applying the Experiment filter in the top menu bar.

For the individual Experiment view, select the All experiments tab. On the experiment run dashboard, you can see tracked metrics and logs for each run.

You can also edit the run list table to select multiple runs and display either the last, minimum, or maximum logged value for your runs. Customize your charts to compare the logged metric values and aggregates across multiple runs.

Screenshot of run details in the Azure Machine Learning studio

Format charts

Use the following methods in the logging APIs to influence the metrics visualizations.

| Logged value | Example code | Format in portal |
|---|---|---|
| Log an array of numeric values | run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]) | Single-variable line chart |
| Log a single numeric value with the same metric name used repeatedly (for example, from within a for loop) | for i in tqdm(range(-10, 10)): run.log(name='Sigmoid', value=1 / (1 + np.exp(-i))); angle = i / 2.0 | Single-variable line chart |
| Log a row with 2 numerical columns repeatedly | run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle)); sines['angle'].append(angle); sines['sine'].append(np.sin(angle)) | Two-variable line chart |
| Log a table with 2 numerical columns | run.log_table(name='Sine Wave', value=sines) | Two-variable line chart |
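The fragments in the table share the variables `i`, `angle`, and `sines`, so they appear to come from a single loop. The sketch below assembles them under that assumption. To stay dependency-free it uses `math` and a plain `range` in place of `np` and `tqdm`, and it logs to `RecordingRun`, a hypothetical stand-in that only records calls. In a training script you would pass the real run object instead (for example, the one returned by `Run.get_context()`).

```python
import math


class RecordingRun:
    """Hypothetical stand-in for a Run: records calls instead of logging to Azure ML.

    It mirrors only the four logging methods used in the table.
    """

    def __init__(self):
        self.calls = []

    def log(self, name, value):
        self.calls.append(("log", name, value))

    def log_list(self, name, value):
        self.calls.append(("log_list", name, list(value)))

    def log_row(self, name, **columns):
        self.calls.append(("log_row", name, columns))

    def log_table(self, name, value):
        self.calls.append(("log_table", name, value))


def log_chart_examples(run):
    """Log one metric of each kind from the table and return the sine table."""
    # Array of numeric values: single-variable line chart
    run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

    sines = {'angle': [], 'sine': []}
    for i in range(-10, 10):
        # Same metric name logged repeatedly: single-variable line chart
        run.log(name='Sigmoid', value=1 / (1 + math.exp(-i)))
        angle = i / 2.0
        # Row with two numerical columns, logged repeatedly: two-variable line chart
        run.log_row(name='Cosine Wave', angle=angle, cos=math.cos(angle))
        sines['angle'].append(angle)
        sines['sine'].append(math.sin(angle))

    # Table with two numerical columns: two-variable line chart
    run.log_table(name='Sine Wave', value=sines)
    return sines


recorder = RecordingRun()
sines = log_chart_examples(recorder)
```

Keeping the logging code in a function that accepts any run-like object makes it possible to try the calls locally before submitting a run.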

View log files for a run

Log files are an essential resource for debugging Azure ML workloads. Drill down to a specific run to view its logs and outputs:

  1. Navigate to the Experiments tab.
  2. Select the runID for a specific run.
  3. Select Outputs and logs at the top of the page.


The tables below show the contents of the log files in the folders you'll see in this section.

You will not necessarily see every file for every run. For example, 20_image_build_log*.txt only appears when a new image is built (for example, when you change your environment).

azureml-logs folder

| File | Description |
|---|---|
| 20_image_build_log.txt | Docker image building log for the training environment; optional, one per run. Only applicable when updating your Environment; otherwise AML reuses the cached image. If successful, contains image registry details for the corresponding image. |
| 55_azureml-execution-<node_id>.txt | stdout/stderr log of the host tool, one per node. Pulls the image to the compute target. Note that this log only appears once you have secured compute resources. |
| 65_job_prep-<node_id>.txt | stdout/stderr log of the job preparation script, one per node. Downloads your code to the compute target and datastores (if requested). |
| 70_driver_log(_x).txt | stdout/stderr log from the AML control script and the customer training script, one per process. This is the standard output from your script, where your code's logs (for example, print statements) show up. In the majority of cases, you will monitor the logs here. |
| 70_mpi_log.txt | MPI framework log; optional, one per run. Only for MPI runs. |
| 75_job_post-<node_id>.txt | stdout/stderr log of the job release script, one per node. Sends logs and releases the compute resources back to Azure. |
| process_info.json | Shows which process is running on which node. |
| process_status.json | Shows process status, that is, whether a process is not started, running, or completed. |
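The same files can also be enumerated from code. The sketch below assumes `run` is an `azureml.core.Run`, whose `get_file_names()` method returns the paths of the files stored with the run. The helper name `list_run_files` and the default folder prefix are illustrative choices.

```python
def list_run_files(run, folder="azureml-logs/"):
    """Return the sorted names of a run's files stored under `folder`.

    `run` is assumed to be an azureml.core.Run. Because the log file names
    start with a numeric prefix (20_, 55_, 65_, ...), sorting them lists
    the numbered logs in ascending prefix order.
    """
    return sorted(
        name for name in run.get_file_names() if name.startswith(folder)
    )
```

Each returned name could then be fetched with `run.download_file(name, output_file_path=...)`, or all logs could be downloaded at once with `run.get_all_logs(destination=...)`.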

logs > azureml folder

| File | Description |
|---|---|
| job_prep_azureml.log | System log for job preparation |
| job_release_azureml.log | System log for job release |

logs > azureml > sidecar > node_id folder

When sidecar is enabled, the job prep and job release scripts run within the sidecar container. There is one folder for each node.

| File | Description |
|---|---|
| start_cms.txt | Log of the process that starts when the sidecar container starts |
| prep_cmd.txt | Log for ContextManagers entered when job_prep.py is run (some of this is streamed to azureml-logs/65_job_prep) |
| release_cmd.txt | Log for ContextManagers exited when job_release.py is run |

Other folders

For jobs training on multi-compute clusters, logs are present for each node IP. The structure for each node is the same as for single-node jobs. There is one additional logs folder for overall execution, stderr, and stdout logs.

Azure Machine Learning logs information from a variety of sources during training, such as AutoML or the Docker container that runs the training job. Many of these logs are not documented. If you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

Monitor a compute cluster

To monitor runs for a specific compute target from your browser, use the following steps:

  1. In the Azure Machine Learning studio, select your workspace, and then select Compute from the left side of the page.

  2. Select Training Clusters to display a list of compute targets used for training. Then select the cluster.

  3. Select Runs. The list of runs that use this cluster is displayed. To view details for a specific run, use the link in the Run column. To view details for the experiment, use the link in the Experiment column.



    Since training compute targets are a shared resource, they can have multiple runs queued or active at a given time.

    A run can contain child runs, so one training job can result in multiple entries.

Once a run completes, it is no longer displayed on this page. To view information on completed runs, visit the Experiments section of the studio and select the experiment and run. For more information, see the section View run records in the studio.

Next steps

Try these next steps to learn how to use Azure Machine Learning: