Monitor Azure ML experiment runs and metrics

APPLIES TO: Basic edition, Enterprise edition

Enhance the model creation process by tracking your experiments and monitoring run metrics. In this article, learn how to add logging code to your training script, submit an experiment run, monitor that run, and inspect the results in Azure Machine Learning.

Note

Azure Machine Learning may also log information from other sources during training, such as automated machine learning runs or the Docker container that runs the training job. These logs are not documented. If you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

Tip

The information in this document is primarily for data scientists and developers who want to monitor the model training process. If you are an administrator interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Available metrics to track

The following metrics can be added to a run while training an experiment. To view a more detailed list of what can be tracked on a run, see the Run class reference documentation.

| Type | Python function | Notes |
|---|---|---|
| Scalar values | Function: run.log(name, value, description='') Example: run.log("accuracy", 0.95) | Log a numerical or string value to the run with the given name. Logging a metric to a run causes that metric to be stored in the run record in the experiment. You can log the same metric multiple times within a run; the result is treated as a vector of that metric. |
| Lists | Function: run.log_list(name, value, description='') Example: run.log_list("accuracies", [0.6, 0.7, 0.87]) | Log a list of values to the run with the given name. |
| Row | Function: run.log_row(name, description=None, **kwargs) Example: run.log_row("Y over X", x=1, y=0.4) | Using log_row creates a metric with multiple columns as described in kwargs. Each named parameter generates a column with the value specified. log_row can be called once to log an arbitrary tuple, or multiple times in a loop to generate a complete table. |
| Table | Function: run.log_table(name, value, description='') Example: run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]}) | Log a dictionary object to the run with the given name. |
| Images | Function: run.log_image(name, path=None, plot=None) Example: run.log_image("ROC", plot=plt) | Log an image to the run record. Use log_image to log an image file or a matplotlib plot to the run. These images will be visible and comparable in the run record. |
| Tag a run | Function: run.tag(key, value=None) Example: run.tag("selected", "yes") | Tag the run with a string key and optional string value. |
| Upload file or directory | Function: run.upload_file(name, path_or_stream) Example: run.upload_file("best_model.pkl", "./model.pkl") | Upload a file to the run record. Runs automatically capture files in the specified output directory, which defaults to "./outputs" for most run types. Use upload_file only when additional files need to be uploaded or an output directory is not specified. We suggest adding outputs to the name so that it gets uploaded to the outputs directory. You can list all of the files that are associated with this run record by calling run.get_file_names(). |

Note

Metrics for scalars, lists, rows, and tables can be of type float, integer, or string.
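
As a rough illustration of how repeated logging calls accumulate values, the sketch below uses a hypothetical in-memory stand-in for a run. The class SketchRun, its metrics attribute, and the sample numbers are assumptions made for illustration only; they are not part of the Azure Machine Learning SDK, and the real Run object stores metrics in the run record rather than in a local dictionary.

```python
class SketchRun:
    """Hypothetical in-memory stand-in for a run (not the Azure ML SDK class)."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        # Logging the same name again appends to that metric's vector.
        self.metrics.setdefault(name, []).append(value)

    def log_row(self, name, **kwargs):
        # Each keyword argument becomes a column; each call adds one row.
        table = self.metrics.setdefault(name, {})
        for column, value in kwargs.items():
            table.setdefault(column, []).append(value)


run = SketchRun()
# Placeholder (alpha, mse) pairs, not results from a real experiment.
for alpha, mse in [(0.1, 3.2), (0.2, 3.0)]:
    run.log("mse", mse)
    run.log_row("mse over alpha", alpha=alpha, mse=mse)

print(run.metrics)
```

Calling log twice with the same name yields a two-element vector, and calling log_row in a loop builds the table one row at a time, which mirrors the behavior described in the Notes column.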

Choose a logging option

If you want to track or monitor your experiment, you must add code to start logging when you submit the run. The following are ways to trigger the run submission:

  • Experiment.start_logging - Add logging functions to your training script and start an interactive logging session in the specified experiment. start_logging creates an interactive run for use in scenarios such as notebooks. Any metrics that are logged during the session are added to the run record in the experiment.
  • ScriptRunConfig - Add logging functions to your training script and load the entire script folder with the run. ScriptRunConfig is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to display a visual widget for monitoring.

Set up the workspace

Before adding logging and submitting an experiment, you must set up the workspace.

  1. Load the workspace. To learn more about setting the workspace configuration, see workspace configuration file.

    from azureml.core import Experiment, Run, Workspace
    import azureml.core
    
    ws = Workspace.from_config()
    

Option 1: Use start_logging

start_logging creates an interactive run for use in scenarios such as notebooks. Any metrics that are logged during the session are added to the run record in the experiment.

The following example trains a simple sklearn Ridge model in a local Jupyter notebook. To learn more about submitting experiments to different environments, see Set up compute targets for model training with Azure Machine Learning.

Load the data

This example uses the diabetes dataset, a well-known small dataset that comes with scikit-learn. This cell loads the dataset and splits it into random training and testing sets.

# Load the diabetes dataset, a well-known small dataset that comes with scikit-learn
# Note: sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
columns = ['age', 'gender', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
data = {
    "train": {"X": X_train, "y": y_train},
    "test": {"X": X_test, "y": y_test}
}

# Fit a Ridge regression model and evaluate it on the test set
reg = Ridge(alpha=0.03)
reg.fit(data['train']['X'], data['train']['y'])
preds = reg.predict(data['test']['X'])
print('Mean Squared Error is', mean_squared_error(data['test']['y'], preds))

# Persist the trained model to disk
joblib.dump(value=reg, filename='model.pkl')

Add tracking

Add experiment tracking using the Azure Machine Learning SDK, and upload a persisted model into the experiment run record. The following code adds tags, logs, and uploads a model file to the experiment run.

import os

# Get an experiment object from Azure Machine Learning
experiment = Experiment(workspace=ws, name="train-within-notebook")

# Create a run object in the experiment
run = experiment.start_logging()
# Log the algorithm parameter alpha to the run
run.log('alpha', 0.03)

# Create, fit, and test the scikit-learn Ridge regression model
regression_model = Ridge(alpha=0.03)
regression_model.fit(data['train']['X'], data['train']['y'])
preds = regression_model.predict(data['test']['X'])

# Output the Mean Squared Error to the notebook and to the run
print('Mean Squared Error is', mean_squared_error(data['test']['y'], preds))
run.log('mse', mean_squared_error(data['test']['y'], preds))

# Save the model to the outputs directory for capture
# (create the directory first so joblib.dump does not fail)
os.makedirs('outputs', exist_ok=True)
model_file_name = 'outputs/model.pkl'

joblib.dump(value=regression_model, filename=model_file_name)

# Upload the model file explicitly into artifacts
run.upload_file(name=model_file_name, path_or_stream=model_file_name)

# Complete the run
run.complete()

The script ends with run.complete(), which marks the run as completed. This function is typically used in interactive notebook scenarios.

Option 2: Use ScriptRunConfig

ScriptRunConfig is a class for setting up configurations for script runs. With this option, you can add monitoring code to be notified of completion or to display a visual widget for monitoring.

This example expands on the basic sklearn Ridge model from above. It performs a simple parameter sweep over the model's alpha values and captures the metrics and trained models in runs under the experiment. The example runs locally against a user-managed environment.

  1. Create a training script train.py.

    # train.py
    
    import os
    
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
    import numpy as np
    from azureml.core.run import Run
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    
    import mylib  # local helper module submitted together with this script
    
    #os.makedirs('./outputs', exist_ok = True)
    
    X, y = load_diabetes(return_X_y=True)
    
    # Get the run that this script is executing in
    run = Run.get_context()
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    data = {"train": {"X": X_train, "y": y_train},
            "test": {"X": X_test, "y": y_test}}
    
    # alpha values to sweep over (see mylib.py)
    alphas = mylib.get_alphas()
    
    for alpha in alphas:
        # Use Ridge algorithm to create a regression model
        reg = Ridge(alpha=alpha)
        reg.fit(data["train"]["X"], data["train"]["y"])
    
        preds = reg.predict(data["test"]["X"])
        mse = mean_squared_error(data["test"]["y"], preds)
        # log the alpha and mse values
        run.log('alpha', alpha)
        run.log('mse', mse)
    
        # save the model in the working directory; it is uploaded explicitly below
        model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
        joblib.dump(value=reg, filename=model_file_name)
    
        # upload the model file explicitly into artifacts
        run.upload_file(name=model_file_name, path_or_stream=model_file_name)
    
        # register the model
        #run.register_model(file_name = model_file_name)
    
        print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))
    
    
  2. The train.py script references mylib.py, which provides the list of alpha values to use in the ridge model.

    # mylib.py
    
    import numpy as np
    
    def get_alphas():
       # list of numbers from 0.0 to 1.0 with a 0.05 interval
       return np.arange(0.0, 1.0, 0.05)
    
  3. Configure a user-managed local environment.

    from azureml.core.environment import Environment
    
    # Editing a run configuration property on-fly.
    user_managed_env = Environment("user-managed-env")
    
    user_managed_env.python.user_managed_dependencies = True
    
    # You can choose a specific Python environment by pointing to a Python path 
    #user_managed_env.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'
    
  4. Submit the train.py script to run in the user-managed environment. The whole script folder is submitted for training, including the mylib.py file.

    from azureml.core import ScriptRunConfig
    
    exp = Experiment(workspace=ws, name="train-on-local")
    src = ScriptRunConfig(source_directory='./', script='train.py')
    src.run_config.environment = user_managed_env
    run = exp.submit(src)
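
The alpha grid used in step 2 comes from np.arange(0.0, 1.0, 0.05), which excludes the stop value, so the sweep covers 20 values from 0.0 through 0.95 rather than reaching 1.0. The dependency-free sketch below reproduces that grid for inspection; the function name get_alphas_sketch and the rounding to two decimals are assumptions for illustration and are not part of the original mylib.py.

```python
def get_alphas_sketch(start=0.0, stop=1.0, step=0.05):
    """Return start, start + step, ... strictly below stop, rounded to 2 decimals."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 2) for i in range(count)]


alphas = get_alphas_sketch()
# Number of values, first value, and last value in the grid
print(len(alphas), alphas[0], alphas[-1])
```

Because each alpha is logged with run.log inside the loop, the submitted run ends up with one 'alpha' entry and one 'mse' entry per grid value.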
    

Manage a run

The Start, monitor, and cancel training runs article highlights specific Azure Machine Learning workflows for how to manage your experiments.

View run details

View active/queued runs from the browser

Compute targets used to train models are a shared resource. As such, they may have multiple runs queued or active at a given time. To see the runs for a specific compute target from your browser, use the following steps:

  1. From the Azure Machine Learning studio, select your workspace, and then select Compute from the left side of the page.

  2. Select Training Clusters to display a list of compute targets used for training. Then select the cluster.

    Screenshot: select the training cluster

  3. Select Runs. The list of runs that use this cluster is displayed. To view details for a specific run, use the link in the Run column. To view details for the experiment, use the link in the Experiment column.

    Screenshot: select runs for the training cluster

    Tip

    A run can contain child runs, so one training job can result in multiple entries.

Once a run completes, it is no longer displayed on this page. To view information on completed runs, visit the Experiments section of the studio and select the experiment and run. For more information, see the Query run metrics section.

Monitor run with Jupyter notebook widget

When you use the ScriptRunConfig method to submit runs, you can watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

  1. View the Jupyter widget while waiting for the run to complete.

    from azureml.widgets import RunDetails
    RunDetails(run).show()
    

    Screenshot of the Jupyter notebook widget

    You can also get a link to the same display in your workspace.

    print(run.get_portal_url())
    
  2. [For automated machine learning runs] To access the charts from a previous run, replace <<experiment_name>> with the appropriate experiment name:

    from azureml.widgets import RunDetails
    from azureml.core import Experiment
    from azureml.core.run import Run
    
    # ws is the workspace loaded earlier with Workspace.from_config()
    experiment = Experiment(ws, '<<experiment_name>>')  # replace with your experiment name
    run_id = 'autoML_my_runID'  # replace with your run ID
    run = Run(experiment, run_id)
    RunDetails(run).show()
    

    Screenshot of the Jupyter notebook widget for automated machine learning

To view further details of a pipeline, click the pipeline you would like to explore in the table. The charts render in a pop-up from the Azure Machine Learning studio.

Get log results upon completion

Model training and monitoring occur in the background so that you can run other tasks while you wait. You can also wait until the model has completed training before running more code. When you use ScriptRunConfig, you can use run.wait_for_completion(show_output = True) to show when the model training is complete. The show_output flag gives you verbose output.
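
To make the idea of blocking until training finishes concrete, here is a minimal polling sketch. It is not how run.wait_for_completion is implemented; the helper name wait_for_terminal_status, the set of terminal status names, and the canned status sequence are all assumptions for illustration. With a real run, a status callable such as run.get_status could be passed in place of the fake one.

```python
import time

# Assumed names of states after which a run no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=10.0,
                             timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                "run still in state {!r} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a canned status sequence instead of a real run.
statuses = iter(["Queued", "Running", "Running", "Completed"])
final = wait_for_terminal_status(lambda: next(statuses), poll_seconds=0.0)
print(final)
```

In practice run.wait_for_completion(show_output = True) already covers this case; a hand-written loop like this is mainly useful when you want custom timeouts or want to do other work between checks.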

Query run metrics

You can view the metrics of a trained model using run.get_metrics(). You can now get all of the metrics that were logged in the example above to determine the best model.
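
One way to pick the best model from the logged values is sketched below. It assumes the metrics come back as a dictionary that maps each metric name to a list of values in logging order, so that 'alpha' and 'mse' line up index by index. The helper name best_by_mse and the numbers in example_metrics are placeholders for illustration, not output from a real run.

```python
def best_by_mse(metrics):
    """Return the (alpha, mse) pair with the smallest mse.

    Expects parallel lists under the keys 'alpha' and 'mse'.
    """
    pairs = list(zip(metrics["alpha"], metrics["mse"]))
    if not pairs:
        raise ValueError("no metrics were logged")
    return min(pairs, key=lambda pair: pair[1])


# Placeholder values for illustration only, not results from a real run.
example_metrics = {"alpha": [0.0, 0.05, 0.1], "mse": [3.5, 3.1, 3.3]}
best_alpha, best_mse = best_by_mse(example_metrics)
print(best_alpha, best_mse)
```

With a real run, the dictionary returned by run.get_metrics() would take the place of example_metrics, provided both metrics were logged once per loop iteration as in train.py.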

View the experiment in your workspace in Azure Machine Learning studio

When an experiment has finished running, you can browse to the recorded experiment run record. You can access the history from the Azure Machine Learning studio.

Navigate to the Experiments tab and select your experiment. You are brought to the experiment run dashboard, where you can see tracked metrics and charts that are logged for each run. In this case, we logged MSE and the alpha values.

Screenshot of run details in the Azure Machine Learning studio

You can drill down to a specific run to view its outputs or logs, or download the snapshot of the experiment you submitted so you can share the experiment folder with others.

Viewing charts in run details

There are various ways to use the logging APIs to record different types of metrics during a run and view them as charts in Azure Machine Learning studio.

| Logged value | Example code | View in portal |
|---|---|---|
| Log an array of numeric values | run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]) | Single-variable line chart |
| Log a single numeric value with the same metric name repeatedly used (like from within a for loop) | for i in tqdm(range(-10, 10)): run.log(name='Sigmoid', value=1 / (1 + np.exp(-i))); angle = i / 2.0 | Single-variable line chart |
| Log a row with 2 numerical columns repeatedly | run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle)); sines['angle'].append(angle); sines['sine'].append(np.sin(angle)) | Two-variable line chart |
| Log table with 2 numerical columns | run.log_table(name='Sine Wave', value=sines) | Two-variable line chart |
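
The example snippets in the table are fragments of a single loop. The sketch below assembles the data they produce using only the standard library, so the shapes can be inspected without a workspace. The function name build_chart_data is an assumption for illustration, and the logging calls appear only as comments that show where each value would be sent.

```python
import math


def build_chart_data(start=-10, stop=10):
    """Build the values that the table's snippets would log for each chart."""
    sigmoid = []                          # one scalar per iteration
    cosine_rows = []                      # one two-column row per iteration
    sines = {"angle": [], "sine": []}     # whole table, logged once at the end
    for i in range(start, stop):
        # Would be sent with run.log(name='Sigmoid', value=...)
        sigmoid.append(1 / (1 + math.exp(-i)))
        angle = i / 2.0
        # Would be sent with run.log_row(name='Cosine Wave', angle=..., cos=...)
        cosine_rows.append({"angle": angle, "cos": math.cos(angle)})
        sines["angle"].append(angle)
        sines["sine"].append(math.sin(angle))
    # sines would be sent with run.log_table(name='Sine Wave', value=sines)
    return sigmoid, cosine_rows, sines


sigmoid, cosine_rows, sines = build_chart_data()
print(len(sigmoid), len(cosine_rows), len(sines["angle"]))
```

The repeated scalar and the repeated row are logged inside the loop, while the table is built up locally and logged once, which is why both row and table logging end up as the same kind of two-variable chart.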

Example notebooks

The following notebooks demonstrate concepts in this article:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps

Try these next steps to learn how to use the Azure Machine Learning SDK for Python: