Visualize experiment runs and metrics with TensorBoard and Azure Machine Learning

In this article, you learn how to view your experiment runs and metrics in TensorBoard using the tensorboard package in the main Azure Machine Learning SDK. Once you've inspected your experiment runs, you can better tune and retrain your machine learning models.

TensorBoard is a suite of web applications for inspecting and understanding your experiment structure and performance.

How you launch TensorBoard with Azure Machine Learning experiments depends on the type of experiment:

  • If your experiment natively outputs log files that TensorBoard can consume, as PyTorch, Chainer, and TensorFlow experiments do, you can launch TensorBoard directly from the experiment's run history.

  • For experiments that don't natively output TensorBoard-consumable files, such as Scikit-learn or Azure Machine Learning experiments, use the export_to_tensorboard() method to export the run histories as TensorBoard logs and launch TensorBoard from there.
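
The choice between the two options comes down to whether a run's output already contains TensorBoard event files. As a rough sketch, the check below walks a log directory and looks for file names containing tfevents, the marker that TensorBoard event files conventionally carry. The helper names has_tensorboard_logs and choose_option are illustrative assumptions and are not part of the Azure Machine Learning SDK.

```python
import os


def has_tensorboard_logs(log_dir):
    """Return True if log_dir (searched recursively) holds a TensorBoard event file.

    Hypothetical helper for illustration; it is not an Azure ML SDK function.
    """
    for _root, _dirs, files in os.walk(log_dir):
        # TensorBoard event files are conventionally named events.out.tfevents.<...>
        if any("tfevents" in name for name in files):
            return True
    return False


def choose_option(log_dir):
    """Suggest which approach from this article fits the given log directory."""
    if has_tensorboard_logs(log_dir):
        return "Option 1: launch TensorBoard directly from run history"
    return "Option 2: export run history with export_to_tensorboard()"
```

A directory that does not exist, or that holds no event files, falls through to Option 2.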

Tip

The information in this document is primarily for data scientists and developers who want to monitor the model training process. If you are an administrator interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Prerequisites

  • To launch TensorBoard and view your experiment run histories, your experiments must already have logging enabled to track their metrics and performance.

  • The code in this document can be run in either of the following environments:

Option 1: Directly view run history in TensorBoard

This option works for experiments that natively output log files consumable by TensorBoard, such as PyTorch, Chainer, and TensorFlow experiments. If that is not the case for your experiment, use the export_to_tensorboard() method instead.

The following example code uses the MNIST demo experiment from TensorFlow's repository on a remote compute target, Azure Machine Learning Compute. Next, we configure and start a run for training the TensorFlow model, and then start TensorBoard against this TensorFlow experiment.

Set experiment name and create project folder

Here we name the experiment and create its folder.

from os import path, makedirs
experiment_name = 'tensorboard-demo'

# experiment folder
exp_dir = './sample_projects/' + experiment_name

if not path.exists(exp_dir):
    makedirs(exp_dir)

Download TensorFlow demo experiment code

TensorFlow's repository has an MNIST demo with extensive TensorBoard instrumentation. We do not alter, nor need to alter, any of this demo's code for it to work with Azure Machine Learning. In the following code, we download the MNIST code and save it in our newly created experiment folder.

import requests
import os

tf_code = requests.get("https://raw.githubusercontent.com/tensorflow/tensorflow/r1.8/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py")
with open(os.path.join(exp_dir, "mnist_with_summaries.py"), "w") as file:
    file.write(tf_code.text)

Throughout the MNIST code file, mnist_with_summaries.py, notice the lines that call tf.summary.scalar(), tf.summary.histogram(), tf.summary.FileWriter(), and so on. These methods group, log, and tag key metrics of your experiments into run history. tf.summary.FileWriter() is especially important because it serializes the data from your logged experiment metrics, which allows TensorBoard to generate visualizations from them.
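
To get a quick overview of how a script is instrumented, you can count its tf.summary calls. The sketch below does this with a regular expression over a source string. The function name count_summary_calls and the sample snippet are made up for illustration; the snippet only stands in for the contents of mnist_with_summaries.py.

```python
import re
from collections import Counter

# Matches calls such as tf.summary.scalar( or tf.summary.FileWriter(
SUMMARY_CALL = re.compile(r"\btf\.summary\.(\w+)\s*\(")


def count_summary_calls(source):
    """Count tf.summary.<name>(...) calls in a Python source string."""
    return Counter(SUMMARY_CALL.findall(source))


# A made-up snippet standing in for the contents of mnist_with_summaries.py
sample = """
tf.summary.scalar('mean', mean)
tf.summary.histogram('histogram', var)
tf.summary.scalar('accuracy', accuracy)
train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph)
"""

calls = count_summary_calls(sample)
for name, count in sorted(calls.items()):
    print(f"tf.summary.{name}: {count}")
```

To run it against the downloaded file, read the file from the experiment folder and pass its text to count_summary_calls().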

Configure experiment

In the following, we configure our experiment and set up directories for logs and data. These logs are uploaded to the run history, which TensorBoard accesses later.

Note

For this TensorFlow example, you need to install TensorFlow on your local machine. Further, the TensorBoard module (that is, the one included with TensorFlow) must be accessible to this notebook's kernel, because the local machine is what runs TensorBoard.

import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment

ws = Workspace.from_config()

# create directories for experiment logs and dataset
logs_dir = os.path.join(os.curdir, "logs")
data_dir = os.path.abspath(os.path.join(os.curdir, "mnist_data"))

if not path.exists(data_dir):
    makedirs(data_dir)

os.environ["TEST_TMPDIR"] = data_dir

# Writing logs to ./logs results in their being uploaded to the run history,
# and thus, made accessible to our TensorBoard instance.
args = ["--log_dir", logs_dir]

# Create an experiment
exp = Experiment(ws, experiment_name)
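
The args list built above is handed to the training script as command-line arguments. How a script consumes them is up to the script; the sketch below shows one common pattern using argparse. The parser here is an illustrative assumption and is not copied from mnist_with_summaries.py.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a --log_dir argument the way a training script might."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--log_dir",
        type=str,
        default=os.path.join(os.curdir, "logs"),
        help="Directory where TensorBoard summaries are written",
    )
    # parse_known_args ignores any extra arguments instead of raising an error
    args, _unknown = parser.parse_known_args(argv)
    return args


# Same shape as the args list built above
example_args = ["--log_dir", os.path.join(os.curdir, "logs")]
print(parse_args(example_args).log_dir)
```

Whatever directory the script receives this way is where it should write its summaries, so that they end up under ./logs and are uploaded to the run history.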

Create a cluster for your experiment

We create an AmlCompute cluster for this experiment; however, your experiments can be created in any environment, and you can still launch TensorBoard against the experiment run history.

from azureml.core.compute import ComputeTarget, AmlCompute

cluster_name = "cpu-cluster"

cts = ws.compute_targets
found = False
if cluster_name in cts and cts[cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[cluster_name]
if not found:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count=None)

# use get_status() to get a detailed status for the current cluster. 
# print(compute_target.get_status().serialize())

Configure and submit training run

Configure a training job by creating a ScriptRunConfig object.

from azureml.core import ScriptRunConfig
from azureml.core import Environment

# Here we will use the TensorFlow 2.2 curated environment
tf_env = Environment.get(ws, 'AzureML-TensorFlow-2.2-GPU')

src = ScriptRunConfig(source_directory=exp_dir,
                      script='mnist_with_summaries.py',
                      arguments=args,
                      compute_target=compute_target,
                      environment=tf_env)
run = exp.submit(src)

Launch TensorBoard

You can launch TensorBoard during your run or after it completes. In the following, we create a TensorBoard object instance, tb, that takes the experiment run history loaded in the run, and then launch TensorBoard with the start() method.

The TensorBoard constructor takes an array of runs, so be sure to pass it in as a single-element array.

from azureml.tensorboard import Tensorboard

tb = Tensorboard([run])

# If successful, start() returns a string with the URI of the instance.
tb.start()

# After your job completes, be sure to stop() the streaming otherwise it will continue to run. 
tb.stop()

Note

While this example used TensorFlow, TensorBoard can be used just as easily with PyTorch or Chainer. TensorFlow must be available on the machine running TensorBoard, but it is not necessary on the machine doing PyTorch or Chainer computations.

Option 2: Export history as log to view in TensorBoard

The following code sets up a sample experiment, begins the logging process using the Azure Machine Learning run history APIs, and exports the experiment run history into logs consumable by TensorBoard for visualization.

Set up experiment

The following code sets up a new experiment and starts a run, root_run, for logging.

from azureml.core import Workspace, Experiment
import azureml.core

# set experiment name and run name
ws = Workspace.from_config()
experiment_name = 'export-to-tensorboard'
exp = Experiment(ws, experiment_name)
root_run = exp.start_logging()

Here we load the diabetes dataset, a small built-in dataset that comes with scikit-learn, and split it into test and training sets.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
columns = ['age', 'gender', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
data = {
    "train":{"x":x_train, "y":y_train},        
    "test":{"x":x_test, "y":y_test}
}

Run experiment and log metrics

For this code, we train a linear regression model (Ridge regression) and log two key metrics in run history: the regularization coefficient, alpha, and the mean squared error, mse.

from tqdm import tqdm
alphas = [.1, .2, .3, .4, .5, .6, .7]
# try a range of alpha values in a regularized linear regression (Ridge) model
for alpha in tqdm(alphas):
    # create a child run for each alpha and fit the resulting model
    with root_run.child_run("alpha" + str(alpha)) as run:
        reg = Ridge(alpha=alpha)
        reg.fit(data["train"]["x"], data["train"]["y"])

        preds = reg.predict(data["test"]["x"])
        mse = mean_squared_error(data["test"]["y"], preds)
        # end train and eval

        # log alpha and mean squared error on the root run, which is the run exported later
        root_run.log("alpha", alpha)
        root_run.log("mse", mse)
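
The mse value logged for each alpha is the average squared difference between the predictions and the true targets. A minimal pure-Python sketch of that calculation, plus a helper that picks the alpha with the lowest error, is shown below. The function names and the numbers in the example are made up for illustration and are not results from this experiment.

```python
def mean_squared_error_py(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def best_alpha(results):
    """Return the (alpha, mse) pair with the smallest mse.

    results is a list of (alpha, mse) tuples, one per trained model.
    """
    return min(results, key=lambda pair: pair[1])


# Illustrative values only
print(mean_squared_error_py([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(best_alpha([(0.1, 3.5), (0.2, 3.1), (0.3, 3.3)]))
```

In TensorBoard, the same comparison is made visually by looking for the lowest point on the mse chart.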

Export runs to TensorBoard

With the SDK's export_to_tensorboard() method, we can export the run history of our Azure Machine Learning experiment into TensorBoard logs, so we can view them via TensorBoard.

In the following code, we create the folder named by logdir (exportedTBlogs) in our current working directory. This folder is where we export our experiment run history and logs from root_run. We then mark that run as completed.

from azureml.tensorboard.export import export_to_tensorboard
import os

logdir = 'exportedTBlogs'
log_path = os.path.join(os.getcwd(), logdir)
# create the log folder if it doesn't already exist
os.makedirs(log_path, exist_ok=True)
print(logdir)

# export run history for the project
export_to_tensorboard(root_run, logdir)

root_run.complete()

Note

You can also export a particular run to TensorBoard by specifying the name of the run: export_to_tensorboard(run_name, logdir).

Start and stop TensorBoard

Once our run history for this experiment is exported, we can launch TensorBoard with the start() method.

from azureml.tensorboard import Tensorboard

# No runs are passed here: TensorBoard reads the exported logs from local_root instead
tb = Tensorboard([], local_root=logdir, port=6006)

# If successful, start() returns a string with the URI of the instance.
tb.start()

When you're done, make sure to call the stop() method of the TensorBoard object. Otherwise, TensorBoard continues to run until you shut down the notebook kernel.

tb.stop()
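
Because a forgotten stop() leaves TensorBoard running, it can help to tie start() and stop() together. The sketch below wraps any object that exposes start() and stop() in a context manager, so that stop() runs even if the code in between raises an error. The names running and FakeBoard are hypothetical; FakeBoard only stands in for a real Tensorboard object so that the example runs without Azure.

```python
from contextlib import contextmanager


@contextmanager
def running(board):
    """Start board, yield whatever start() returns, and always stop it."""
    url = board.start()
    try:
        yield url
    finally:
        board.stop()


class FakeBoard:
    """Stand-in with the same start()/stop() shape, for demonstration only."""

    def __init__(self):
        self.active = False

    def start(self):
        self.active = True
        return "http://localhost:6006"

    def stop(self):
        self.active = False


board = FakeBoard()
with running(board) as url:
    print("TensorBoard available at", url)
print("still running:", board.active)
```

Assuming the SDK object behaves as described in this article, the same wrapper could be applied to Tensorboard([], local_root=logdir, port=6006) in place of FakeBoard.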

Next steps

In this how-to, you created two experiments and learned how to launch TensorBoard against their run histories to identify areas for potential tuning and retraining.