Version and track datasets in experiments

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, you'll learn how to version and track Azure Machine Learning datasets for reproducibility. Dataset versioning is a way to bookmark the state of your data so that you can apply a specific version of the dataset for future experiments.

Typical versioning scenarios:

  • When new data is available for retraining
  • When you're applying different data preparation or feature engineering approaches

Prerequisites

For this tutorial, you need:

Register and retrieve dataset versions

By registering a dataset, you can version, reuse, and share it across experiments and with colleagues. You can register multiple datasets under the same name and retrieve a specific version by name and version number.

Register a dataset version

The following code registers a new version of the titanic_ds dataset by setting the create_new_version parameter to True. If there's no existing titanic_ds dataset registered with the workspace, the code creates a new dataset named titanic_ds and sets its version to 1.

titanic_ds = titanic_ds.register(workspace = workspace,
                                 name = 'titanic_ds',
                                 description = 'titanic training data',
                                 create_new_version = True)
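
The snippet assumes that titanic_ds already exists as an in-memory dataset object. As a minimal sketch of how it might be created, assuming a workspace loaded from a local config file and an example CSV URL (the URL here is only an illustration):

from azureml.core import Dataset, Workspace

# load the workspace from a local azureml config file (assumed to exist)
workspace = Workspace.from_config()

# create an in-memory TabularDataset from a delimited file (example URL)
web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)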

Retrieve a dataset by name

By default, the get_by_name() method on the Dataset class returns the latest version of the dataset registered with the workspace.

The following code gets version 1 of the titanic_ds dataset.

from azureml.core import Dataset
# Get a dataset by name and version number
titanic_ds = Dataset.get_by_name(workspace = workspace,
                                 name = 'titanic_ds', 
                                 version = 1)
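
To retrieve the latest version instead, omit the version argument; passing version='latest' is equivalent. A minimal sketch:

# Omitting version (or passing version='latest') returns the latest version
titanic_ds_latest = Dataset.get_by_name(workspace=workspace, name='titanic_ds')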

Versioning best practice

When you create a dataset version, you're not creating an extra copy of the data in your workspace. Because datasets are references to the data in your storage service, you have a single source of truth, managed by your storage service.

Important

If the data referenced by your dataset is overwritten or deleted, calling a specific version of the dataset doesn't revert the change.

When you load data from a dataset, the current data content referenced by the dataset is always loaded. If you want to make sure that each dataset version is reproducible, we recommend that you not modify the data content referenced by a dataset version. When new data comes in, save the new data files into a separate data folder, and then create a new dataset version that includes the data from that new folder.

The following image and sample code show the recommended way to structure your data folders and to create dataset versions that reference those folders:

Folder structure

from azureml.core import Dataset

# get the default datastore of the workspace
datastore = workspace.get_default_datastore()

# create & register weather_ds version 1 pointing to all files in the folder of week 27
datastore_path1 = [(datastore, 'Weather/week 27')]
dataset1 = Dataset.File.from_files(path=datastore_path1)
dataset1.register(workspace = workspace,
                  name = 'weather_ds',
                  description = 'weather data in week 27',
                  create_new_version = True)

# create & register weather_ds version 2 pointing to all files in the folder of week 27 and 28
datastore_path2 = [(datastore, 'Weather/week 27'), (datastore, 'Weather/week 28')]
dataset2 = Dataset.File.from_files(path = datastore_path2)
dataset2.register(workspace = workspace,
                  name = 'weather_ds',
                  description = 'weather data in week 27, 28',
                  create_new_version = True)
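
To confirm which files each registered version references, you can retrieve the versions by number and list their file paths. A minimal sketch using the FileDataset to_path() method:

# retrieve each registered version and list the files it references
ds_v1 = Dataset.get_by_name(workspace, name='weather_ds', version=1)
ds_v2 = Dataset.get_by_name(workspace, name='weather_ds', version=2)
print(ds_v1.to_path())  # files under Weather/week 27
print(ds_v2.to_path())  # files under Weather/week 27 and Weather/week 28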

Version a pipeline output dataset

You can use a dataset as the input and output of each Machine Learning pipeline step. When you rerun pipelines, the output of each pipeline step is registered as a new dataset version.

Because Machine Learning pipelines populate the output of each step into a new folder every time the pipeline reruns, the versioned output datasets are reproducible. Learn more about datasets in pipelines.

from azureml.core import Dataset
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.core.runconfig import CondaDependencies, RunConfiguration

# get input dataset 
input_ds = Dataset.get_by_name(workspace, 'weather_ds')

# register pipeline output as dataset
output_ds = PipelineData('prepared_weather_ds', datastore=datastore).as_dataset()
output_ds = output_ds.register(name='prepared_weather_ds', create_new_version=True)

conda = CondaDependencies.create(
    pip_packages=['azureml-defaults', 'azureml-dataprep[fuse,pandas]'], 
    pin_sdk_version=False)

run_config = RunConfiguration()
run_config.environment.docker.enabled = True
run_config.environment.python.conda_dependencies = conda

# configure pipeline step to use dataset as the input and output
prep_step = PythonScriptStep(script_name="prepare.py",
                             inputs=[input_ds.as_named_input('weather_ds')],
                             outputs=[output_ds],
                             runconfig=run_config,
                             compute_target=compute_target,
                             source_directory=project_folder)
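
To actually produce a new version of prepared_weather_ds, submit the step as part of a pipeline run. A minimal sketch, with the experiment name chosen here only for illustration:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# build a pipeline from the configured step and submit it; every rerun
# registers the step output as a new version of 'prepared_weather_ds'
pipeline = Pipeline(workspace=workspace, steps=[prep_step])
pipeline_run = Experiment(workspace, 'prepare-weather').submit(pipeline)
pipeline_run.wait_for_completion()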

Track datasets in experiments

For each Machine Learning experiment, you can easily trace the datasets used as the input through the experiment Run object.

The following code uses the get_details() method to track which input datasets were used with the experiment run:

# get input datasets
inputs = run.get_details()['inputDatasets']
input_dataset = inputs[0]['dataset']

# list the files referenced by input_dataset
input_dataset.to_path()
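
The run object can come from any submitted experiment. As a minimal sketch, assuming the keras-mnist experiment name used later in this article:

from azureml.core import Experiment

# get_runs() yields runs in reverse chronological order, so next() returns
# the most recent run of the experiment
experiment = Experiment(workspace, 'keras-mnist')
run = next(experiment.get_runs())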

You can also find the input_datasets from experiments by using https://studio.ml.azure.cn/.

The following image shows where to find the input dataset of an experiment in Azure Machine Learning studio. For this example, go to your Experiments pane and open the Properties tab for a specific run of your experiment, keras-mnist.

Input datasets

Use the following code to register models with datasets:

model = run.register_model(model_name='keras-mlp-mnist',
                           model_path=model_path,
                           datasets=[('training data', train_dataset)])

After registration, you can see the list of models registered with the dataset by using Python or by going to https://studio.ml.azure.cn/.
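
As a minimal sketch of the Python route, assuming the datasets attribute on the Model class (a mapping from the purpose string passed at registration to the associated datasets):

from azureml.core import Model

# retrieve the registered model and inspect the datasets attached to it
model = Model(workspace, name='keras-mlp-mnist')
print(model.datasets)  # for example: {'training data': [<dataset>]}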

The following view is from the Datasets pane under Assets. Select the dataset, and then select the Models tab for a list of the models that are registered with the dataset.

Input dataset models

Next steps