Access datasets with Python using the Azure Machine Learning Python client library

The preview of the Microsoft Azure Machine Learning Python client library enables secure access to your Azure Machine Learning datasets from a local Python environment, and enables the creation and management of datasets in a workspace.

This topic provides instructions on how to:

  • install the Machine Learning Python client library
  • access and upload datasets, including instructions on how to get authorization to access Azure Machine Learning datasets from your local Python environment
  • access intermediate datasets from experiments
  • use the Python client library to enumerate datasets, access metadata, read the contents of a dataset, create new datasets, and update existing datasets

Prerequisites

The Python client library has been tested under the following environments:

  • Windows, Mac, and Linux
  • Python 2.7, 3.3, and 3.4

It has a dependency on the following packages:

  • requests
  • python-dateutil
  • pandas

We recommend using a Python distribution such as Anaconda or Canopy, which come with Python, IPython, and the three packages listed above installed. Although IPython is not strictly required, it is a great environment for manipulating and visualizing data interactively.

How to install the Azure Machine Learning Python client library

Install the Azure Machine Learning Python client library to complete the tasks outlined in this topic. The library is available from the Python Package Index. To install it, run the following command from your local Python environment:

pip install azureml

Alternatively, you can download and install from the sources on GitHub:

python setup.py install

If you have git installed on your machine, you can use pip to install directly from the git repository:

pip install git+https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python.git

Use code snippets to access datasets

The Python client library gives you programmatic access to your existing datasets from experiments that have been run.

From the Azure Machine Learning Studio (classic) web interface, you can generate code snippets that include all the necessary information to download and deserialize datasets as pandas DataFrame objects on your local machine.
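
For illustration, a generated snippet has roughly the following shape. This is only a sketch: the workspace ID, authorization token, and dataset name below are placeholders, and the exact code Studio produces for you may differ.

from azureml import Workspace

# Placeholders -- the generated snippet embeds your real workspace ID and token
ws = Workspace(workspace_id='<workspace id>',
               authorization_token='<authorization token>')

# Fetch the dataset by name and deserialize it into a pandas DataFrame
ds = ws.datasets['<dataset name>']
frame = ds.to_dataframe()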

Security for data access

The code snippets provided by Azure Machine Learning Studio (classic) for use with the Python client library include your workspace ID and authorization token. These provide full access to your workspace and must be protected, like a password.

For security reasons, the code snippet functionality is only available to users that have their role set as Owner for the workspace. Your role is displayed in Azure Machine Learning Studio (classic) on the USERS page under Settings.

If your role is not set as Owner, you can either request to be reinvited as an owner, or ask the owner of the workspace to provide you with the code snippet.

To obtain the authorization token, you may choose one of these options:

  • Ask for a token from an owner. Owners can access their authorization tokens from the Settings page of their workspace in Azure Machine Learning Studio (classic). Select Settings from the left pane and click AUTHORIZATION TOKENS to see the primary and secondary tokens. Although either the primary or the secondary authorization token can be used in the code snippet, it is recommended that owners share only the secondary authorization token.

  • Ask to be promoted to the role of owner: a current owner of the workspace needs to first remove you from the workspace and then reinvite you to it as an owner.

Once developers have obtained the workspace ID and authorization token, they are able to access the workspace using the code snippet regardless of their role.

Authorization tokens are managed on the AUTHORIZATION TOKENS page under SETTINGS. You can regenerate them, but doing so revokes access for the previous token.

Access datasets from a local Python application

  1. In Machine Learning Studio (classic), click DATASETS in the navigation bar on the left.

  2. Select the dataset you would like to access. You can select any of the datasets from the MY DATASETS list or from the SAMPLES list.

  3. From the bottom toolbar, click Generate Data Access Code. If the data is in a format incompatible with the Python client library, this button is disabled.

  4. Select the code snippet from the window that appears and copy it to your clipboard.

  5. Paste the code into the notebook of your local Python application.

Access intermediate datasets from Machine Learning experiments

After an experiment is run in Machine Learning Studio (classic), it is possible to access the intermediate datasets from the output nodes of modules. Intermediate datasets are the data created and used for intermediate steps when a model has been run.

Intermediate datasets can be accessed as long as the data format is compatible with the Python client library.

The following formats are supported (constants for these formats are in the azureml.DataTypeIds class):

  • PlainText
  • GenericCSV
  • GenericTSV
  • GenericCSVNoHeader
  • GenericTSVNoHeader

You can determine the format by hovering over a module output node. It is displayed along with the node name, in a tooltip.

Some modules, such as the Split module, output to a format named Dataset, which is not supported by the Python client library.

You need to use a conversion module, such as Convert to CSV, to get the output into a supported format.

The following steps show an example that creates an experiment, runs it, and accesses the intermediate dataset.

  1. Create a new experiment.

  2. Insert an Adult Census Income Binary Classification dataset module.

  3. Insert a Split module, and connect its input to the dataset module output.

  4. Insert a Convert to CSV module and connect its input to one of the Split module outputs.

  5. Save the experiment, run it, and wait for the job to finish.

  6. Click the output node on the Convert to CSV module.

  7. When the context menu appears, select Generate Data Access Code.

  8. Select the code snippet from the window that appears and copy it to your clipboard.

  9. Paste the code in your notebook.

  10. You can visualize the data using matplotlib, for example as a histogram of the age column; see the sketch below.

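As a minimal sketch, assuming the pasted snippet produced a pandas DataFrame named frame and that the dataset includes an age column, a histogram could be plotted as follows:

import matplotlib.pyplot as plt

# 'frame' is assumed to be the DataFrame returned by the generated data access code
frame['age'].hist(bins=20)
plt.xlabel('age')
plt.ylabel('count')
plt.show()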

Use the Machine Learning Python client library to access, read, create, and manage datasets

Workspace

The workspace is the entry point for the Python client library. Provide the Workspace class with your workspace ID and authorization token to create an instance:

from azureml import Workspace

ws = Workspace(workspace_id='4c29e1adeba2e5a7cbeb0e4f4adfb4df',
               authorization_token='f4f3ade2c6aefdb1afb043cd8bcf3daf')

Enumerate datasets

To enumerate all datasets in a given workspace:

for ds in ws.datasets:
    print(ds.name)

To enumerate just the user-created datasets:

for ds in ws.user_datasets:
    print(ds.name)

To enumerate just the example datasets:

for ds in ws.example_datasets:
    print(ds.name)

You can access a dataset by name (which is case-sensitive):

ds = ws.datasets['my dataset name']

Or you can access it by index:

ds = ws.datasets[0]

Metadata

Datasets have metadata, in addition to content. (Intermediate datasets are an exception to this rule and do not have any metadata.)

Some metadata values are assigned by the user at creation time:

  • print(ds.name)
  • print(ds.description)
  • print(ds.family_id)
  • print(ds.data_type_id)

Other values are assigned by Azure ML:

  • print(ds.id)
  • print(ds.created_date)
  • print(ds.size)

See the SourceDataset class for more on the available metadata.

Read contents

The code snippets provided by Machine Learning Studio (classic) automatically download and deserialize the dataset to a pandas DataFrame object. This is done with the to_dataframe method:

frame = ds.to_dataframe()

If you prefer to download the raw data and perform the deserialization yourself, that is an option. At the moment, this is the only option for formats such as 'ARFF', which the Python client library cannot deserialize.

To read the contents as text:

text_data = ds.read_as_text()

To read the contents as binary:

binary_data = ds.read_as_binary()

You can also just open a stream to the contents:

with ds.open() as file:
    binary_data_chunk = file.read(1000)

Create a new dataset

The Python client library allows you to upload datasets from your Python program. These datasets are then available for use in your workspace.

If you have your data in a pandas DataFrame, use the following code:

from azureml import DataTypeIds

dataset = ws.datasets.add_from_dataframe(
    dataframe=frame,
    data_type_id=DataTypeIds.GenericCSV,
    name='my new dataset',
    description='my description'
)

If your data is already serialized, you can use:

from azureml import DataTypeIds

dataset = ws.datasets.add_from_raw_data(
    raw_data=raw_data,
    data_type_id=DataTypeIds.GenericCSV,
    name='my new dataset',
    description='my description'
)

The Python client library is able to serialize a pandas DataFrame to the following formats (constants for these are in the azureml.DataTypeIds class):

  • PlainText
  • GenericCSV
  • GenericTSV
  • GenericCSVNoHeader
  • GenericTSVNoHeader

Update an existing dataset

If you try to upload a new dataset with a name that matches an existing dataset, you should get a conflict error.

To update an existing dataset, you first need to get a reference to the existing dataset:

dataset = ws.datasets['existing dataset']

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name)         # 'existing dataset'
print(dataset.description)  # 'data up to jan 2015'

Then use update_from_dataframe to serialize and replace the contents of the dataset on Azure:

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(frame2)

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name)         # 'existing dataset'
print(dataset.description)  # 'data up to jan 2015'

If you want to serialize the data to a different format, specify a value for the optional data_type_id parameter.

from azureml import DataTypeIds

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
    dataframe=frame2,
    data_type_id=DataTypeIds.GenericTSV,
)

print(dataset.data_type_id) # 'GenericTSV'
print(dataset.name)         # 'existing dataset'
print(dataset.description)  # 'data up to jan 2015'

You can optionally set a new description by specifying a value for the description parameter.

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
    dataframe=frame2,
    description='data up to feb 2015',
)

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name)         # 'existing dataset'
print(dataset.description)  # 'data up to feb 2015'

You can optionally set a new name by specifying a value for the name parameter. From now on, you'll retrieve the dataset using the new name only. The following code updates the data, name, and description.

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
    dataframe=frame2,
    name='existing dataset v2',
    description='data up to feb 2015',
)

print(dataset.data_type_id)                    # 'GenericCSV'
print(dataset.name)                            # 'existing dataset v2'
print(dataset.description)                     # 'data up to feb 2015'

print(ws.datasets['existing dataset v2'].name) # 'existing dataset v2'
print(ws.datasets['existing dataset'].name)    # IndexError

The data_type_id, name, and description parameters are optional and default to their previous values. The dataframe parameter is always required.

If your data is already serialized, use update_from_raw_data instead of update_from_dataframe. It works in a similar way; you just pass in raw_data instead of a dataframe.
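
As a minimal sketch, assuming raw_data holds content that is already serialized in the dataset's format (here GenericCSV, read from a hypothetical local file named mydata.csv), the call might look like the following; the keyword arguments mirror those of update_from_dataframe and should be checked against your version of the library:

from azureml import DataTypeIds

# Read already-serialized content from a hypothetical local CSV file
with open('mydata.csv', 'rb') as f:
    raw_data = f.read()

dataset = ws.datasets['existing dataset']
dataset.update_from_raw_data(
    raw_data=raw_data,
    data_type_id=DataTypeIds.GenericCSV,
)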