Import Data module

This article describes a module in Azure Machine Learning designer.

Use this module to load data into a machine learning pipeline from existing cloud data services.

Note

All functionality provided by this module is also available through datastores and datasets on the workspace landing page. We recommend that you use datastores and datasets, which include additional features such as data monitoring. To learn more, see the How to Access Data and How to Register Datasets articles. After you register a dataset, you can find it in the Datasets -> My Datasets category in the designer interface. This module is retained for Studio (classic) users to give them a familiar experience.

The Import Data module supports reading data from the following sources:

  • URL via HTTP
  • Azure cloud storage through Datastores
    • Azure Blob Container
    • Azure File Share
    • Azure Data Lake
    • Azure Data Lake Gen2
    • Azure SQL Database
    • Azure PostgreSQL

Before using cloud storage, you need to register a datastore in your Azure Machine Learning workspace. For more information, see How to Access Data.

After you define the data you want and connect to the source, Import Data infers the data type of each column based on the values it contains, and loads the data into your designer pipeline. The output of Import Data is a dataset that can be used with any designer pipeline.

If your source data changes, you can refresh the dataset and add new data by rerunning Import Data.
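
Whether a rerun actually reads the source again depends on the Regenerate output setting described in the configuration steps. The following is a minimal sketch of that reuse rule, not the designer's implementation. The names `run_import`, `load_fn`, and `_previous_outputs` are hypothetical, and keying earlier outputs by a hash of the parameters is an assumption made for illustration.

```python
import hashlib
import json

# Hypothetical in-memory store standing in for the outputs of earlier runs.
_previous_outputs = {}


def run_import(params, load_fn, regenerate_output=False):
    """Return the module output, reusing an earlier result when allowed.

    params            -- dict of module settings (source, path, delimiter, ...)
    load_fn           -- callable that actually reads the source data
    regenerate_output -- mirrors the Regenerate output checkbox
    """
    # Identical parameters always produce the same key.
    key = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")
    ).hexdigest()

    # Unselected (the default): reuse the output of the last matching run.
    if not regenerate_output and key in _previous_outputs:
        return _previous_outputs[key]

    # Selected, or no earlier run: read the source again and remember it.
    output = load_fn(params)
    _previous_outputs[key] = output
    return output
```

In this sketch, calling `run_import` twice with the same `params` returns the first result even if the source changed in between. Passing `regenerate_output=True` forces a fresh read.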

Warning

If your workspace is in a virtual network, you must configure your datastores before you can use the designer's data visualization features. For more information on how to use datastores and datasets in a virtual network, see Use Azure Machine Learning studio in an Azure virtual network.

How to configure Import Data

  1. Add the Import Data module to your pipeline. You can find this module in the Data Input and Output category in the designer.

  2. Select the module to open the right pane.

  3. Select Data source, and choose the data source type. It can be HTTP or datastore.

    If you choose datastore, you can select an existing datastore that is already registered to your Azure Machine Learning workspace, or create a new datastore. Then define the path of the data to import in the datastore. You can browse to the path by clicking Browse Path.

  4. Select Preview schema to filter the columns you want to include. You can also define advanced settings, such as Delimiter, in Parsing options.

  5. The Regenerate output checkbox determines whether the module is executed again at run time to regenerate its output.

    It is unselected by default, which means that if the module has previously been executed with the same parameters, the system reuses the output from the last run to reduce run time.

    If it is selected, the system executes the module again to regenerate the output. Select this option when the underlying data in storage has been updated, so that the pipeline picks up the latest data.

  6. Submit the pipeline.

    When Import Data loads the data into the designer, it infers the data type of each column based on the values it contains, either numerical or categorical.

    If a header is present, the header is used to name the columns of the output dataset.

    If there are no existing column headers in the data, new column names are generated using the format col1, col2, …, coln.
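
The column naming and type inference in step 6 can be pictured with the sketch below. It is an illustration under stated assumptions, not the module's actual logic. The functions `load_table` and `is_number` are hypothetical, and the rule "a column is numerical when every non-empty value parses as a float" is an assumption, since the exact inference rules are not described here.

```python
import csv
import io


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_table(text, delimiter=",", has_header=True):
    """Parse delimited text and return (column_names, column_types, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], [], []

    if has_header:
        # A header row names the output columns.
        names, data = rows[0], rows[1:]
    else:
        # No header: generate col1, col2, ..., coln.
        names = [f"col{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows

    types = []
    for index in range(len(names)):
        # Collect the non-empty values of this column.
        values = [
            row[index] for row in data
            if index < len(row) and row[index] != ""
        ]
        # Assumed rule: numerical only if every value parses as a float.
        numeric = bool(values) and all(is_number(v) for v in values)
        types.append("numerical" if numeric else "categorical")

    return names, types, data
```

For example, `load_table("1,red\n2,blue\n", has_header=False)` names the columns `col1` and `col2` and classifies them as numerical and categorical. The `delimiter` argument plays the role of the Delimiter setting in Parsing options.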

Results

When the import completes, click the output dataset and select Visualize to see whether the data was imported successfully.

If you want to save the data for reuse, rather than importing a new set of data each time the pipeline is run, select the Register dataset icon under the Outputs tab in the right panel of the module. Choose a name for the dataset. The saved dataset preserves the data as it was at the time of saving. It is not updated when the pipeline is rerun, even if the dataset in the pipeline changes. This can be useful for taking snapshots of data.

After you import the data, it might need some additional preparation for modeling and analysis:

  • Use Edit Metadata to change column names, to handle a column as a different data type, or to indicate that some columns are labels or features.

  • Use Select Columns in Dataset to select a subset of columns to transform or use in modeling. The transformed or removed columns can easily be rejoined to the original dataset by using the Add Columns module.

  • Use Partition and Sample to divide the dataset, perform sampling, or get the top n rows.

Next steps

See the set of modules available to Azure Machine Learning.