Datasets in Azure Data Factory

APPLIES TO: Azure Data Factory | Azure Synapse Analytics (Preview)

This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure Data Factory pipelines.

If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.

Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked data store, and the linked service defines the connection to the data source. For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the folder within that Azure Storage account that contains the input blobs to be processed.
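
To make the distinction concrete, here is a minimal sketch of what an Azure Blob Storage linked service definition could look like, assuming the store is linked through an account connection string. The service name and the placeholder connection-string values are illustrative, not values from this article:

{
    "name": "AzureBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}

A dataset then points at this linked service by name, layering the "what data" (container, folder, format) on top of the "how to connect" defined here.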

Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create two linked services: Azure Blob Storage and Azure SQL Database. Then, you create two datasets: a Delimited Text dataset (which refers to the Azure Blob Storage linked service, assuming you have text files as the source) and an Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure Blob Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and Azure SQL Database, respectively. The Delimited Text dataset specifies the blob container and blob folder that contain the input blobs in your Blob storage, along with format-related settings. The Azure SQL Table dataset specifies the SQL table in your SQL Database to which the data is to be copied.
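
The two datasets then plug into a copy activity as its input and output. Below is a hedged sketch of what that wiring could look like inside a pipeline definition; the activity and dataset names are hypothetical:

{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "DelimitedTextInput", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "AzureSqlTableOutput", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}

Note that the activity never names the data stores directly: it references datasets, which in turn reference linked services.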

The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

(Diagram: relationship between pipeline, activity, dataset, and linked service)

Dataset JSON

A dataset in Data Factory is defined in the following JSON format:

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: DelimitedText, AzureSqlTable, etc.>",
        "linkedServiceName": {
            "referenceName": "<name of linked service>",
            "type": "LinkedServiceReference"
        },
        "schema": [
        ],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>"
        }
    }
}

The following table describes the properties in the above JSON:

Property | Description | Required
name | Name of the dataset. See Azure Data Factory - Naming rules. | Yes
type | Type of the dataset. Specify one of the types supported by Data Factory (for example: DelimitedText, AzureSqlTable). For details, see Dataset type. | Yes
schema | Schema of the dataset; represents the physical data type and shape. | No
typeProperties | The type properties are different for each type. For details on the supported types and their properties, see Dataset type. | Yes

When you import the schema of a dataset, select the Import Schema button and choose to import from the source or from a local file. In most cases, you'll import the schema directly from the source. But if you already have a local schema file (a Parquet file or a CSV with headers), you can direct Data Factory to base the schema on that file.

In a copy activity, datasets are used in the source and the sink. The schema defined in a dataset is optional as a reference. If you want to apply column/field mapping between source and sink, see Schema and type mapping.
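
As an illustration, a populated schema array lists each column's name and physical type. The column names below are made up for the example:

"schema": [
    { "name": "Id", "type": "Int32" },
    { "name": "Name", "type": "String" },
    { "name": "LastModified", "type": "DateTime" }
]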

Dataset type

Azure Data Factory supports many different types of datasets, depending on the data stores you use. You can find the list of data stores supported by Data Factory in the Connector overview article. Click a data store to learn how to create a linked service and a dataset for it.

For example, for a Delimited Text dataset, the dataset type is set to DelimitedText, as shown in the following JSON sample:

{
    "name": "DelimitedTextInput",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "input.log",
                "folderPath": "inputdata",
                "container": "adfgetstarted"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}
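
For comparison, a sketch of the Azure SQL Table dataset from the earlier copy scenario might look like the following. This is one plausible shape for the AzureSqlTable dataset type; the linked service name and the schema and table values are hypothetical:

{
    "name": "AzureSqlTableOutput",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabase",
            "type": "LinkedServiceReference"
        },
        "type": "AzureSqlTable",
        "typeProperties": {
            "schema": "dbo",
            "table": "EmergencyData"
        }
    }
}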

Create datasets

You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, or the Azure portal.

Next steps

See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these tools or SDKs.