Datasets in Azure Data Factory

This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure Data Factory pipelines.

If you are new to Data Factory, see Introduction to Azure Data Factory for an overview.

Overview

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. A dataset is a named view of data that points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.

Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information that Data Factory needs to connect to external resources. Think of it this way: the dataset represents the structure of the data within the linked data store, and the linked service defines the connection to the data source. For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the folder within that Azure storage account that contains the input blobs to be processed.

Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked services: Azure Storage and Azure SQL Database. Then, you create two datasets: an Azure Blob dataset (which refers to the Azure Storage linked service) and an Azure SQL Table dataset (which refers to the Azure SQL Database linked service). The Azure Storage and Azure SQL Database linked services contain the connection strings that Data Factory uses at runtime to connect to your Azure Storage account and Azure SQL database, respectively. The Azure Blob dataset specifies the blob container and blob folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the SQL table in your SQL database to which the data is to be copied.
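For instance, the Azure Storage linked service in this scenario could be defined roughly as follows. This is a sketch: the account name and key are placeholders, and the exact typeProperties depend on the connector version you use.

```json
{
    "name": "MyAzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
            }
        }
    }
}
```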

The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory:

(Diagram: relationship among pipeline, activity, dataset, and linked service)

Dataset JSON

A dataset in Data Factory is defined in the following JSON format:

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: AzureBlob, AzureSql etc...>",
        "linkedServiceName": {
                "referenceName": "<name of linked service>",
                "type": "LinkedServiceReference"
        },
        "structure": [
            {
                "name": "<Name of the column>",
                "type": "<Name of the type>"
            }
        ],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>"
        }
    }
}

The following properties are used in the dataset JSON above:

  • name: Name of the dataset. See Azure Data Factory - Naming rules. Required: Yes.
  • type: Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset type. Required: Yes.
  • structure: Schema of the dataset. For details, see Dataset structure. Required: No.
  • typeProperties: The type properties differ for each dataset type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type. Required: Yes.
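To make these rules concrete, here is a small Python sketch (a hypothetical helper, not part of any Data Factory SDK) that checks a dataset definition for the properties listed above:

```python
import json

# Required and optional top-level keys under "properties", per the list above.
REQUIRED = {"type", "linkedServiceName", "typeProperties"}
OPTIONAL = {"structure"}

def validate_dataset(definition: str) -> list:
    """Return a list of problems found in a dataset JSON definition.

    An empty list means the definition passes these basic checks.
    """
    problems = []
    ds = json.loads(definition)
    if "name" not in ds:
        problems.append("missing dataset name")
    props = ds.get("properties", {})
    for key in REQUIRED - props.keys():
        problems.append(f"missing required property: {key}")
    for key in props.keys() - REQUIRED - OPTIONAL:
        problems.append(f"unknown property: {key}")
    return problems

good = ('{"name": "DatasetSample", "properties": {"type": "AzureSqlTable", '
        '"linkedServiceName": {"referenceName": "MyAzureSqlLinkedService", '
        '"type": "LinkedServiceReference"}, '
        '"typeProperties": {"tableName": "MyTable"}}}')
print(validate_dataset(good))  # []
```

A definition missing, say, typeProperties would come back with a corresponding problem entry instead of an empty list.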

Dataset example

In the following example, the dataset represents a table named MyTable in a SQL database.

{
    "name": "DatasetSample",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
                "referenceName": "MyAzureSqlLinkedService",
                "type": "LinkedServiceReference"
        },
        "typeProperties":
        {
            "tableName": "MyTable"
        }
    }
}

Note the following points:

  • type is set to AzureSqlTable.
  • The tableName type property (specific to the AzureSqlTable type) is set to MyTable.
  • linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in the next JSON snippet.
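For reference, the MyAzureSqlLinkedService linked service could look roughly like the following. This is a sketch: the server, database, and credential values are placeholders, and the exact connection string depends on your configuration.

```json
{
    "name": "MyAzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>@<server>;Password=<password>;Encrypt=True"
            }
        }
    }
}
```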

Dataset type

There are many different types of datasets, depending on the data store you use. You can find the list of data stores supported by Data Factory in the Connector overview article. Click a data store to learn how to create a linked service and a dataset for that data store.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly, for an Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following JSON:

{
    "name": "AzureBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "fileName": "input.log",
            "folderPath": "adfgetstarted/inputdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        }
    }
}

Dataset structure

You use the structure section to provide type information that is used to convert types and map columns from the source to the destination.

Each column in the structure contains the following properties:

  • name: Name of the column. Required: Yes.
  • type: Data type of the column. Data Factory supports the following interim data types as allowed values: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime, Datetimeoffset, and Timespan. Required: No.
  • culture: .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset. The default is en-us. Required: No.
  • format: Format string to be used when the type is a .NET type: Datetime or Datetimeoffset. See Custom Date and Time Format Strings for how to format datetime values. Required: No.

Example

In the following example, suppose the source Blob data is in CSV format and contains three columns: userid, name, and lastlogindate. They are of type Int64, String, and Datetime, respectively, and the Datetime column uses a custom datetime format with abbreviated French names for the day of the week.

Define the Blob dataset structure as follows, along with type definitions for the columns:

"structure":
[
    { "name": "userid", "type": "Int64"},
    { "name": "name", "type": "String"},
    { "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-yyyy"}
]
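To illustrate what the culture and format pair means here, the following Python sketch parses values in this custom format. The abbreviated French day names are hardcoded as an assumption of what the .NET fr-fr culture produces; in Data Factory itself, parsing is done by .NET.

```python
from datetime import datetime

# Abbreviated French day names (an approximation of what the .NET fr-FR
# culture uses; hardcoded here purely for illustration).
FR_ABBREVIATED_DAYS = {"lun.", "mar.", "mer.", "jeu.", "ven.", "sam.", "dim."}

def parse_lastlogindate(value: str) -> datetime:
    """Parse a 'ddd-MM-yyyy' value such as 'mer.-01-2017'.

    The format carries no day-of-month, so the day defaults to 1.
    """
    day_name, month, year = value.rsplit("-", 2)
    if day_name not in FR_ABBREVIATED_DAYS:
        raise ValueError(f"not an abbreviated French day name: {day_name!r}")
    return datetime(int(year), int(month), 1)

print(parse_lastlogindate("mer.-01-2017"))  # 2017-01-01 00:00:00
```

A value whose day token is not a French abbreviation (for example "wed-01-2017") is rejected, mirroring how a mismatched culture would cause a parse failure.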

Guidance

The following guidelines help you understand when to include structure information, and what to include in the structure section. To learn more about how Data Factory maps source data to the sink and when to specify structure information, see Schema and type mapping.

  • For strong-schema data sources, specify the structure section only if you want to map source columns to sink columns and their names are not the same. This kind of structured data source stores the data schema and type information along with the data itself. Examples of structured data sources include SQL Server, Oracle, and Azure SQL Database.

    Because type information is already available for structured data sources, you should not include type information when you do include the structure section.
  • For no/weak-schema data sources (for example, a text file in Blob storage), include structure when the dataset is an input for a copy activity and the data types of the source dataset should be converted to the native types of the sink. Also include structure when you want to map source columns to sink columns.
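For example, a structure section used purely to map columns from a strong-schema source lists column names only, without types (the names below are illustrative):

```json
"structure":
[
    { "name": "userid" },
    { "name": "username" },
    { "name": "lastlogindate" }
]
```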

Create datasets

You can create datasets by using any of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource Manager template, or the Azure portal.

Next steps

See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these tools or SDKs.