Copy and transform data in Azure Data Lake Storage Gen2 using Azure Data Factory

Applies to: Azure Data Factory, Azure Synapse Analytics

Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics, built into Azure Blob storage. You can use it to interface with your data by using both file system and object storage paradigms.

This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Data Lake Storage Gen2, and how to use Data Flow to transform data in Azure Data Lake Storage Gen2. To learn about Azure Data Factory, read the introductory article.

Tip

For data lake or data warehouse migration scenarios, learn more from Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure.

Supported capabilities

This Azure Data Lake Storage Gen2 connector is supported for the following activities:

For the Copy activity, with this connector you can:

Get started

Tip

For a walk-through of how to use the Data Lake Storage Gen2 connector, see Load data into Azure Data Lake Storage Gen2.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

The following sections provide information about properties that are used to define Data Factory entities specific to Data Lake Storage Gen2.

Linked service properties

The Azure Data Lake Storage Gen2 connector supports the following authentication types. See the corresponding sections for details:

Note

  • If you want to use the public Azure integration runtime to connect to Data Lake Storage Gen2 by leveraging the Allow trusted Microsoft services to access this storage account option enabled on the Azure Storage firewall, you must use managed identity authentication.
  • When you use PolyBase or the COPY statement to load data into Azure Synapse Analytics, if your source or staging Data Lake Storage Gen2 is configured with an Azure Virtual Network endpoint, you must use managed identity authentication as required by Synapse. See the managed identity authentication section for more configuration prerequisites.

Account key authentication

To use storage account key authentication, the following properties are supported:

Property    Description    Required
type    The type property must be set to AzureBlobFS.    Yes
url    Endpoint for Data Lake Storage Gen2 with the pattern of https://<accountname>.dfs.core.chinacloudapi.cn.    Yes
accountKey    Account key for Data Lake Storage Gen2. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.    Yes
connectVia    The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is in a private network. If this property isn't specified, the default Azure integration runtime is used.    No

Note

A secondary ADLS file system endpoint is not supported when using account key authentication. You can use other authentication types.

Example:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn", 
            "accountkey": { 
                "type": "SecureString", 
                "value": "<accountkey>" 
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
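
As the table above notes, the account key can also be referenced from Azure Key Vault instead of being stored inline. A minimal sketch, assuming an Azure Key Vault linked service and a secret name of your choosing (both placeholders):

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name that stores the account key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}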

Service principal authentication

To use service principal authentication, follow these steps.

  1. Register an application entity in Azure Active Directory (Azure AD) by following the steps in Register your application with an Azure AD tenant. Make note of the following values, which you use to define the linked service:

    • Application ID
    • Application key
    • Tenant ID
  2. Grant the service principal proper permission. See examples of how permissions work in Data Lake Storage Gen2 in Access control lists on files and directories.

    • As source: In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file system, along with Read permission for the files to copy. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Reader role.
    • As sink: In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file system, along with Write permission for the sink folder. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Contributor role.

Note

If you use the Data Factory UI to author and the service principal is not set with the "Storage Blob Data Reader/Contributor" role in IAM, when doing a test connection or browsing/navigating folders, choose "Test connection to file path" or "Browse from specified path", and specify a path with Read + Execute permission to continue.

These properties are supported for the linked service:

Property    Description    Required
type    The type property must be set to AzureBlobFS.    Yes
url    Endpoint for Data Lake Storage Gen2 with the pattern of https://<accountname>.dfs.core.chinacloudapi.cn.    Yes
servicePrincipalId    Specify the application's client ID.    Yes
servicePrincipalCredentialType    The credential type to use for service principal authentication. Allowed values are ServicePrincipalKey and ServicePrincipalCert.    Yes
servicePrincipalCredential    The service principal credential.
When you use ServicePrincipalKey as the credential type, specify the application's key. Mark this field as SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.
When you use ServicePrincipalCert as the credential, reference a certificate in Azure Key Vault.
Yes
servicePrincipalKey    Specify the application's key. Mark this field as SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.
This property is still supported as-is for servicePrincipalId + servicePrincipalKey. As ADF adds new service principal certificate authentication, the new model for service principal authentication is servicePrincipalId + servicePrincipalCredentialType + servicePrincipalCredential.
No
tenant    Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering the mouse pointer in the upper-right corner of the Azure portal.    Yes
azureCloudType    For service principal authentication, specify the type of Azure cloud environment to which your Azure Active Directory application is registered.
The allowed value is AzureChina. By default, the data factory's cloud environment is used.
No
connectVia    The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is in a private network. If not specified, the default Azure integration runtime is used.    No

Example: using service principal key authentication

You can also store the service principal key in Azure Key Vault.

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn", 
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalCredentialType": "ServicePrincipalKey",
            "servicePrincipalCredential": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant info, e.g. microsoft.partner.onmschina.cn>" 
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
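
To store the service principal key in Azure Key Vault, reference the secret from servicePrincipalCredential instead of embedding the key. A minimal sketch, with the Key Vault linked service name and secret name as placeholders:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalCredentialType": "ServicePrincipalKey",
            "servicePrincipalCredential": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<AKV linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name that stores the service principal key>"
            },
            "tenant": "<tenant info, e.g. microsoft.partner.onmschina.cn>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}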

Example: using service principal certificate authentication

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn", 
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalCredentialType": "ServicePrincipalCert",
            "servicePrincipalCredential": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "<AKV reference>", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "<certificate name in AKV>" 
            },
            "tenant": "<tenant info, e.g. microsoft.partner.onmschina.cn>" 
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Managed identities for Azure resources authentication

A data factory can be associated with a managed identity for Azure resources, which represents this specific data factory. You can directly use this managed identity for Data Lake Storage Gen2 authentication, similar to using your own service principal. It allows this designated factory to access and copy data to or from your Data Lake Storage Gen2.

To use managed identities for Azure resources authentication, follow these steps.

  1. Retrieve the Data Factory managed identity information by copying the value of the managed identity object ID generated along with your factory.

  2. Grant the managed identity proper permission. See examples of how permissions work in Data Lake Storage Gen2 in Access control lists on files and directories.

    • As source: In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file system, along with Read permission for the files to copy. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Reader role.
    • As sink: In Storage Explorer, grant at least Execute permission for ALL upstream folders and the file system, along with Write permission for the sink folder. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Contributor role.

Note

If you use the Data Factory UI to author and the managed identity is not set with the "Storage Blob Data Reader/Contributor" role in IAM, when doing a test connection or browsing/navigating folders, choose "Test connection to file path" or "Browse from specified path", and specify a path with Read + Execute permission to continue.

Important

If you use PolyBase or the COPY statement to load data from Data Lake Storage Gen2 into Azure Synapse Analytics, when you use managed identity authentication for Data Lake Storage Gen2, make sure you also follow steps 1 to 3 in this guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your server. Data Factory handles the rest. If you configure Blob storage with an Azure Virtual Network endpoint, you also need to have Allow trusted Microsoft services to access this storage account turned on under the Azure Storage account Firewalls and Virtual networks settings menu, as required by Synapse.

These properties are supported for the linked service:

Property    Description    Required
type    The type property must be set to AzureBlobFS.    Yes
url    Endpoint for Data Lake Storage Gen2 with the pattern of https://<accountname>.dfs.core.chinacloudapi.cn.    Yes
connectVia    The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime if your data store is in a private network. If not specified, the default Azure integration runtime is used.    No

Example:

{
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<accountname>.dfs.core.chinacloudapi.cn", 
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties available for defining datasets, see Datasets.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for Data Lake Storage Gen2 under location settings in the format-based dataset:

Property    Description    Required
type    The type property under location in the dataset must be set to AzureBlobFSLocation.    Yes
fileSystem    The Data Lake Storage Gen2 file system name.    No
folderPath    The path to a folder under the given file system. If you want to use a wildcard to filter folders, skip this setting and specify it in the activity source settings.    No
fileName    The file name under the given fileSystem + folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings.    No

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Data Lake Storage Gen2 linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "filesystemname",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}
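
To point the dataset at a single file instead of a folder, add fileName under location, as described in the table above. A minimal variant of the example (the file name is a placeholder):

"location": {
    "type": "AzureBlobFSLocation",
    "fileSystem": "filesystemname",
    "folderPath": "folder/subfolder",
    "fileName": "myfile.csv"
}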

Copy activity properties

For a full list of sections and properties available for defining activities, see Copy activity configurations and Pipelines and activities. This section provides a list of properties supported by the Data Lake Storage Gen2 source and sink.

Azure Data Lake Storage Gen2 as a source type

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

You have several options to copy data from ADLS Gen2:

  • Copy from the given path specified in the dataset.
  • Wildcard filter against the folder path or file name; see wildcardFolderPath and wildcardFileName.
  • Copy the files defined in a given text file as a file set; see fileListPath.

The following properties are supported for Data Lake Storage Gen2 under storeSettings settings in the format-based copy source:

Property    Description    Required
type    The type property under storeSettings must be set to AzureBlobFSReadSettings.    Yes
Locate the files to copy:
OPTION 1: static path
Copy from the given file system or folder/file path specified in the dataset. If you want to copy all files from a file system/folder, additionally specify wildcardFileName as *.
OPTION 2: wildcard
- wildcardFolderPath
The folder path with wildcard characters under the given file system configured in the dataset, used to filter source folders.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape character inside.
See more examples in Folder and file filter examples.
No
OPTION 2: wildcard
- wildcardFileName
The file name with wildcard characters under the given file system + folderPath/wildcardFolderPath, used to filter source files.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.
Yes
OPTION 3: a list of files
- fileListPath
Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset.
When using this option, do not specify a file name in the dataset. See more examples in File list examples.
No
Additional settings:
recursive    Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink.
Allowed values are true (default) and false.
This property doesn't apply when you configure fileListPath.
No
deleteFilesAfterCompletion    Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store.
This property is only valid in the binary files copy scenario. Default value: false.
No
modifiedDatetimeStart    Files filter based on the attribute Last Modified.
The files will be selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z".
The properties can be NULL, which means no file attribute filter will be applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value will be selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value will be selected.
This property doesn't apply when you configure fileListPath.
No
modifiedDatetimeEnd    Same as above.    No
enablePartitionDiscovery    For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns.
Allowed values are false (default) and true.
No
partitionRootPath    When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns.

If it is not specified, by default,
- When you use a file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset.
- When you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard.

For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27":
- If you specify the partition root path as "root/folder/year=2020", the copy activity will generate two more columns, month and day, with the values "08" and "27" respectively, in addition to the columns inside the files.
- If the partition root path is not specified, no extra columns will be generated.
No
maxConcurrentConnections    The number of connections to the storage store that can be made concurrently. Specify a value only when you want to limit concurrent connections to the data store.    No

Example:

"activities":[
    {
        "name": "CopyFromADLSGen2",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "AzureBlobFSReadSettings",
                    "recursive": true,
                    "wildcardFolderPath": "myfolder*A",
                    "wildcardFileName": "*.csv"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
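
The storeSettings block accepts the other options from the table above in the same way. For instance, a hedged fragment (values are placeholders) that filters files by last-modified time and parses key=value partition folders into extra source columns could look like this:

"storeSettings":{
    "type": "AzureBlobFSReadSettings",
    "recursive": true,
    "modifiedDatetimeStart": "2020-08-27T00:00:00Z",
    "modifiedDatetimeEnd": "2020-08-28T00:00:00Z",
    "enablePartitionDiscovery": true,
    "partitionRootPath": "root/folder/year=2020"
}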

Azure Data Lake Storage Gen2 as a sink type

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for Data Lake Storage Gen2 under storeSettings settings in the format-based copy sink:

Property    Description    Required
type    The type property under storeSettings must be set to AzureBlobFSWriteSettings.    Yes
copyBehavior    Defines the copy behavior when the source is files from a file-based data store.

Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
No
blockSizeInMB    Specify the block size in MB used to write data to ADLS Gen2. Learn more about Block Blobs.
The allowed value is between 4 MB and 100 MB.
By default, ADF automatically determines the block size based on your source store type and data. For a non-binary copy into ADLS Gen2, the default block size is 100 MB, so as to fit at most 4.95 TB of data. This may not be optimal when your data is not large, especially when you use a self-hosted integration runtime with a poor network, which can result in operation timeouts or performance issues. You can explicitly specify a block size, while ensuring that blockSizeInMB*50000 is big enough to store the data; otherwise the copy activity run will fail.
No
maxConcurrentConnections    The number of connections to the data store that can be made concurrently. Specify a value only when you want to limit concurrent connections to the data store.    No

Example:

"activities":[
    {
        "name": "CopyToADLSGen2",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Parquet output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings":{
                    "type": "AzureBlobFSWriteSettings",
                    "copyBehavior": "PreserveHierarchy"
                }
            }
        }
    }
]
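
If the default block size is not a good fit, blockSizeInMB from the table above can be set explicitly in the same storeSettings block. A hedged fragment with a placeholder value:

"storeSettings":{
    "type": "AzureBlobFSWriteSettings",
    "copyBehavior": "PreserveHierarchy",
    "blockSizeInMB": 8
}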

Folder and file filter examples

This section describes the resulting behavior of the folder path and file name with wildcard filters.

folderPath    fileName    recursive    Source folder structure and filter result (retrieved files are marked with *)
Folder*    (empty, use default)    false    FolderA
    File1.csv *
    File2.json *
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv
Folder*    (empty, use default)    true    FolderA
    File1.csv *
    File2.json *
    Subfolder1
        File3.csv *
        File4.json *
        File5.csv *
AnotherFolderB
    File6.csv
Folder*    *.csv    false    FolderA
    File1.csv *
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv
Folder*    *.csv    true    FolderA
    File1.csv *
    File2.json
    Subfolder1
        File3.csv *
        File4.json
        File5.csv *
AnotherFolderB
    File6.csv

File list examples

This section describes the resulting behavior of using a file list path in the copy activity source.

Assume that you have the following source folder structure and want to copy the files marked with *:

Sample source structure    Content in FileListToCopy.txt    ADF configuration
filesystem
    FolderA
        File1.csv *
        File2.json
        Subfolder1
            File3.csv *
            File4.json
            File5.csv *
    Metadata
        FileListToCopy.txt
File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv
In the dataset:
- File system: filesystem
- Folder path: FolderA

In the copy activity source:
- File list path: filesystem/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.
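
For the layout above, the source side of the copy activity would be configured roughly as follows; this is a sketch that assumes a delimited text source and reuses the sample names from this example:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings":{
        "type": "AzureBlobFSReadSettings",
        "fileListPath": "filesystem/Metadata/FileListToCopy.txt"
    }
}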

Some recursive and copyBehavior examples

This section describes the resulting behavior of the copy operation for different combinations of recursive and copyBehavior values.

recursive    copyBehavior    Source folder structure    Resulting target
true    preserveHierarchy    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the same structure as the source:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
true    flattenHierarchy    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the following structure:

Folder1
    autogenerated name for File1
    autogenerated name for File2
    autogenerated name for File3
    autogenerated name for File4
    autogenerated name for File5
true    mergeFiles    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the following structure:

Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file with an autogenerated file name.
false    preserveHierarchy    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the following structure:

Folder1
    File1
    File2

Subfolder1 with File3, File4, and File5 isn't picked up.
false    flattenHierarchy    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the following structure:

Folder1
    autogenerated name for File1
    autogenerated name for File2

Subfolder1 with File3, File4, and File5 isn't picked up.
false    mergeFiles    Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
The target Folder1 is created with the following structure:

Folder1
    File1 + File2 contents are merged into one file with an autogenerated file name.

Subfolder1 with File3, File4, and File5 isn't picked up.

Preserve metadata during copy

When you copy files from Amazon S3, Azure Blob, or Azure Data Lake Storage Gen2 to Azure Data Lake Storage Gen2 or Azure Blob, you can choose to preserve the file metadata along with the data. Learn more from Preserve metadata.
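
As a hedged sketch (see the Preserve metadata article for the exact options), metadata preservation is switched on through a preserve setting in the copy activity's typeProperties, alongside the source and sink:

"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "<sink type>"
    },
    "preserve": [ "Attributes" ]
}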

Mapping data flow properties

When you're transforming data in mapping data flows, you can read and write files from Azure Data Lake Storage Gen2 in the following formats:

Format-specific settings are located in the documentation for that format. For more information, see Source transformation in mapping data flow and Sink transformation in mapping data flow.

Source transformation

In the source transformation, you can read from a container, folder, or individual file in Azure Data Lake Storage Gen2. The Source options tab lets you manage how the files get read.

Source options

Wildcard path: Using a wildcard pattern instructs ADF to loop through each matching folder and file in a single source transformation. This is an effective way to process multiple files within a single flow. Add multiple wildcard matching patterns with the plus sign that appears when you hover over your existing wildcard pattern.

From your source container, choose a series of files that match a pattern. Only the container can be specified in the dataset. Your wildcard path must therefore also include your folder path from the root folder.

Wildcard examples:

  • * represents any set of characters.

  • ** represents recursive directory nesting.

  • ? replaces one character.

  • [] matches one of the characters in the brackets.

  • /data/sales/**/*.csv gets all .csv files under /data/sales.

  • /data/sales/20??/**/ gets all files under year folders matching 20??.

  • /data/sales/*/*/*.csv gets .csv files two levels under /data/sales.

  • /data/sales/2004/*/12/[XY]1?.csv gets all .csv files from December 2004 whose names start with X or Y followed by a two-digit number.

Partition root path: If you have partitioned folders in your file source with a key=value format (for example, year=2019), then you can assign the top level of that partition folder tree to a column name in your data flow data stream.

First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you want to read.

Partition source file settings

Use the Partition root path setting to define what the top level of the folder structure is. When you view the contents of your data via a data preview, you'll see that ADF adds the resolved partitions found in each of your folder levels.

Partition root path

List of files: This is a file set. Create a text file that includes a list of relative-path files to process, and point to this text file.

Column to store file name: Store the name of the source file in a column in your data. Enter a new column name here to store the file name string.

After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or move the source file. The paths for the move are relative.

To move source files to another location post-processing, first select "Move" for the file operation. Then, set the "from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder as your source folder.

If you have a source path with a wildcard, your syntax will look like this:

/data/sales/20??/**/*.csv

You can specify "from" as

/data/sales

and "to" as

/backup/priorSales

In this case, all files that were sourced under /data/sales are moved to /backup/priorSales.

Note

File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses the Execute Data Flow activity. File operations do not run in Data Flow debug mode.

Filter by last modified: You can filter which files you process by specifying a date range of when they were last modified. All date-times are in UTC.

Sink properties

In the sink transformation, you can write to either a container or a folder in Azure Data Lake Storage Gen2. The Settings tab lets you manage how the files get written.

Sink options

Clear the folder: Determines whether or not the destination folder gets cleared before the data is written.

File name option: Determines how the destination files are named in the destination folder. The file name options are:

  • Default: Allow Spark to name files based on PART defaults.
  • Pattern: Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will create loans1.csv, loans2.csv, and so on.
  • Per partition: Enter one file name per partition.
  • As data in column: Set the output file to the value of a column. The path is relative to the dataset container, not the destination folder. If you have a folder path in your dataset, it will be overridden.
  • Output to a single file: Combine the partitioned output files into a single named file. The path is relative to the dataset folder. Be aware that the merge operation can fail depending on node size. This option is not recommended for large datasets.

Quote all: Determines whether to enclose all values in quotes.

Lookup activity properties

To learn details about the properties, check Lookup activity.

GetMetadata activity properties

To learn details about the properties, check GetMetadata activity.

Delete activity properties

To learn details about the properties, check Delete activity.

Legacy models

Note

The following models are still supported as-is for backward compatibility. You are suggested to use the new model mentioned in the sections above going forward, and the ADF authoring UI has switched to generating the new model.

Legacy dataset model

Property    Description    Required
type    The type property of the dataset must be set to AzureBlobFSFile.    Yes
folderPath    Path to the folder in Data Lake Storage Gen2. If not specified, it points to the root.

Wildcard filter is supported. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your actual folder name has a wildcard or this escape character inside.

Examples: filesystem/folder/. See more examples in Folder and file filter examples.
No
fileName    Name or wildcard filter for the files under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder.

For the filter, allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character).
- Example 1: "fileName": "*.csv"
- Example 2: "fileName": "???20180427.txt"
Use ^ to escape if your actual file name has a wildcard or this escape character inside.

When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]", for example, "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz". If you copy from a tabular source using a table name instead of a query, the name pattern is "[table name].[format].[compression if configured]", for example, "MyTable.csv".
No
modifiedDatetimeStart    Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z".

The overall performance of data movement is affected by enabling this setting when you want to filter huge numbers of files.

The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected.
No
modifiedDatetimeEnd    Files filter based on the attribute Last Modified. The files are selected if their last modified time is within the time range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format of "2018-12-01T05:00:00Z".

The overall performance of data movement is affected by enabling this setting when you want to filter huge numbers of files.

The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected.
No
format    If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions.

If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, ORC format, and Parquet format sections.
No (only for binary copy scenario)
compression    Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs.
Supported types are GZip, Deflate, BZip2, and ZipDeflate.
Supported levels are Optimal and Fastest.
No

Tip

To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with the folder part and fileName with the file name.
To copy a subset of files under a folder, specify folderPath with the folder part and fileName with a wildcard filter.

Example:

{
    "name": "ADLSGen2Dataset",
    "properties": {
        "type": "AzureBlobFSFile",
        "linkedServiceName": {
            "referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "myfilesystem/myfolder",
            "fileName": "*",
            "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
            "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

Legacy copy activity source model

Property    Description    Required
type    The type property of the copy activity source must be set to AzureBlobFSSource.    Yes
recursive    Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink.
Allowed values are true (default) and false.
No
maxConcurrentConnections    The number of connections to the data store that can be made concurrently. Specify a value only when you want to limit concurrent connections to the data store.    No

Example:

"activities":[
    {
        "name": "CopyFromADLSGen2",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<ADLS Gen2 input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AzureBlobFSSource",
                "recursive": true
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Legacy copy activity sink model

Property    Description    Required
type    The type property of the copy activity sink must be set to AzureBlobFSSink.    Yes
copyBehavior    Defines the copy behavior when the source is files from a file-based data store.

Allowed values are:
- PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
- FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
- MergeFiles: Merges all files from the source folder into one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
No
maxConcurrentConnections    The number of connections to the data store that can be made concurrently. Specify a value only when you want to limit concurrent connections to the data store.    No

Example:

"activities":[
    {
        "name": "CopyToADLSGen2",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<ADLS Gen2 output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "AzureBlobFSSink",
                "copyBehavior": "PreserveHierarchy"
            }
        }
    }
]

Next steps

For a list of data stores supported as sources and sinks by the Copy activity in Data Factory, see Supported data stores.