Copy and transform data in Azure Blob storage by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines how to use the Copy activity in Azure Data Factory to copy data from and to Azure Blob storage. To learn about Azure Data Factory, read the introductory article.

Tip

To learn about a migration scenario for a data lake or a data warehouse, see Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure.

Supported capabilities

This Azure Blob storage connector is supported for the following activities:

  • Copy activity with supported source/sink matrix
  • Mapping data flow
  • Lookup activity
  • GetMetadata activity
  • Delete activity

For the Copy activity, this Blob storage connector supports:

  • Copying blobs to and from general-purpose Azure storage accounts and hot/cool blob storage.
  • Copying blobs by using an account key, a service shared access signature (SAS), a service principal, or managed identities for Azure resources authentication.
  • Copying blobs from block, append, or page blobs, and copying data to only block blobs.
  • Copying blobs as is, or parsing or generating blobs with supported file formats and compression codecs.
  • Preserving file metadata during copy.

Important

If you enable the Allow trusted Microsoft services to access this storage account option in Azure Storage firewall settings and want to use the Azure integration runtime to connect to Blob storage, you must use managed identity authentication.

Get started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

  • The Copy Data tool
  • The Azure portal
  • The .NET SDK
  • The Python SDK
  • Azure PowerShell
  • The REST API
  • An Azure Resource Manager template

The following sections provide details about properties that are used to define Data Factory entities specific to Blob storage.

Linked service properties

This Blob storage connector supports the following authentication types. See the corresponding sections for details.

  • Account key authentication
  • Shared access signature authentication
  • Service principal authentication
  • Managed identities for Azure resources authentication

Note

When you're using PolyBase to load data into Azure Synapse Analytics (formerly SQL Data Warehouse), if your source or staging Blob storage is configured with an Azure Virtual Network endpoint, you must use managed identity authentication as required by PolyBase. You must also use self-hosted integration runtime version 3.18 or later. See the Managed identity authentication section for more configuration prerequisites.

Note

Azure HDInsight and Azure Machine Learning activities only support authentication that uses Azure Blob storage account keys.

Account key authentication

Data Factory supports the following properties for storage account key authentication:

  • type (Required: Yes): The type property must be set to AzureBlobStorage (suggested) or AzureStorage (see the following notes).
  • connectionString (Required: Yes): Specify the information needed to connect to Storage for the connectionString property. You can also put the account key in Azure Key Vault and pull the accountKey configuration out of the connection string. For more information, see the following samples and the Store credentials in Azure Key Vault article.
  • connectVia (Required: No): The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime.

Note

A secondary Blob service endpoint is not supported when you're using account key authentication. You can use other authentication types.

Note

If you're using the "AzureStorage" type linked service, it's still supported as is. But we suggest that you use the new "AzureBlobStorage" linked service type going forward.

Example:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>;EndpointSuffix=core.chinacloudapi.cn"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Example: store the account key in Azure Key Vault

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;EndpointSuffix=core.chinacloudapi.cn",
            "accountKey": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "<Azure Key Vault linked service name>", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "<secretName>" 
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }            
    }
}

Shared access signature authentication

A shared access signature provides delegated access to resources in your storage account. You can use a shared access signature to grant a client limited permissions to objects in your storage account for a specified time.

You don't have to share your account access keys. The shared access signature is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the shared access signature, the client only needs to pass the shared access signature to the appropriate constructor or method.
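For example, here is a minimal sketch of that pattern with the azure-storage-blob package for Python; the account, container, blob, and SAS token values are placeholders. The client is constructed from the SAS URI alone, with no account key or other credential involved:

from azure.storage.blob import BlobClient

# A full SAS URI carries the signed query parameters in the URL itself,
# so no separate credential object is needed.
sas_url = (
    "https://<accountname>.blob.core.chinacloudapi.cn/"
    "<container>/<blob>?<SAS token>"
)

# Every request made through this client is authorized by the signature
# in the query string, within its validity window and permissions.
blob_client = BlobClient.from_blob_url(sas_url)
content = blob_client.download_blob().readall()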

For more information about shared access signatures, see Shared access signatures: Understand the shared access signature model.

Note

  • Data Factory now supports both service shared access signatures and account shared access signatures. For more information about shared access signatures, see Grant limited access to Azure Storage resources using shared access signatures.
  • In later dataset configurations, the folder path is the absolute path starting from the container level. You need to configure one aligned with the path in your SAS URI.

Data Factory supports the following properties for using shared access signature authentication:

  • type (Required: Yes): The type property must be set to AzureBlobStorage (suggested) or AzureStorage (see the following note).
  • sasUri (Required: Yes): Specify the shared access signature URI to the Storage resources, such as a blob or container. Mark this field as SecureString to store it securely in Data Factory. You can also put the SAS token in Azure Key Vault to use auto-rotation and remove the token portion. For more information, see the following samples and Store credentials in Azure Key Vault.
  • connectVia (Required: No): The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime.

Note

If you're using the "AzureStorage" type linked service, it's still supported as is. But we suggest that you use the new "AzureBlobStorage" linked service type going forward.

Example:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "sasUri": {
                "type": "SecureString",
                "value": "<SAS URI of the Azure Storage resource e.g. https://<accountname>.blob.core.chinacloudapi.cn/?sv=<storage version>&st=<start time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Example: store the SAS token in Azure Key Vault

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "sasUri": {
                "type": "SecureString",
                "value": "<SAS URI of the Azure Storage resource without token e.g. https://<accountname>.blob.core.chinacloudapi.cn/>"
            },
            "sasToken": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "<Azure Key Vault linked service name>", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "<secretName with value of SAS token e.g. ?sv=<storage version>&st=<start time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>" 
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

When you create a shared access signature URI, consider the following points:

  • Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is used in your data factory.
  • Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active period of the pipeline.
  • The URI should be created at the right container or blob based on the need. A shared access signature URI to a blob allows Data Factory to access that particular blob. A shared access signature URI to a Blob storage container allows Data Factory to iterate through blobs in that container. To provide access to more or fewer objects later, or to update the shared access signature URI, remember to update the linked service with the new URI.
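As an illustration of these points, the following is a minimal sketch that generates a read-only service SAS for a container with the azure-storage-blob package for Python; the account, container, and key values are placeholders, and the permissions and expiry should follow the guidance above:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Read and list permissions are enough for a source-side linked service;
# a sink would need write permissions instead.
sas_token = generate_container_sas(
    account_name="<accountname>",
    container_name="<container>",
    account_key="<accountkey>",
    permission=ContainerSasPermissions(read=True, list=True),
    # Choose an expiry that covers the active period of the pipeline.
    expiry=datetime.now(timezone.utc) + timedelta(days=30),
)

# The SAS URI for the linked service is the container URL plus the token.
sas_uri = f"https://<accountname>.blob.core.chinacloudapi.cn/<container>?{sas_token}"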

Service principal authentication

For general information about Azure Storage service principal authentication, see Authenticate access to Azure Storage using Azure Active Directory.

To use service principal authentication, follow these steps:

  1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application with an Azure AD tenant. Make note of these values, which you use to define the linked service:

    • Application ID
    • Application key
    • Tenant ID
  2. Grant the service principal proper permission in Azure Blob storage. For more information on the roles, see Manage access rights to Azure Storage data with RBAC.

    • As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
    • As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.
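To sanity-check the app registration and role assignment outside of Data Factory, a minimal sketch with the azure-identity and azure-storage-blob packages for Python can help; the tenant, client, secret, account, and container values are placeholders, and the authority is set for Azure China to match the endpoints used in this article:

from azure.identity import AzureAuthorityHosts, ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# The three values recorded in step 1.
credential = ClientSecretCredential(
    tenant_id="<tenant ID>",
    client_id="<application ID>",
    client_secret="<application key>",
    authority=AzureAuthorityHosts.AZURE_CHINA,
)

service = BlobServiceClient(
    account_url="https://<accountName>.blob.core.chinacloudapi.cn/",
    credential=credential,
)

# Listing succeeds only if the role assignment from step 2 is in place.
for blob in service.get_container_client("<container>").list_blobs():
    print(blob.name)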

These properties are supported for an Azure Blob storage linked service:

  • type (Required: Yes): The type property must be set to AzureBlobStorage.
  • serviceEndpoint (Required: Yes): Specify the Azure Blob storage service endpoint with the pattern of https://<accountName>.blob.core.chinacloudapi.cn/.
  • accountKind (Required: No): Specify the kind of your storage account. Allowed values are Storage (general-purpose v1), StorageV2 (general-purpose v2), BlobStorage, and BlockBlobStorage.
  • servicePrincipalId (Required: Yes): Specify the application's client ID.
  • servicePrincipalKey (Required: Yes): Specify the application's key. Mark this field as SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault.
  • tenant (Required: Yes): Specify the tenant information (domain name or tenant ID) under which your application resides. Retrieve it by hovering over the upper-right corner of the Azure portal.
  • azureCloudType (Required: No): For service principal authentication, specify the type of Azure cloud environment to which your Azure Active Directory application is registered. Allowed values are AzurePublic, AzureChina, AzureUsGovernment, and AzureGermany. By default, the data factory's cloud environment is used.
  • connectVia (Required: No): The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime.

Note

Service principal authentication is supported only by the "AzureBlobStorage" type linked service, not the previous "AzureStorage" type linked service.

Example:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {            
            "serviceEndpoint": "https://<accountName>.blob.core.chinacloudapi.cn/",
            "accountKind": "StorageV2",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant info, e.g. microsoft.partner.onmschina.cn>" 
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Managed identities for Azure resources authentication

A data factory can be associated with a managed identity for Azure resources, which represents this specific data factory. You can directly use this managed identity for Blob storage authentication, which is similar to using your own service principal. It allows this designated factory to access and copy data from or to Blob storage.

For general information about Azure Storage authentication, see Authenticate access to Azure Storage using Azure Active Directory. To use managed identities for Azure resources authentication, follow these steps:

  1. Retrieve the Data Factory managed identity information by copying the value of the managed identity object ID generated along with your factory.

  2. Grant the managed identity permission in Azure Blob storage. For more information on the roles, see Manage access rights to Azure Storage data with RBAC.

    • As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
    • As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.
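Within Data Factory itself, no credential material is configured for this option; the service acquires tokens on the factory's behalf. For comparison, code running on any Azure resource that has a managed identity follows the same pattern, sketched here with the azure-identity package for Python (account and container names are placeholders):

from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient

# No secret is stored anywhere; the token comes from the instance
# metadata endpoint of the hosting Azure resource.
credential = ManagedIdentityCredential()

service = BlobServiceClient(
    account_url="https://<accountName>.blob.core.chinacloudapi.cn/",
    credential=credential,
)

# Works once the identity holds the Storage Blob Data Reader role
# (or Storage Blob Data Contributor for write access).
for blob in service.get_container_client("<container>").list_blobs():
    print(blob.name)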

Important

If you use PolyBase to load data from Blob storage (as a source or as staging) into Azure Synapse Analytics (formerly SQL Data Warehouse), when you're using managed identity authentication for Blob storage, make sure you also follow steps 1 and 2 in this guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your server. Data Factory handles the rest. If you configured Blob storage with an Azure Virtual Network endpoint, to use PolyBase to load data from it, you must use managed identity authentication as required by PolyBase.

These properties are supported for an Azure Blob storage linked service:

  • type (Required: Yes): The type property must be set to AzureBlobStorage.
  • serviceEndpoint (Required: Yes): Specify the Azure Blob storage service endpoint with the pattern of https://<accountName>.blob.core.chinacloudapi.cn/.
  • accountKind (Required: No): Specify the kind of your storage account. Allowed values are Storage (general-purpose v1), StorageV2 (general-purpose v2), BlobStorage, and BlockBlobStorage.
  • connectVia (Required: No): The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime.

Note

Managed identities for Azure resources authentication is supported only by the "AzureBlobStorage" type linked service, not the previous "AzureStorage" type linked service.

Example:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {            
            "serviceEndpoint": "https://<accountName>.blob.core.chinacloudapi.cn/",
            "accountKind": "StorageV2" 
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

  • Avro format
  • Binary format
  • Delimited text format
  • JSON format
  • ORC format
  • Parquet format

The following properties are supported for Azure Blob storage under location settings in a format-based dataset:

  • type (Required: Yes): The type property of the location in the dataset must be set to AzureBlobStorageLocation.
  • container (Required: Yes): The blob container.
  • folderPath (Required: No): The path to the folder under the given container. If you want to use a wildcard to filter the folder, skip this setting and specify it in the activity source settings.
  • fileName (Required: No): The file name under the given container and folder path. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings.

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties that the Blob storage source and sink support.

Blob storage as a source type

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

  • Avro format
  • Binary format
  • Delimited text format
  • JSON format
  • ORC format
  • Parquet format

The following properties are supported for Azure Blob storage under storeSettings settings in a format-based copy source:

  • type (Required: Yes): The type property under storeSettings must be set to AzureBlobStorageReadSettings.

Locate the files to copy by using one of the following options:

  • OPTION 1: static path: Copy from the given container or folder/file path specified in the dataset. If you want to copy all blobs from a container or folder, additionally specify wildcardFileName as *.
  • OPTION 2: blob prefix, prefix (Required: No): Prefix for the blob name under the given container configured in a dataset, used to filter source blobs. Blobs whose names start with container_in_dataset/this_prefix are selected. This option uses the service-side filter for Blob storage, which provides better performance than a wildcard filter; see the sketch after this list.
  • OPTION 3: wildcard, wildcardFolderPath (Required: No): The folder path with wildcard characters under the given container configured in a dataset, used to filter source folders. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character). Use ^ to escape if your folder name contains a wildcard or this escape character. See more examples in Folder and file filter examples.
  • OPTION 3: wildcard, wildcardFileName (Required: Yes): The file name with wildcard characters under the given container and folder path (or wildcard folder path), used to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character). Use ^ to escape if your file name contains a wildcard or this escape character. See more examples in Folder and file filter examples.
  • OPTION 4: a list of files, fileListPath (Required: No): Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, as the relative path to the path configured in the dataset. When you're using this option, do not specify a file name in the dataset. See more examples in File list examples.

Additional settings:

  • recursive (Required: No): Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath.
  • deleteFilesAfterCompletion (Required: No): Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the Copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is valid only in the binary copy scenario, where the data source stores are Blob, ADLS Gen2, S3, Google Cloud Storage, File, Azure File, SFTP, or FTP. The default value is false.
  • modifiedDatetimeStart (Required: No): Files are filtered based on the last-modified attribute. The files are selected if their last modified time is within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to a UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected. This property doesn't apply when you configure fileListPath.
  • modifiedDatetimeEnd (Required: No): Same as above.
  • enablePartitionDiscovery (Required: No): For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true.
  • partitionRootPath (Required: No): When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it is not specified, by default:
    - When you use the file path in the dataset or the list of files on the source, the partition root path is the path configured in the dataset.
    - When you use a wildcard folder filter, the partition root path is the subpath before the first wildcard.
    - When you use a prefix, the partition root path is the subpath before the last "/".
    For example, assume you configure the path in the dataset as "root/folder/year=2020/month=08/day=27":
    - If you specify the partition root path as "root/folder/year=2020", the Copy activity generates two more columns, month and day, with the values "08" and "27" respectively, in addition to the columns inside the files.
    - If the partition root path is not specified, no extra columns are generated.
  • maxConcurrentConnections (Required: No): The number of concurrent connections to storage. Specify a value only when you want to limit concurrent connections to the data store.
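The performance difference between the prefix option and the wildcard options comes from where the filtering happens: a prefix is evaluated by the Blob service itself, whereas wildcard matching is applied after enumeration. The prefix filter corresponds to the server-side listing shown in this minimal Python sketch (the container URL and prefix are placeholders):

from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url(
    "https://<accountname>.blob.core.chinacloudapi.cn/<container>?<SAS token>"
)

# name_starts_with is sent to the service, so only matching blobs come
# back over the wire; a wildcard filter must inspect every listed name.
for blob in container.list_blobs(name_starts_with="this_prefix"):
    print(blob.name)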

Note

For the Parquet and delimited text formats, the BlobSource type for the Copy activity source mentioned in the next section is still supported as is for backward compatibility. We suggest that you use the new model going forward; the Data Factory authoring UI has switched to generating these new types.

Example:

"activities":[
    {
        "name": "CopyFromBlob",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true,
                    "wildcardFolderPath": "myfolder*A",
                    "wildcardFileName": "*.csv"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Blob storage as a sink type

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

  • Avro format
  • Binary format
  • Delimited text format
  • JSON format
  • ORC format
  • Parquet format

The following properties are supported for Azure Blob storage under storeSettings settings in a format-based copy sink:

  • type (Required: Yes): The type property under storeSettings must be set to AzureBlobStorageWriteSettings.
  • copyBehavior (Required: No): Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
    - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
    - FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
    - MergeFiles: Merges all files from the source folder into one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
  • blockSizeInMB (Required: No): Specify the block size, in megabytes, used to write data to block blobs. Learn more about block blobs. The allowed value is between 4 MB and 100 MB. By default, Data Factory automatically determines the block size based on your source store type and data. For nonbinary copy into Blob storage, the default block size is 100 MB, so it can fit (at most) 4.95 TB of data. This might not be optimal when your data is not large, especially when you use the self-hosted integration runtime with poor network connections that result in operation timeouts or performance issues. You can explicitly specify a block size, while ensuring that blockSizeInMB*50000 is big enough to store the data; otherwise, the Copy activity run will fail. See the sketch after this list.
  • maxConcurrentConnections (Required: No): The number of concurrent connections to storage. Specify a value only when you want to limit concurrent connections to the data store.
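Because a block blob can contain at most 50,000 blocks, blockSizeInMB bounds the largest blob a copy run can produce. A quick back-of-the-envelope check in plain Python:

# A block blob holds at most 50,000 blocks, so the largest writable blob
# is blockSizeInMB * 50000 megabytes.
MAX_BLOCKS = 50_000

def min_block_size_mb(data_size_gb: float) -> float:
    """Smallest blockSizeInMB that can still hold data_size_gb of output."""
    return data_size_gb * 1024 / MAX_BLOCKS

# A 200-GB copy needs only ~4.1 MB blocks, so lowering blockSizeInMB from
# the 100 MB default is safe there and can help on unstable networks.
print(min_block_size_mb(200))  # 4.096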

Example:

"activities":[
    {
        "name": "CopyFromBlob",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Parquet output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "ParquetSink",
                "storeSettings":{
                    "type": "AzureBlobStorageWriteSettings",
                    "copyBehavior": "PreserveHierarchy"
                }
            }
        }
    }
]

Folder and file filter examples

This section describes the resulting behavior of the folder path and file name with wildcard filters. All of the following examples use this source folder structure:

container
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

  • folderPath: container/Folder*; fileName: (empty, use default); recursive: false
    Retrieved files: FolderA/File1.csv and FolderA/File2.json.
  • folderPath: container/Folder*; fileName: (empty, use default); recursive: true
    Retrieved files: FolderA/File1.csv, FolderA/File2.json, FolderA/Subfolder1/File3.csv, FolderA/Subfolder1/File4.json, and FolderA/Subfolder1/File5.csv.
  • folderPath: container/Folder*; fileName: *.csv; recursive: false
    Retrieved files: FolderA/File1.csv.
  • folderPath: container/Folder*; fileName: *.csv; recursive: true
    Retrieved files: FolderA/File1.csv, FolderA/Subfolder1/File3.csv, and FolderA/Subfolder1/File5.csv.

In every case, AnotherFolderB is not matched by the folder path container/Folder*.

File list examples

This section describes the resulting behavior of using a file list path in the Copy activity source.

Assume that you have the following source folder structure and want to copy File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv:

container
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content of FileListToCopy.txt:

File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Data Factory configuration:

In the dataset:
  • Container: container
  • Folder path: FolderA

In the Copy activity source:
  • File list path: container/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, as the relative path to the path configured in the dataset.

Some recursive and copyBehavior examples

This section describes the resulting behavior of the copy operation for different combinations of recursive and copyBehavior values. All of the examples use this source folder structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

  • recursive: true; copyBehavior: preserveHierarchy
    The target folder, Folder1, is created with the same structure as the source:

    Folder1
        File1
        File2
        Subfolder1
            File3
            File4
            File5

  • recursive: true; copyBehavior: flattenHierarchy
    The target folder, Folder1, is created with the following structure:

    Folder1
        autogenerated name for File1
        autogenerated name for File2
        autogenerated name for File3
        autogenerated name for File4
        autogenerated name for File5

  • recursive: true; copyBehavior: mergeFiles
    The target folder, Folder1, is created with the following structure:

    Folder1
        The contents of File1 + File2 + File3 + File4 + File5 are merged into one file with an autogenerated file name.

  • recursive: false; copyBehavior: preserveHierarchy
    The target folder, Folder1, is created with the following structure:

    Folder1
        File1
        File2

    Subfolder1 with File3, File4, and File5 is not picked up.

  • recursive: false; copyBehavior: flattenHierarchy
    The target folder, Folder1, is created with the following structure:

    Folder1
        autogenerated name for File1
        autogenerated name for File2

    Subfolder1 with File3, File4, and File5 is not picked up.

  • recursive: false; copyBehavior: mergeFiles
    The target folder, Folder1, is created with the following structure:

    Folder1
        The contents of File1 + File2 are merged into one file with an autogenerated file name.

    Subfolder1 with File3, File4, and File5 is not picked up.

Preserving metadata during copy

When you copy files from Amazon S3, Azure Blob storage, or Azure Data Lake Storage Gen2 to Azure Data Lake Storage Gen2 or Azure Blob storage, you can choose to preserve the file metadata along with the data. Learn more from Preserve metadata.

Lookup activity properties

To learn details about the properties, check Lookup activity.

GetMetadata activity properties

To learn details about the properties, check GetMetadata activity.

Delete activity properties

To learn details about the properties, check Delete activity.

Legacy models

Note

The following models are still supported as is for backward compatibility. We suggest that you use the new model mentioned earlier. The Data Factory authoring UI has switched to generating the new model.

Legacy dataset model

  • type (Required: Yes): The type property of the dataset must be set to AzureBlob.
  • folderPath (Required: Yes for the Copy or Lookup activity, No for the GetMetadata activity): Path to the container and folder in Blob storage. A wildcard filter is supported for the path, excluding the container name. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character). Use ^ to escape if your folder name contains a wildcard or this escape character. An example is myblobcontainer/myblobfolder/. See more examples in Folder and file filter examples.
  • fileName (Required: No): Name or wildcard filter for the blobs under the specified folderPath value. If you don't specify a value for this property, the dataset points to all blobs in the folder. For the filter, allowed wildcards are * (matches zero or more characters) and ? (matches zero or one character).
    - Example 1: "fileName": "*.csv"
    - Example 2: "fileName": "???20180427.txt"
    Use ^ to escape if your file name contains a wildcard or this escape character. When fileName isn't specified for an output dataset and preserveHierarchy isn't specified in the activity sink, the Copy activity automatically generates the blob name with the following pattern: "Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]". For example: "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz". If you copy from a tabular source by using a table name instead of a query, the name pattern is "[table name].[format].[compression if configured]". For example: "MyTable.csv".
  • modifiedDatetimeStart (Required: No): Files are filtered based on the last-modified attribute. The files are selected if their last modified time is within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that enabling this setting affects the overall performance of data movement when you want to filter huge numbers of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, the files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, the files whose last modified attribute is less than the datetime value are selected.
  • modifiedDatetimeEnd (Required: No): Same as above.
  • format (Required: No, only for the binary copy scenario): If you want to copy files as is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, ORC format, and Parquet format sections.
  • compression (Required: No): Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are GZip, Deflate, BZip2, and ZipDeflate. Supported levels are Optimal and Fastest.

Tip

To copy all blobs under a folder, specify folderPath only.
To copy a single blob with a given name, specify folderPath for the folder part and fileName for the file name.
To copy a subset of blobs under a folder, specify folderPath for the folder part and fileName with a wildcard filter.

Example:

{
    "name": "AzureBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "fileName": "*",
            "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
            "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

Legacy source model for the Copy activity

  • type (Required: Yes): The type property of the Copy activity source must be set to BlobSource.
  • recursive (Required: No): Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false.
  • maxConcurrentConnections (Required: No): The number of concurrent connections to storage. Specify a value only when you want to limit concurrent connections to the data store.

Example:

"activities":[
    {
        "name": "CopyFromBlob",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Azure Blob input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
                "recursive": true
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Legacy sink model for the Copy activity

  • type (Required: Yes): The type property of the Copy activity sink must be set to BlobSink.
  • copyBehavior (Required: No): Defines the copy behavior when the source is files from a file-based data store. Allowed values are:
    - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
    - FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have autogenerated names.
    - MergeFiles: Merges all files from the source folder into one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name.
  • maxConcurrentConnections (Required: No): The number of concurrent connections to storage. Specify a value only when you want to limit concurrent connections to the data store.

Example:

"activities":[
    {
        "name": "CopyToBlob",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Azure Blob output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "BlobSink",
                "copyBehavior": "PreserveHierarchy"
            }
        }
    }
]

Next steps

For a list of data stores that the Copy activity in Data Factory supports as sources and sinks, see Supported data stores.