Copy data from Amazon Simple Storage Service by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines how to copy data from Amazon Simple Storage Service (Amazon S3). To learn about Azure Data Factory, read the introductory article.

Tip

To learn more about the data migration scenario from Amazon S3 to Azure Storage, see Use Azure Data Factory to migrate data from Amazon S3 to Azure Storage.

Supported capabilities

This Amazon S3 connector is supported for the following activities:

- Copy activity with the supported source/sink matrix
- Lookup activity
- GetMetadata activity
- Delete activity

Specifically, this Amazon S3 connector supports copying files as is or parsing files with the supported file formats and compression codecs. You can also choose to preserve file metadata during copy. The connector uses AWS Signature Version 4 to authenticate requests to S3.

Tip

You can use this Amazon S3 connector to copy data from any S3-compatible storage provider, such as Google Cloud Storage. Specify the corresponding service URL in the linked service configuration.

Required permissions

To copy data from Amazon S3, make sure you've been granted the following permissions:

  • For Copy activity execution: s3:GetObject and s3:GetObjectVersion for Amazon S3 object operations.
  • For Data Factory GUI authoring: s3:ListAllMyBuckets and s3:ListBucket/s3:GetBucketLocation for Amazon S3 bucket operations. Permissions are also required for operations such as testing connections and browsing to file paths. If you don't want to grant these permissions, skip the test connection on the linked service creation page and specify the path directly in the dataset settings.

For the full list of Amazon S3 permissions, see Specifying Permissions in a Policy on the AWS site.
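As an illustration only, a minimal IAM policy that grants the permissions above might look like the following sketch. The bucket name is a placeholder; adjust the resources and scope to your environment.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": "arn:aws:s3:::<your-bucket>/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ],
            "Resource": "arn:aws:s3:::<your-bucket>"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}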

Getting started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs: the Copy Data tool, the Azure portal, the .NET SDK, the Python SDK, Azure PowerShell, the REST API, or an Azure Resource Manager template.

The following sections provide details about the properties that are used to define Data Factory entities specific to Amazon S3.

Linked service properties

The following properties are supported for an Amazon S3 linked service:

- type: The type property must be set to AmazonS3. (Required: Yes)
- accessKeyId: ID of the secret access key. (Required: Yes)
- secretAccessKey: The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. (Required: Yes)
- serviceUrl: Specify the custom S3 endpoint if you're copying data from an S3-compatible storage provider other than the official Amazon S3 service. For example, to copy data from Google Cloud Storage, specify https://storage.googleapis.com. (Required: No)
- connectVia: The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or the self-hosted integration runtime (if your data store is in a private network). If this property isn't specified, the service uses the default Azure integration runtime. (Required: No)

This connector requires access keys for an AWS Identity and Access Management (IAM) account to copy data from Amazon S3. Temporary security credentials aren't currently supported.

Tip

Specify the custom S3 service URL if you're copying data from S3-compatible storage other than the official Amazon S3 service.

Here's an example:

{
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
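For S3-compatible storage, the same linked service shape applies with serviceUrl added. The following is a hypothetical sketch for Google Cloud Storage (assuming HMAC interoperability keys; all values are placeholders):

{
    "name": "S3CompatibleLinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            },
            "serviceUrl": "https://storage.googleapis.com"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}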

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for Amazon S3 under location settings in a format-based dataset:

- type: The type property under location in a dataset must be set to AmazonS3Location. (Required: Yes)
- bucketName: The S3 bucket name. (Required: Yes)
- folderPath: The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify it in the activity source settings. (Required: No)
- fileName: The file name under the given bucket and folder path. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. (Required: No)
- version: The version of the S3 object, if S3 versioning is enabled. If it's not specified, the latest version will be fetched. (Required: No)

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AmazonS3Location",
                "bucketName": "bucketname",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}
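If you want to point the dataset at a single file, and optionally at a specific object version, you can add fileName and version under location. A sketch (the file name and version ID are illustrative placeholders):

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AmazonS3Location",
                "bucketName": "bucketname",
                "folderPath": "folder/subfolder",
                "fileName": "myfile.csv",
                "version": "<S3 object version ID>"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}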

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties that the Amazon S3 source supports.

Amazon S3 as a source type

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for Amazon S3 under storeSettings settings in a format-based copy source:

- type: The type property under storeSettings must be set to AmazonS3ReadSettings. (Required: Yes)

Locate the files to copy by using one of the following options:

- OPTION 1: static path — Copy from the bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket or folder, additionally specify wildcardFileName as *.
- OPTION 2: S3 prefix (prefix) — Prefix for the S3 key name under the bucket configured in the dataset, used to filter source S3 files. S3 keys whose names start with bucket_in_dataset/this_prefix are selected. This uses S3's service-side filter, which provides better performance than a wildcard filter. (Required: No)
- OPTION 3: wildcard (wildcardFolderPath) — The folder path with wildcard characters under the bucket configured in the dataset, used to filter source folders. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: No)
- OPTION 3: wildcard (wildcardFileName) — The file name with wildcard characters under the given bucket and folder path (or wildcard folder path), used to filter source files. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples. (Required: Yes)
- OPTION 4: a list of files (fileListPath) — Indicates that you want to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, each being the relative path to the path configured in the dataset. When you use this option, don't specify a file name in the dataset. See more examples in File list examples. (Required: No)

Additional settings:

- recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. Allowed values are true (default) and false. This property doesn't apply when you configure fileListPath. (Required: No)
- deleteFilesAfterCompletion: Indicates whether the binary files will be deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the copy activity fails, you'll see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store. This property is valid only in the binary file copy scenario, where the data source stores are Blob, ADLS Gen2, S3, Google Cloud Storage, File, Azure Files, SFTP, or FTP. The default value is false. (Required: No)
- modifiedDatetimeStart: Files are filtered based on the last-modified attribute. Files are selected if their last modified time is within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to a UTC time zone in the format "2018-12-01T05:00:00Z". The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last-modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last-modified attribute is less than the datetime value are selected. This property doesn't apply when you configure fileListPath. (Required: No)
- modifiedDatetimeEnd: Same as above. (Required: No)
- enablePartitionDiscovery: For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns. Allowed values are false (default) and true. (Required: No)
- partitionRootPath: When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns. If it isn't specified, then by default: when you use the file path in the dataset or a list of files on the source, the partition root path is the path configured in the dataset; when you use a wildcard folder filter, the partition root path is the subpath before the first wildcard; when you use a prefix, the partition root path is the subpath before the last "/". For example, assume you configure the path in the dataset as "root/folder/year=2020/month=08/day=27": if you specify the partition root path as "root/folder/year=2020", the copy activity generates two more columns, month and day, with values "08" and "27" respectively, in addition to the columns inside the files; if the partition root path isn't specified, no extra columns are generated. (Required: No)
- maxConcurrentConnections: The number of concurrent connections to the data store. Specify this only when you want to limit concurrent connections to the data store. (Required: No)

Example:

"activities":[
    {
        "name": "CopyFromAmazonS3",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "AmazonS3ReadSettings",
                    "recursive": true,
                    "wildcardFolderPath": "myfolder*A",
                    "wildcardFileName": "*.csv"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
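The following variant is a sketch of the same source configured with a prefix filter, a last-modified time window, and partition discovery instead of wildcards. The prefix, time values, and partition root path are illustrative placeholders:

"source": {
    "type": "DelimitedTextSource",
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    },
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "recursive": true,
        "prefix": "myfolder/year=2020/",
        "modifiedDatetimeStart": "2020-08-27T00:00:00Z",
        "modifiedDatetimeEnd": "2020-08-28T00:00:00Z",
        "enablePartitionDiscovery": true,
        "partitionRootPath": "myfolder"
    }
}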

Folder and file filter examples

This section describes the resulting behavior of the folder path and file name with wildcard filters.

All of the following examples use the bucket named bucket with this source folder structure:

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    AnotherFolderB
        File6.csv

- key Folder*/* with recursive set to false: File1.csv and File2.json under FolderA are retrieved.
- key Folder*/* with recursive set to true: File1.csv, File2.json, File3.csv, File4.json, and File5.csv under FolderA (including Subfolder1) are retrieved.
- key Folder*/*.csv with recursive set to false: only File1.csv is retrieved.
- key Folder*/*.csv with recursive set to true: File1.csv, File3.csv, and File5.csv are retrieved.

File list examples

This section describes the resulting behavior of using a file list path in a Copy activity source.

Assume that you have the following source folder structure and want to copy File1.csv, Subfolder1/File3.csv, and Subfolder1/File5.csv:

Sample source structure:

bucket
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt

Content in FileListToCopy.txt:

File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv

Data Factory configuration:

In the dataset:
- Bucket: bucket
- Folder path: FolderA

In the Copy activity source:
- File list path: bucket/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line, with the relative path to the path configured in the dataset.
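In the copy activity source, such a configuration might look like this sketch (the bucket, folder, and file names are taken from the example above):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "fileListPath": "bucket/Metadata/FileListToCopy.txt"
    }
}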

Preserve metadata during copy

When you copy files from Amazon S3 to Azure Data Lake Storage Gen2 or Azure Blob storage, you can choose to preserve the file metadata along with the data. Learn more from Preserve metadata.
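As a sketch only, and assuming a binary copy from S3 to ADLS Gen2, metadata preservation is configured with the preserve setting on the copy activity, alongside source and sink; confirm the exact options in the Preserve metadata article:

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AmazonS3ReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "preserve": [
        "Attributes"
    ]
}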

Lookup activity properties

To learn details about the properties, check Lookup activity.

GetMetadata activity properties

To learn details about the properties, check GetMetadata activity.
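For example, a hypothetical GetMetadata activity that lists the child items and last-modified time of the folder defined in an Amazon S3 dataset might look like this sketch:

{
    "name": "GetMetadataFromAmazonS3",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "<Amazon S3 dataset name>",
            "type": "DatasetReference"
        },
        "fieldList": [
            "childItems",
            "lastModified"
        ]
    }
}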

Delete activity properties

To learn details about the properties, check Delete activity.

Legacy models

Note

The following models are still supported as is for backward compatibility. We suggest that you use the new model mentioned earlier. The Data Factory authoring UI has switched to generating the new model.

Legacy dataset model

- type: The type property of the dataset must be set to AmazonS3Object. (Required: Yes)
- bucketName: The S3 bucket name. The wildcard filter is not supported. (Required: Yes for the Copy or Lookup activity, No for the GetMetadata activity)
- key: The name or wildcard filter of the S3 object key under the specified bucket. Applies only when the prefix property is not specified. The wildcard filter is supported for both the folder part and the file name part. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character). Example 1: "key": "rootfolder/subfolder/*.csv". Example 2: "key": "rootfolder/subfolder/???20180427.txt". See more examples in Folder and file filter examples. Use ^ to escape if your actual folder or file name has a wildcard or this escape character inside. (Required: No)
- prefix: Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when the key property is not specified. (Required: No)
- version: The version of the S3 object, if S3 versioning is enabled. If a version is not specified, the latest version will be fetched. (Required: No)
- modifiedDatetimeStart: Files are filtered based on the last-modified attribute. Files are selected if their last modified time is within the range between modifiedDatetimeStart and modifiedDatetimeEnd. The time is applied to the UTC time zone in the format "2018-12-01T05:00:00Z". Be aware that enabling this setting affects the overall performance of data movement when you want to filter huge numbers of files. The properties can be NULL, which means no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last-modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last-modified attribute is less than the datetime value are selected. (Required: No)
- modifiedDatetimeEnd: Same as modifiedDatetimeStart. (Required: No)
- format: If you want to copy files as is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. If you want to parse or generate files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, Orc format, and Parquet format sections. (Required: No; required only for the non-binary copy scenario)
- compression: Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types are GZip, Deflate, BZip2, and ZipDeflate. Supported levels are Optimal and Fastest. (Required: No)

Tip

To copy all files under a folder, specify bucketName for the bucket and prefix for the folder part.

To copy a single file with a given name, specify bucketName for the bucket and key for the folder part plus the file name.

To copy a subset of files under a folder, specify bucketName for the bucket and key for the folder part plus a wildcard filter.

Example: using prefix

{
    "name": "AmazonS3Dataset",
    "properties": {
        "type": "AmazonS3Object",
        "linkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "bucketName": "testbucket",
            "prefix": "testFolder/test",
            "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
            "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

Example: using key and version (optional)

{
    "name": "AmazonS3Dataset",
    "properties": {
        "type": "AmazonS3",
        "linkedServiceName": {
            "referenceName": "<Amazon S3 linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "bucketName": "testbucket",
            "key": "testFolder/testfile.csv.gz",
            "version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

Legacy source model for the Copy activity

- type: The type property of the Copy activity source must be set to FileSystemSource. (Required: Yes)
- recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder will not be copied or created at the sink. Allowed values are true (default) and false. (Required: No)
- maxConcurrentConnections: The number of connections to connect to the data store concurrently. Specify this only when you want to limit concurrent connections to the data store. (Required: No)

Example:

"activities":[
    {
        "name": "CopyFromAmazonS3",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Amazon S3 input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "FileSystemSource",
                "recursive": true
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Next steps

For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see Supported data stores.