Copy data from the HDFS server by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines how to copy data from the Hadoop Distributed File System (HDFS) server. To learn about Azure Data Factory, read the introductory article.

Supported capabilities

The HDFS connector is supported for the following activities:

Specifically, the HDFS connector supports:

  • Copying files by using Windows (Kerberos) or Anonymous authentication.
  • Copying files by using the webhdfs protocol or built-in DistCp support.
  • Copying files as is, or parsing or generating files with the supported file formats and compression codecs.

Prerequisites

If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private Cloud, you need to set up a self-hosted integration runtime to connect to it.

If your data store is a managed cloud data service, you can use the Azure integration runtime. If access is restricted to IPs that are allowed in the firewall rules, you can add the Azure integration runtime IPs to the allow list.

For more information about the network security mechanisms and options supported by Data Factory, see Data access strategies.

Note

Make sure that the integration runtime can access all of the [name node server]:[name node port] and [data node servers]:[data node port] endpoints of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.
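
For example, a quick way to confirm that the integration runtime machine can reach the name node over webhdfs is to call the WebHDFS REST API directly from that machine (the host name and path below are placeholders):

# List the HDFS root directory over webhdfs (Anonymous authentication)
curl -i "http://<name node server>:50070/webhdfs/v1/?op=LISTSTATUS"

# With Windows (Kerberos) authentication, obtain a ticket first (kinit) and use SPNEGO negotiation
curl -i --negotiate -u : "http://<name node server>:50070/webhdfs/v1/?op=LISTSTATUS"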

Get started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

The following sections provide details about properties that are used to define Data Factory entities specific to HDFS.

Linked service properties

The following properties are supported for the HDFS linked service:

Property Description Required
type The type property must be set to Hdfs. Yes
url The URL to the HDFS server. Yes
authenticationType The allowed values are Anonymous or Windows.
To set up your on-premises environment, see the Use Kerberos authentication for the HDFS connector section.
Yes
userName The user name for Windows authentication. For Kerberos authentication, specify <username>@<domain>.com. Yes (for Windows authentication)
password The password for Windows authentication. Mark this field as a SecureString to store it securely in your data factory, or reference a secret stored in an Azure key vault. Yes (for Windows authentication)
connectVia The integration runtime to be used to connect to the data store. To learn more, see the Prerequisites section. If the integration runtime isn't specified, the service uses the default Azure integration runtime. No

Example: using Anonymous authentication

{
    "name": "HDFSLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "url" : "http://<machine>:50070/webhdfs/v1/",
            "authenticationType": "Anonymous",
            "userName": "hadoop"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Example: using Windows authentication

{
    "name": "HDFSLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "url" : "http://<machine>:50070/webhdfs/v1/",
            "authenticationType": "Windows",
            "userName": "<username>@<domain>.com (for Kerberos auth)",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties that are available for defining datasets, see Datasets in Azure Data Factory.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HDFS under location settings in the format-based dataset:

Property Description Required
type The type property under location in the dataset must be set to HdfsLocation. Yes
folderPath The path to the folder. If you want to use a wildcard to filter the folder, skip this setting and specify the path in the activity source settings. No
fileName The file name under the specified folderPath. If you want to use a wildcard to filter files, skip this setting and specify the file name in the activity source settings. No

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<HDFS linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "HdfsLocation",
                "folderPath": "root/folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

Copy activity properties

For a full list of sections and properties that are available for defining activities, see Pipelines and activities in Azure Data Factory. This section provides a list of properties that are supported by the HDFS source.

HDFS as source

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HDFS under storeSettings settings in the format-based Copy source:

Property Description Required
type The type property under storeSettings must be set to HdfsReadSettings. Yes
Locate the files to copy
OPTION 1: static path
Copy from the folder or file path that's specified in the dataset. If you want to copy all files from a folder, additionally specify wildcardFileName as *.
OPTION 2: wildcard
- wildcardFolderPath
The folder path with wildcard characters to filter source folders.
Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your actual folder name has a wildcard or this escape character inside.
For more examples, see Folder and file filter examples.
No
OPTION 2: wildcard
- wildcardFileName
The file name with wildcard characters under the specified folderPath/wildcardFolderPath to filter source files.
Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape character inside. For more examples, see Folder and file filter examples.
Yes
OPTION 3: a list of files
- fileListPath
Indicates to copy a specified file set. Point to a text file that includes a list of files you want to copy (one file per line, with the relative path to the path configured in the dataset).
When you use this option, do not specify the file name in the dataset. For more examples, see File list examples.
No
Additional settings
recursive Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink.
Allowed values are true (default) and false.
This property doesn't apply when you configure fileListPath.
No
deleteFilesAfterCompletion Indicates whether the binary files are deleted from the source store after successfully moving to the destination store. The file deletion is per file, so when the Copy activity fails, you will see that some files have already been copied to the destination and deleted from the source, while others still remain in the source store.
This property is valid only in the binary files copy scenario. The default value: false.
No
modifiedDatetimeStart Files are filtered based on the Last Modified attribute.
Files are selected if their last modified time is within the range of modifiedDatetimeStart to modifiedDatetimeEnd. The time is applied to the UTC time zone in the format 2018-12-01T05:00:00Z.
The properties can be NULL, which means that no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.
This property doesn't apply when you configure fileListPath.
No
modifiedDatetimeEnd Same as above. No
enablePartitionDiscovery For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns.
Allowed values are false (default) and true.
No
partitionRootPath When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns.

If it is not specified, by default:
- When you use the file path in the dataset or the list of files on the source, the partition root path is the path configured in the dataset.
- When you use a wildcard folder filter, the partition root path is the sub-path before the first wildcard.

For example, assuming you configure the path in the dataset as "root/folder/year=2020/month=08/day=27":
- If you specify the partition root path as "root/folder/year=2020", the Copy activity generates two more columns, month and day, with the values "08" and "27" respectively, in addition to the columns inside the files.
- If the partition root path is not specified, no extra columns are generated.
No
maxConcurrentConnections The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit the concurrent connections to the data store. No
DistCp settings
distcpSettings The property group to use when you use HDFS DistCp. No
resourceManagerEndpoint The YARN (Yet Another Resource Negotiator) endpoint. Yes, if using DistCp
tempScriptPath A folder path that's used to store the temp DistCp command script. The script file is generated by Data Factory and is removed after the Copy job is finished. Yes, if using DistCp
distcpOptions Additional options provided to the DistCp command. No

Example:

"activities":[
    {
        "name": "CopyFromHDFS",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "HdfsReadSettings",
                    "recursive": true,
                    "distcpSettings": {
                        "resourceManagerEndpoint": "resourcemanagerendpoint:8088",
                        "tempScriptPath": "/usr/hadoop/tempscript",
                        "distcpOptions": "-m 100"
                    }
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
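
As a further illustration, the same storeSettings block can combine the filter options described in the table above: a wildcard path, a last-modified time window, and recursion. The folder names and timestamps in this sketch are placeholders:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "HdfsReadSettings",
        "recursive": true,
        "wildcardFolderPath": "root/folder/*",
        "wildcardFileName": "*.csv",
        "modifiedDatetimeStart": "2020-08-01T00:00:00Z",
        "modifiedDatetimeEnd": "2020-09-01T00:00:00Z"
    }
}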

Folder and file filter examples

This section describes the resulting behavior if you use a wildcard filter with the folder path and file name.

folderPath fileName recursive Source folder structure and filter result (retrieved files are marked)
Folder* (empty, use default) false
FolderA
    File1.csv (retrieved)
    File2.json (retrieved)
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv (retrieved)
Folder* (empty, use default) true
FolderA
    File1.csv (retrieved)
    File2.json (retrieved)
    Subfolder1
        File3.csv (retrieved)
        File4.json (retrieved)
        File5.csv (retrieved)
AnotherFolderB
    File6.csv (retrieved)
Folder* *.csv false
FolderA
    File1.csv (retrieved)
    File2.json
    Subfolder1
        File3.csv
        File4.json
        File5.csv
AnotherFolderB
    File6.csv (retrieved)
Folder* *.csv true
FolderA
    File1.csv (retrieved)
    File2.json
    Subfolder1
        File3.csv (retrieved)
        File4.json
        File5.csv (retrieved)
AnotherFolderB
    File6.csv (retrieved)

File list examples

This section describes the behavior that results from using a file list path in the Copy activity source. It assumes that you have the following source folder structure and want to copy the files listed in FileListToCopy.txt:

Sample source structure Content in FileListToCopy.txt Azure Data Factory configuration
root
    FolderA
        File1.csv
        File2.json
        Subfolder1
            File3.csv
            File4.json
            File5.csv
    Metadata
        FileListToCopy.txt
File1.csv
Subfolder1/File3.csv
Subfolder1/File5.csv
In the dataset:
- Folder path: root/FolderA

In the Copy activity source:
- File list path: root/Metadata/FileListToCopy.txt

The file list path points to a text file in the same data store that includes a list of files you want to copy (one file per line, with the relative path to the path configured in the dataset).

Use DistCp to copy data from HDFS

DistCp is a Hadoop native command-line tool for doing a distributed copy in a Hadoop cluster. When you run a command in DistCp, it first lists all the files to be copied and then creates several Map jobs in the Hadoop cluster. Each Map job does a binary copy from the source to the sink.

The Copy activity supports using DistCp to copy files as is into Azure Blob storage (including staged copy). In this case, DistCp can take advantage of your cluster's power instead of running on the self-hosted integration runtime. Using DistCp provides better copy throughput, especially if your cluster is very powerful. Based on the configuration in your data factory, the Copy activity automatically constructs a DistCp command, submits it to your Hadoop cluster, and monitors the copy status.
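
For illustration, the command that gets constructed is similar in shape to a manual DistCp invocation such as the following; the host names, container, storage account, and paths are placeholders, and the exact options depend on your cluster and the distcpOptions you supply:

hadoop distcp -m 100 \
    hdfs://<name node server>:8020/source/folder \
    wasbs://<container>@<storage account>.blob.core.windows.net/dest/folder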

Prerequisites

To use DistCp to copy files as is from HDFS to Azure Blob storage (including staged copy), make sure that your Hadoop cluster meets the following requirements:

  • The MapReduce and YARN services are enabled.

  • The YARN version is 2.5 or later.

  • The HDFS server is integrated with your target data store: Azure Blob storage:

    Azure Blob FileSystem is natively supported since Hadoop 2.7. You need only to specify the JAR path in the Hadoop environment configuration (a minimal configuration sketch follows this list).

  • Prepare a temp folder in HDFS. This temp folder is used to store a DistCp shell script, so it occupies only KB-level space.

  • Make sure that the user account that's provided in the HDFS linked service has permission to:

    • Submit an application in YARN.
    • Create a subfolder and read/write files under the temp folder.
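
As a rough sketch of the Blob storage integration mentioned in the list above, and assuming the hadoop-azure JAR and its Azure Storage client dependency are already on the Hadoop classpath, core-site.xml carries the storage account credentials along these lines (the account name and key are placeholders):

    <property>
        <name>fs.azure.account.key.<storage account>.blob.core.windows.net</name>
        <value><storage account access key></value>
    </property>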

Configurations

For DistCp-related configurations and examples, go to the HDFS as source section.

Use Kerberos authentication for the HDFS connector

There are two options for setting up the on-premises environment to use Kerberos authentication for the HDFS connector. You can choose the one that better fits your situation.

For either option, make sure you turn on webhdfs for the Hadoop cluster:

  1. Create the HTTP principal and keytab for webhdfs.

    Important

    The HTTP Kerberos principal must start with "HTTP/" according to the Kerberos HTTP SPNEGO specification.

    Kadmin> addprinc -randkey HTTP/<namenode hostname>@<REALM.COM>
    Kadmin> ktadd -k /etc/security/keytab/spnego.service.keytab HTTP/<namenode hostname>@<REALM.COM>
    
  2. HDFS configuration options: add the following three properties in hdfs-site.xml.

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>HTTP/_HOST@<REALM.COM></value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.keytab</name>
        <value>/etc/security/keytab/spnego.service.keytab</value>
    </property>
    

Option 1: Join a self-hosted integration runtime machine in the Kerberos realm

Requirements

  • The self-hosted integration runtime machine needs to join the Kerberos realm and can't join any Windows domain.

How to configure

On the KDC server:

Create a principal for Azure Data Factory to use, and specify the password.

Important

The user name should not contain the hostname.

Kadmin> addprinc <username>@<REALM.COM>

On the self-hosted integration runtime machine:

  1. Run the Ksetup utility to configure the Kerberos Key Distribution Center (KDC) server and realm.

    The machine must be configured as a member of a workgroup, because a Kerberos realm is different from a Windows domain. You can achieve this configuration by setting the Kerberos realm and adding a KDC server by running the following commands. Replace REALM.COM with your own realm name.

    C:> Ksetup /setdomain REALM.COM
    C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
    

    After you run these commands, restart the machine.

  2. Verify the configuration with the Ksetup command. The output should be like:

    C:> Ksetup
    default realm = REALM.COM (external)
    REALM.com:
        kdc = <your_kdc_server_address>
    

In your data factory:

  • Configure the HDFS connector by using Windows authentication together with your Kerberos principal name and password to connect to the HDFS data source. For configuration details, check the HDFS linked service properties section.

Option 2: Enable mutual trust between the Windows domain and the Kerberos realm

Requirements

  • The self-hosted integration runtime machine must join a Windows domain.
  • You need permission to update the domain controller's settings.

How to configure

Note

Replace REALM.COM and AD.COM in the following tutorial with your own realm name and domain controller.

On the KDC server:

  1. Edit the KDC configuration in the krb5.conf file to let the KDC trust the Windows domain by referring to the following configuration template. By default, the configuration is located at /etc/krb5.conf.

    [logging]
     default = FILE:/var/log/krb5libs.log
     kdc = FILE:/var/log/krb5kdc.log
     admin_server = FILE:/var/log/kadmind.log
    
    [libdefaults]
     default_realm = REALM.COM
     dns_lookup_realm = false
     dns_lookup_kdc = false
     ticket_lifetime = 24h
     renew_lifetime = 7d
     forwardable = true
    
    [realms]
     REALM.COM = {
      kdc = node.REALM.COM
      admin_server = node.REALM.COM
     }
    AD.COM = {
     kdc = windc.ad.com
     admin_server = windc.ad.com
    }
    
    [domain_realm]
     .REALM.COM = REALM.COM
     REALM.COM = REALM.COM
     .ad.com = AD.COM
     ad.com = AD.COM
    
    [capaths]
     AD.COM = {
      REALM.COM = .
     }
    

    After you configure the file, restart the KDC service.

  2. Prepare a principal named krbtgt/REALM.COM@AD.COM in the KDC server with the following command:

    Kadmin> addprinc krbtgt/REALM.COM@AD.COM
    
  3. In the HDFS service configuration, add RULE:[1:$1@$0](.*\@AD.COM)s/\@.*// to the hadoop.security.auth_to_local property. A sketch of where this rule can sit in the configuration follows.
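
    For illustration only, assuming the property lives in core-site.xml (the exact file can vary by distribution), the rule can be appended ahead of the default mapping like this:

    <property>
        <name>hadoop.security.auth_to_local</name>
        <value>
            RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//
            DEFAULT
        </value>
    </property>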

On the domain controller:

  1. Run the following Ksetup commands to add a realm entry:

    C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
    C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
    
  2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal krbtgt/REALM.COM@AD.COM.

    C:> netdom trust REALM.COM /Domain: AD.COM /add /realm /password:[password]
    
  3. Select the encryption algorithm that's used in Kerberos.

    a. Select Server Manager > Group Policy Management > Domain > Group Policy Objects > Default or Active Domain Policy, and then select Edit.

    b. On the Group Policy Management Editor pane, select Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies > Security Options, and then configure Network security: Configure encryption types allowed for Kerberos.

    c. Select the encryption algorithm you want to use when you connect to the KDC server. You can select all the options.

    Screenshot of the "Network security: Configure encryption types allowed for Kerberos" pane

    d. Use the Ksetup command to specify the encryption algorithm to be used on the specified realm.

    C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96
    
  4. Create the mapping between the domain account and the Kerberos principal, so that you can use the Kerberos principal in the Windows domain.

    a. Select Administrative tools > Active Directory Users and Computers.

    b. Configure advanced features by selecting View > Advanced Features.

    c. On the Advanced Features pane, right-click the account to which you want to create mappings, and on the Name Mappings pane, select the Kerberos Names tab.

    d. Add a principal from the realm.

    Screenshot of the Security Identity Mapping pane

On the self-hosted integration runtime machine:

  • Run the following Ksetup commands to add a realm entry.

    C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
    C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
    

In your data factory:

  • Configure the HDFS connector by using Windows authentication together with either your domain account or Kerberos principal to connect to the HDFS data source. For configuration details, see the HDFS linked service properties section.

Lookup activity properties

For information about Lookup activity properties, see Lookup activity in Azure Data Factory.

Delete activity properties

For information about Delete activity properties, see Delete activity in Azure Data Factory.

Legacy models

Note

The following models are still supported as is for backward compatibility. We recommend that you use the previously discussed new model, because the Azure Data Factory authoring UI has switched to generating the new model.

Legacy dataset model

Property Description Required
type The type property of the dataset must be set to FileShare. Yes
folderPath The path to the folder. A wildcard filter is supported. Allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual file name has a wildcard or this escape character inside.

Example: rootfolder/subfolder/. See more examples in Folder and file filter examples.
Yes
fileName The name or wildcard filter for the files under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder.

For the filter, allowed wildcards are * (matches zero or more characters) and ? (matches zero or a single character).
- Example 1: "fileName": "*.csv"
- Example 2: "fileName": "???20180427.txt"
Use ^ to escape if your actual folder name has a wildcard or this escape character inside.
No
modifiedDatetimeStart Files are filtered based on the Last Modified attribute. Files are selected if their last modified time is within the range of modifiedDatetimeStart to modifiedDatetimeEnd. The time is applied to the UTC time zone in the format 2018-12-01T05:00:00Z.

Be aware that the overall performance of data movement is affected by enabling this setting when you want to apply a file filter to a large number of files.

The properties can be NULL, which means that no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.
No
modifiedDatetimeEnd Files are filtered based on the Last Modified attribute. Files are selected if their last modified time is within the range of modifiedDatetimeStart to modifiedDatetimeEnd. The time is applied to the UTC time zone in the format 2018-12-01T05:00:00Z.

Be aware that the overall performance of data movement is affected by enabling this setting when you want to apply a file filter to a large number of files.

The properties can be NULL, which means that no file attribute filter is applied to the dataset. When modifiedDatetimeStart has a datetime value but modifiedDatetimeEnd is NULL, files whose last modified attribute is greater than or equal to the datetime value are selected. When modifiedDatetimeEnd has a datetime value but modifiedDatetimeStart is NULL, files whose last modified attribute is less than the datetime value are selected.
No
format If you want to copy files as is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions.

If you want to parse files with a specific format, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, ORC format, and Parquet format sections.
No (only for binary copy scenario)
compression Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs.
Supported types are: Gzip, Deflate, Bzip2, and ZipDeflate.
Supported levels are: Optimal and Fastest.
No

Tip

To copy all files under a folder, specify folderPath only.
To copy a single file with a specified name, specify folderPath with the folder part and fileName with the file name.
To copy a subset of files under a folder, specify folderPath with the folder part and fileName with a wildcard filter.

Example:

{
    "name": "HDFSDataset",
    "properties": {
        "type": "FileShare",
        "linkedServiceName":{
            "referenceName": "<HDFS linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "folder/subfolder/",
            "fileName": "*",
            "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
            "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "rowDelimiter": "\n"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

Legacy Copy activity source model

Property Description Required
type The type property of the Copy activity source must be set to HdfsSource. Yes
recursive Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, an empty folder or subfolder will not be copied or created at the sink.
Allowed values are true (default) and false.
No
distcpSettings The property group to use when you're using HDFS DistCp. No
resourceManagerEndpoint The YARN Resource Manager endpoint. Yes, if using DistCp
tempScriptPath A folder path that's used to store the temp DistCp command script. The script file is generated by Data Factory and is removed after the Copy job is finished. Yes, if using DistCp
distcpOptions Additional options provided to the DistCp command. No
maxConcurrentConnections The number of connections that can connect to the storage store concurrently. Specify a value only when you want to limit the concurrent connections to the data store. No

Example: HDFS source in the Copy activity using DistCp

"source": {
    "type": "HdfsSource",
    "distcpSettings": {
        "resourceManagerEndpoint": "resourcemanagerendpoint:8088",
        "tempScriptPath": "/usr/hadoop/tempscript",
        "distcpOptions": "-m 100"
    }
}

Next steps

For a list of data stores that are supported as sources and sinks by the Copy activity in Azure Data Factory, see Supported data stores.