# Transform data using Hadoop Streaming activity in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the Tutorial: transform data before reading this article.

## JSON sample

```json
{
    "name": "Streaming Activity",
    "description": "Description",
    "type": "HDInsightStreaming",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "mapper": "MyMapper.exe",
        "reducer": "MyReducer.exe",
        "combiner": "MyCombiner.exe",
        "fileLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "filePaths": [
            "<containername>/example/apps/MyMapper.exe",
            "<containername>/example/apps/MyReducer.exe",
            "<containername>/example/apps/MyCombiner.exe"
        ],
        "input": "wasb://<containername>@<accountname>.blob.core.chinacloudapi.cn/example/input/MapperInput.txt",
        "output": "wasb://<containername>@<accountname>.blob.core.chinacloudapi.cn/example/output/ReducerOutput.txt",
        "commandEnvironment": [
            "CmdEnvVarName=CmdEnvVarValue"
        ],
        "getDebugInfo": "Failure",
        "arguments": [
            "SampleHadoopJobArgument1"
        ],
        "defines": {
            "param1": "param1Value"
        }
    }
}
```
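The mapper and reducer executables referenced in the sample (`MyMapper.exe`, `MyReducer.exe`) follow the standard Hadoop Streaming contract: each task reads lines from stdin and writes tab-separated key/value pairs to stdout. As a rough illustration of that contract (not the actual programs from the sample, which are compiled executables), here is a minimal word-count pair sketched in Python:

```python
import sys

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word on stdout's behalf."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum counts per key; Hadoop delivers the mapper output grouped (sorted) by key."""
    current, total = None, 0
    for pair in pairs:
        key, value = pair.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Run as `script.py map` or `script.py reduce`, reading stdin, writing stdout.
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    for out in (mapper if role == "map" else reducer)(sys.stdin):
        print(out)
```

Because the framework sorts mapper output by key before the reduce phase, the reducer only needs to detect key boundaries in a single sequential pass.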

## Syntax details

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity | Yes |
| description | Text describing what the activity is used for | No |
| type | For Hadoop Streaming activity, the activity type is HDInsightStreaming | Yes |
| linkedServiceName | Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. | Yes |
| mapper | Specifies the name of the mapper executable | Yes |
| reducer | Specifies the name of the reducer executable | Yes |
| combiner | Specifies the name of the combiner executable | No |
| fileLinkedService | Reference to an Azure Storage linked service used to store the mapper, combiner, and reducer programs to be executed. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. | No |
| filePaths | An array of paths to the mapper, combiner, and reducer programs stored in the Azure Storage referred to by fileLinkedService. The paths are case-sensitive. | Yes |
| input | Specifies the WASB path to the input file for the mapper. | Yes |
| output | Specifies the WASB path to the output file for the reducer. | Yes |
| getDebugInfo | Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster (or) specified by scriptLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No |
| arguments | Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. | No |
| defines | Specify parameters as key/value pairs for referencing within the Hive script. | No |
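Inside a running task, the `commandEnvironment` entries from the activity surface as environment variables, and the `arguments` array is passed on the command line. A hedged sketch of how a streaming script might pick both up at startup (the helper function is hypothetical; the variable and argument names match the JSON sample above):

```python
import os
import sys

def read_task_config(argv, environ):
    """Collect command-line arguments and a commandEnvironment variable
    as a streaming task would see them when it starts."""
    return {
        "args": list(argv[1:]),                   # e.g. ["SampleHadoopJobArgument1"]
        "env": environ.get("CmdEnvVarName", ""),  # set via commandEnvironment
    }

if __name__ == "__main__":
    config = read_task_config(sys.argv, os.environ)
    # Log to stderr so stdout stays reserved for key/value output.
    sys.stderr.write(f"task config: {config}\n")
```

Keeping diagnostics on stderr matters here: anything a streaming task prints to stdout is treated as job output by the framework.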

## Next steps

See the following articles that explain how to transform data in other ways: