Transform data using the Hadoop MapReduce activity in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

The HDInsight MapReduce activity in a Data Factory pipeline invokes a MapReduce program on your own or an on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

If you are new to Azure Data Factory, read Introduction to Azure Data Factory and complete Tutorial: transform data before reading this article.

See Pig and Hive for details about running Pig/Hive scripts on an HDInsight cluster from a pipeline by using the HDInsight Pig and Hive activities.

Syntax

{
    "name": "Map Reduce Activity",
    "description": "Description",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "jarLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "MyAzureStorage/jars/sample.jar",
        "getDebugInfo": "Failure",
        "arguments": [
            "-SampleHadoopJobArgument1"
        ],
        "defines": {
            "param1": "param1Value"
        }
    }
}

Syntax details

Property | Description | Required
name | Name of the activity. | Yes
description | Text describing what the activity is used for. | No
type | For the MapReduce activity, the activity type is HDInsightMapReduce. | Yes
linkedServiceName | Reference to the HDInsight cluster registered as a linked service in Data Factory. To learn about this linked service, see the Compute linked services article. | Yes
className | Name of the class to be executed. | Yes
jarLinkedService | Reference to an Azure Storage linked service used to store the Jar files. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. | No
jarFilePath | Path to the Jar file stored in the Azure Storage referred to by jarLinkedService. The file name is case-sensitive. | Yes
jarlibs | String array of paths to the Jar library files referenced by the job and stored in the Azure Storage defined in jarLinkedService. The file names are case-sensitive. | No
getDebugInfo | Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by jarLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No
arguments | Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. | No
defines | Specify parameters as key/value pairs for referencing within the MapReduce job. | No
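The required properties and the allowed getDebugInfo values listed above can be checked before a pipeline is deployed. The following is a minimal sketch in Python, not part of any Data Factory SDK: the function name check_mapreduce_activity and its error messages are hypothetical, and the rules it applies are only those stated in the table.

```python
# Rules below are copied from the property table; the helper itself is a
# hypothetical illustration and is not a Data Factory API.
REQUIRED_TOP_LEVEL = ("name", "type", "linkedServiceName")
REQUIRED_TYPE_PROPERTIES = ("className", "jarFilePath")
ALLOWED_DEBUG_INFO = ("None", "Always", "Failure")


def check_mapreduce_activity(activity):
    """Return a list of problems found in an activity definition (empty if none)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in activity:
            problems.append(f"missing required property: {key}")
    if "type" in activity and activity["type"] != "HDInsightMapReduce":
        problems.append("type must be HDInsightMapReduce")
    type_props = activity.get("typeProperties", {})
    for key in REQUIRED_TYPE_PROPERTIES:
        if key not in type_props:
            problems.append(f"missing required property: typeProperties.{key}")
    # getDebugInfo is optional; the table gives None as its default.
    debug = type_props.get("getDebugInfo", "None")
    if debug not in ALLOWED_DEBUG_INFO:
        problems.append(f"getDebugInfo must be one of {ALLOWED_DEBUG_INFO}")
    return problems


# The same definition as the syntax sample, written as a Python dict.
activity = {
    "name": "Map Reduce Activity",
    "description": "Description",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "jarFilePath": "MyAzureStorage/jars/sample.jar",
        "getDebugInfo": "Failure",
        "arguments": ["-SampleHadoopJobArgument1"],
        "defines": {"param1": "param1Value"},
    },
}

print(check_mapreduce_activity(activity))  # prints []
```

A check like this only catches missing or misspelled properties locally; the service performs its own validation when the pipeline is published.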

Example

You can use the HDInsight MapReduce activity to run any MapReduce jar file on an HDInsight cluster. In the following sample JSON definition of a pipeline, the HDInsight activity is configured to run a Mahout JAR file.

{
    "name": "MapReduce Activity for Mahout",
    "description": "Custom MapReduce to generate Mahout result",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
        "jarLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
        "arguments": [
            "-s",
            "SIMILARITY_LOGLIKELIHOOD",
            "--input",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/input",
            "--output",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/output/",
            "--maxSimilaritiesPerItem",
            "500",
            "--tempDir",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/temp/mahout"
        ]
    }
}

You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a few extra arguments (for example, mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments from the MapReduce arguments, consider using both option and value as arguments, as shown in the preceding example (-s, --input, --output, and so on are options immediately followed by their values).
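One way a MapReduce program might take advantage of the option/value convention is to group the flat argument list into pairs and set aside anything that does not fit. The sketch below is an illustration in Python rather than the Java a real job would use. It assumes every option starts with "-" and is followed by exactly one value that does not start with "-". The function name pair_arguments is hypothetical, and so is the form of the extra framework entry, because this article does not specify how such arguments appear.

```python
def pair_arguments(arguments):
    """Split a flat argument list into ({option: value}, [unpaired tokens]).

    Assumption: an option starts with '-' and is immediately followed by
    one value that does not start with '-'. Everything else is unpaired.
    """
    options = {}
    unpaired = []
    i = 0
    while i < len(arguments):
        token = arguments[i]
        has_value = i + 1 < len(arguments) and not arguments[i + 1].startswith("-")
        if token.startswith("-") and has_value:
            options[token] = arguments[i + 1]
            i += 2  # consume the option and its value together
        else:
            unpaired.append(token)
            i += 1
    return options, unpaired


# Arguments from the Mahout example, plus one hypothetical stray entry
# standing in for an argument added by the framework.
args = [
    "-s", "SIMILARITY_LOGLIKELIHOOD",
    "--maxSimilaritiesPerItem", "500",
    "mapreduce.job.tags=example",
]
options, unpaired = pair_arguments(args)
print(options)
print(unpaired)
```

Values that themselves begin with "-" (a negative number, for example) would break this simple rule, so a real job would more likely rely on an established command-line parser.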

Next steps

See the following articles that explain how to transform data in other ways: