Transform data using Hadoop MapReduce activity in Azure Data Factory or Synapse Analytics

APPLIES TO: Azure Data Factory, Azure Synapse Analytics

The HDInsight MapReduce activity in an Azure Data Factory or Synapse Analytics pipeline invokes a MapReduce program on your own or an on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

If you're new to the service, read through the introduction articles for Azure Data Factory and Synapse Analytics, and complete Tutorial: transform data before reading this article.

See Pig and Hive for details about running Pig and Hive scripts on an HDInsight cluster from a pipeline by using the HDInsight Pig and Hive activities.

Add an HDInsight MapReduce activity to a pipeline with UI

To add an HDInsight MapReduce activity to a pipeline, complete the following steps:

  1. Search for MapReduce in the pipeline Activities pane, and drag a MapReduce activity to the pipeline canvas.

  2. Select the new MapReduce activity on the canvas if it is not already selected.

  3. Select the HDI Cluster tab to select or create a new linked service to an HDInsight cluster that will be used to execute the MapReduce activity.

    Shows the UI for a MapReduce activity.

  4. Select the Jar tab to select or create a new Jar linked service to an Azure Storage account that will host your JAR file. Specify the name of the class to be executed and the path to the JAR file within the storage location. You can also configure advanced details, including a Jar libs location, a debugging configuration, and the arguments and parameters to be passed to the MapReduce program.

    Shows the UI for the Jar tab for a MapReduce activity.

Syntax

{
    "name": "Map Reduce Activity",
    "description": "Description",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "jarLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "MyAzureStorage/jars/sample.jar",
        "getDebugInfo": "Failure",
        "arguments": [
            "-SampleHadoopJobArgument1"
        ],
        "defines": {
            "param1": "param1Value"
        }
    }
}
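
As a rough illustration of how the parts of this definition fit together, the following Python sketch assembles the same JSON by using only the standard library. The helper name build_mapreduce_activity and its parameters are hypothetical and aren't part of any Azure SDK; the property names and values are taken from the syntax shown above.

```python
import json


def build_mapreduce_activity(name, cluster_linked_service, class_name, jar_file_path,
                             jar_linked_service=None, description=None,
                             get_debug_info=None, arguments=None, defines=None):
    """Assemble an HDInsightMapReduce activity definition as a plain dict.

    Hypothetical helper for illustration only: it mirrors the JSON layout
    from the Syntax section and doesn't call any Azure API.
    """
    def linked_service_ref(reference_name):
        return {"referenceName": reference_name, "type": "LinkedServiceReference"}

    # Optional properties are added only when a value is supplied.
    type_properties = {"className": class_name}
    if jar_linked_service is not None:
        type_properties["jarLinkedService"] = linked_service_ref(jar_linked_service)
    type_properties["jarFilePath"] = jar_file_path
    if get_debug_info is not None:
        type_properties["getDebugInfo"] = get_debug_info
    if arguments:
        type_properties["arguments"] = list(arguments)
    if defines:
        type_properties["defines"] = dict(defines)

    activity = {"name": name}
    if description is not None:
        activity["description"] = description
    activity["type"] = "HDInsightMapReduce"
    activity["linkedServiceName"] = linked_service_ref(cluster_linked_service)
    activity["typeProperties"] = type_properties
    return activity


activity = build_mapreduce_activity(
    name="Map Reduce Activity",
    description="Description",
    cluster_linked_service="MyHDInsightLinkedService",
    class_name="org.myorg.SampleClass",
    jar_linked_service="MyAzureStorageLinkedService",
    jar_file_path="MyAzureStorage/jars/sample.jar",
    get_debug_info="Failure",
    arguments=["-SampleHadoopJobArgument1"],
    defines={"param1": "param1Value"},
)
print(json.dumps(activity, indent=4))
```

Building the definition from a function like this keeps the two linked service references in one consistent shape and leaves optional properties out of the JSON when they aren't supplied.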

Syntax details

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity. | Yes |
| description | Text describing what the activity is used for. | No |
| type | For the MapReduce activity, the activity type is HDInsightMapReduce. | Yes |
| linkedServiceName | Reference to the HDInsight cluster registered as a linked service. To learn about this linked service, see the Compute linked services article. | Yes |
| className | Name of the class to be executed. | Yes |
| jarLinkedService | Reference to an Azure Storage linked service used to store the Jar files. Only Azure Blob Storage and ADLS Gen2 linked services are supported here. If you don't specify this linked service, the Azure Storage linked service defined in the HDInsight linked service is used. | No |
| jarFilePath | Path to the Jar files stored in the Azure Storage referred to by jarLinkedService. The file name is case-sensitive. | Yes |
| jarlibs | String array of the paths to the Jar library files referenced by the job, stored in the Azure Storage defined in jarLinkedService. The file name is case-sensitive. | No |
| getDebugInfo | Specifies when the log files are copied to the Azure Storage used by the HDInsight cluster or specified by jarLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No |
| arguments | Specifies an array of arguments for a Hadoop job. The arguments are passed as command-line arguments to each task. | No |
| defines | Parameters, specified as key/value pairs, for referencing within the MapReduce job. | No |
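
The required properties and allowed values listed above can be checked mechanically before a pipeline is published. The following Python sketch shows one way to do that, assuming the activity definition is available as a dictionary. The function name validate_mapreduce_activity is hypothetical, and the checks cover only what this article states (required properties, the activity type as it appears in the Syntax section, and the allowed getDebugInfo values), not everything the service validates.

```python
ALLOWED_DEBUG_INFO = {"None", "Always", "Failure"}


def validate_mapreduce_activity(activity):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration; it doesn't call any Azure API.
    """
    problems = []

    # Top-level properties marked as required.
    for key in ("name", "type", "linkedServiceName"):
        if not activity.get(key):
            problems.append(f"missing required property: {key}")

    # Exact string comparison, which may be stricter than the service itself.
    activity_type = activity.get("type")
    if activity_type and activity_type != "HDInsightMapReduce":
        problems.append(f"unexpected activity type: {activity_type}")

    # Required properties nested under typeProperties.
    type_properties = activity.get("typeProperties") or {}
    for key in ("className", "jarFilePath"):
        if not type_properties.get(key):
            problems.append(f"missing required property: typeProperties.{key}")

    # getDebugInfo is optional and defaults to None when omitted.
    debug_info = type_properties.get("getDebugInfo", "None")
    if debug_info not in ALLOWED_DEBUG_INFO:
        problems.append(f"unsupported getDebugInfo value: {debug_info}")

    return problems


# A deliberately incomplete definition used to exercise the checks.
incomplete = {
    "name": "Map Reduce Activity",
    "type": "HDInsightHive",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "className": "org.myorg.SampleClass",
        "getDebugInfo": "OnError",
    },
}
problems = validate_mapreduce_activity(incomplete)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a single run report every issue in a definition.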

Example

You can use the HDInsight MapReduce activity to run any MapReduce JAR file on an HDInsight cluster. In the following sample JSON definition, the HDInsight MapReduce activity is configured to run a Mahout JAR file.

{
    "name": "MapReduce Activity for Mahout",
    "description": "Custom MapReduce to generate Mahout result",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
        "jarLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
        "arguments": [
            "-s",
            "SIMILARITY_LOGLIKELIHOOD",
            "--input",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/input",
            "--output",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/output/",
            "--maxSimilaritiesPerItem",
            "500",
            "--tempDir",
            "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout/temp/mahout"
        ]
    }
}

You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a few extra arguments (for example, mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments from the MapReduce arguments, consider passing both the option and its value as arguments, as shown in the preceding example (-s, --input, --output, and so on are options, each immediately followed by its value).
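
One way to keep each option next to its value is to build the arguments array from option/value pairs. The following Python sketch uses a hypothetical helper, to_arguments, and reuses the option names and values from the Mahout example. It only produces the list that would go in the arguments section and doesn't submit anything to a cluster.

```python
import json


def to_arguments(option_value_pairs):
    """Flatten (option, value) pairs into the flat list used by "arguments"."""
    arguments = []
    for option, value in option_value_pairs:
        arguments.append(option)
        # Every element of the array is a string, so numbers are converted.
        arguments.append(str(value))
    return arguments


# Storage location taken from the Mahout example in this article.
base = "wasb://adfsamples@spestore.blob.core.chinacloudapi.cn/Mahout"

pairs = [
    ("-s", "SIMILARITY_LOGLIKELIHOOD"),
    ("--input", f"{base}/input"),
    ("--output", f"{base}/output/"),
    ("--maxSimilaritiesPerItem", 500),
    ("--tempDir", f"{base}/temp/mahout"),
]

arguments = to_arguments(pairs)
print(json.dumps(arguments, indent=4))
```

Because the pairs are kept together until the final flattening step, an option can't be separated from its value when arguments are added or reordered.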

See the following articles that explain how to transform data in other ways: