Run custom MapReduce programs
Apache Hadoop-based big data systems such as HDInsight enable data processing using a wide range of tools and technologies. The following table describes the main advantages and considerations for each one.
| Query mechanism | Advantages | Considerations |
|---|---|---|
| Apache Hive using HiveQL | Usually the simplest approach: a familiar SQL-like syntax that is well suited to batch processing, summarizing, and querying data that has some identifiable structure. | Not well suited to completely unstructured data, and complex processing logic may require custom user-defined functions (UDFs). |
| Apache Pig using Pig Latin | A data-flow language that handles more complex scenarios than Hive, such as merging, filtering, grouping, and restructuring sets of data. | Pig Latin is less familiar than SQL, and some tasks still require custom UDFs or external code. |
| Custom map/reduce | Gives you full control over each stage of processing, so you can parse completely unstructured data, call external services, reuse existing code through the streaming interface, and fine-tune the job for performance. | Requires more development and testing effort than Hive or Pig, typically in Java. |
| Apache HCatalog | Provides a shared table and storage management layer, so Hive, Pig, and map/reduce jobs can all work with the same tabular view of the data. | Adds another layer to manage, and relies on the data having a schema that can be described as tables. |
Typically, you use the simplest of these approaches that can provide the results you require. For example, you may be able to achieve such results by using just Hive, but for more complex scenarios you may need to use Pig, or even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig, that custom map and reduce components can provide better performance by allowing you to fine-tune and optimize the processing.
Map/reduce code consists of two separate functions implemented as map and reduce components. The map component is run in parallel on multiple cluster nodes, each node applying the mapping to the node's own subset of the data. The reduce component collates and summarizes the results from all the map functions. For more information on these two components, see Use MapReduce in Hadoop on HDInsight.
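For illustration, the following minimal sketch shows the shape of these two components for a simple word count, using the standard Hadoop MapReduce Java API. The class names TokenMapper and SumReducer are illustrative and not specific to HDInsight:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map component: runs in parallel across the cluster, emitting a (word, 1)
// pair for every word in the node's own slice of the input.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce component: collates the mappers' output and sums the counts for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

In practice, the map and reduce classes are often declared as static nested classes inside the job's driver class.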
In most HDInsight processing scenarios, it's simpler and more efficient to use a higher-level abstraction such as Pig or Hive. You can also create custom map and reduce components for use within Hive scripts to perform more sophisticated processing.
Custom map/reduce components are typically written in Java. Hadoop also provides a streaming interface that allows you to use components developed in other languages such as C#, F#, Visual Basic, Python, and JavaScript.
- For a walkthrough on developing custom Java MapReduce programs, see Develop Java MapReduce programs for Hadoop on HDInsight.
Consider creating your own map and reduce components under the following conditions:
- You need to process data that is completely unstructured by parsing the data and using custom logic to obtain structured information from it.
- You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive without resorting to creating a UDF. For example, you might need to use an external geocoding service to convert latitude and longitude coordinates or IP addresses in the source data to geographical location names.
- You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components by using the Hadoop streaming interface.
Most MapReduce programs are written in Java and compiled to a jar file.
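As an example of what such a program looks like, a minimal driver class like the following (the names MyCustomJob, TokenMapper, and SumReducer are illustrative, reusing the classes from the earlier sketch) wires the map and reduce components into a job; the input path and output directory are read from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and submits it. The input path and output
// directory come from the command line arguments.
public class MyCustomJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my custom job");
        job.setJarByClass(MyCustomJob.class);
        job.setMapperClass(TokenMapper.class);   // map component from the earlier sketch
        job.setReducerClass(SumReducer.class);   // reduce component from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

After you compile the classes against the Hadoop client libraries and package them into a jar, the fully qualified name of the driver class is what you pass to the yarn command shown later in this article.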
After you've developed, compiled, and tested your MapReduce program, use the scp command to upload your jar file to the head node:

scp mycustomprogram.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
Replace CLUSTERNAME with the cluster name. If you used a password to secure the SSH account, you're prompted to enter the password. If you used a certificate, you may need to use the -i parameter to specify the private key file.

Use the ssh command to connect to your cluster. Edit the following command by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
From the SSH session, execute your MapReduce program through YARN.
yarn jar mycustomprogram.jar mynamespace.myclass /example/data/sample.log /example/data/logoutput
This command submits the MapReduce job to YARN. The input file is /example/data/sample.log, and the output directory is /example/data/logoutput. The input file and any output files are stored in the default storage for the cluster.
- Use C# with MapReduce streaming on Apache Hadoop in HDInsight
- Develop Java MapReduce programs for Apache Hadoop on HDInsight
- Use Azure Toolkit for Eclipse to create Apache Spark applications for an HDInsight cluster
- Use Python User Defined Functions (UDF) with Apache Hive and Apache Pig in HDInsight