Use MapReduce in Apache Hadoop on HDInsight

Learn how to run MapReduce jobs on HDInsight clusters.

Example data

HDInsight provides various example data sets, which are stored in the /example/data and /HdiSamples directories. These directories are in the default storage for your cluster. In this document, we use the /example/data/gutenberg/davinci.txt file. This file contains the notebooks of Leonardo da Vinci.

Example MapReduce

An example MapReduce word count application is included with your HDInsight cluster. This example is located at /example/jars/hadoop-mapreduce-examples.jar on the default storage for your cluster.

The following Java code is the source of the MapReduce application contained in the hadoop-mapreduce-examples.jar file:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
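To see what the mapper and reducer produce together, the same logic can be sketched in plain Java with no Hadoop dependencies: the mapper emits a (word, 1) pair for each token, and the reducer sums the 1s emitted for each distinct word. The class name `WordCountSketch` and the sample sentence are illustrative only, not part of the Hadoop example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// A local sketch of the WordCount data flow, without the Hadoop framework.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // "Map" phase: split the input into tokens, as TokenizerMapper does.
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // "Reduce" phase, collapsed inline: sum the 1 emitted per token,
            // as IntSumReducer does for each key after the shuffle.
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints each distinct word with its count, e.g. {to=2, be=2, or=1, not=1}
        System.out.println(countWords("to be or not to be"));
    }
}
```

In the real job, the shuffle between map and reduce groups all (word, 1) pairs by key across the cluster; the sketch above collapses that into a single in-memory map.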

For instructions to write your own MapReduce applications, see the following document:

Run the MapReduce

HDInsight can run MapReduce jobs by using various methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.

| Use this... | ...to do this | ...with this cluster operating system | ...from this client operating system |
| --- | --- | --- | --- |
| SSH | Use the Hadoop command through SSH | Linux | Linux, Unix, Mac OS X, or Windows |
| Curl | Submit the job remotely by using REST | Linux or Windows | Linux, Unix, Mac OS X, or Windows |
| Windows PowerShell | Submit the job remotely by using Windows PowerShell | Linux or Windows | Windows |

Next steps

To learn more about working with data in HDInsight, see the following documents: