Develop Java MapReduce programs for Apache Hadoop on HDInsight

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

Note

This example was most recently tested on HDInsight 3.6.

Prerequisites

  • Java JDK 8 or later (or an equivalent, such as OpenJDK).

    Note

    HDInsight versions 3.4 and earlier use Java 7. HDInsight 3.5 and later use Java 8.

  • Apache Maven

Configure development environment

The following environment variables may be set when you install Java and the JDK. However, you should check that they exist and that they contain the correct values for your system; a quick check is sketched after the list below.

  • JAVA_HOME - should point to the directory where the Java runtime environment (JRE) is installed. For example, on an OS X, Unix, or Linux system, it should have a value similar to /usr/lib/jvm/java-7-oracle. On Windows, it would have a value similar to c:\Program Files (x86)\Java\jre1.7.

  • PATH - should contain the following paths:

    • JAVA_HOME (or the equivalent path)

    • JAVA_HOME\bin (or the equivalent path)

    • The directory where Maven is installed
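
The following commands are a minimal sketch of how to confirm these settings from a Unix-style shell (on Windows, use echo %JAVA_HOME% and the matching cmd or PowerShell equivalents). The exact output depends on your system and installed versions:

    echo $JAVA_HOME      # should print the JDK/JRE install directory
    java -version        # confirms Java is on the PATH
    javac -version       # confirms the JDK (not just the JRE) is available
    mvn --version        # confirms Maven is on the PATH and shows the JDK it uses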

Create a Maven project

  1. From a terminal session, or from the command line in your development environment, change directories to the location where you want to store this project.

  2. Use the mvn command, which is installed with Maven, to generate the scaffolding for the project.

    mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    

    Note

    If you are using PowerShell, you must enclose the -D parameters in double quotes.

    mvn archetype:generate "-DgroupId=org.apache.hadoop.examples" "-DartifactId=wordcountjava" "-DarchetypeArtifactId=maven-archetype-quickstart" "-DinteractiveMode=false"

    This command creates a directory with the name specified by the artifactId parameter (wordcountjava in this example). This directory contains the following items:

    • pom.xml - The Project Object Model (POM), which contains information and configuration details used to build the project.

    • src - The directory that contains the application.

  3. Delete the src/test/java/org/apache/hadoop/examples/apptest.java file. It is not used in this example.
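
    For example, from a Unix-style shell the file can be removed as follows. (The casing of the generated file name can vary by archetype version, so adjust the path to match what was actually generated.)

    cd wordcountjava
    rm src/test/java/org/apache/hadoop/examples/apptest.java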

Add dependencies

  1. Edit the pom.xml file and add the following text inside the <dependencies> section:

     <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-mapreduce-examples</artifactId>
         <version>2.7.3</version>
         <scope>provided</scope>
     </dependency>
     <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-mapreduce-client-common</artifactId>
         <version>2.7.3</version>
         <scope>provided</scope>
     </dependency>
     <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-common</artifactId>
         <version>2.7.3</version>
         <scope>provided</scope>
     </dependency>
    

    This defines required libraries (listed within <artifactId>) with a specific version (listed within <version>). At compile time, these dependencies are downloaded from the default Maven repository. You can use the Maven repository search to view more.

    The <scope>provided</scope> entry tells Maven that these dependencies should not be packaged with the application, because they are provided by the HDInsight cluster at run time.

    Important

    The version used should match the version of Hadoop present on your cluster. For more information on versions, see the HDInsight component versioning document. A quick way to check the cluster's Hadoop version is sketched at the end of this section.

  2. Add the following to the pom.xml file. This text must be inside the <project>...</project> tags in the file; for example, between </dependencies> and </project>.

     <build>
         <plugins>
             <plugin>
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-shade-plugin</artifactId>
                 <version>2.3</version>
                 <configuration>
                     <transformers>
                         <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                         </transformer>
                     </transformers>
                 </configuration>
                 <executions>
                     <execution>
                         <phase>package</phase>
                         <goals>
                             <goal>shade</goal>
                         </goals>
                     </execution>
                 </executions>
             </plugin>
             <plugin>
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-compiler-plugin</artifactId>
                 <version>3.6.1</version>
                 <configuration>
                     <source>1.8</source>
                     <target>1.8</target>
                 </configuration>
             </plugin>
         </plugins>
     </build>
    

    The first plugin configures the Maven Shade Plugin, which is used to build an uberjar (sometimes called a fat jar) that contains the dependencies required by the application. It also prevents duplication of licenses within the jar package, which can cause problems on some systems.

    The second plugin configures the target Java version.

    Note

    HDInsight 3.4 and earlier use Java 7. HDInsight 3.5 and later use Java 8.

  3. Save the pom.xml file.
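
As noted in the Important block above, the dependency versions should match the Hadoop version on your cluster. One quick way to check, assuming you can already SSH to the cluster head node, is the standard hadoop version command:

    hadoop version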

Create the MapReduce application

  1. Go to the wordcountjava/src/main/java/org/apache/hadoop/examples directory and rename the App.java file to WordCount.java.

  2. Open the WordCount.java file in a text editor and replace the contents with the following text:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        // The mapper tokenizes each line of input and emits a (word, 1)
        // pair for every token it finds.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // The reducer (also used as the combiner) sums the counts emitted
        // for each word and writes a single (word, total) pair.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context
                               ) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            // Job.getInstance replaces the Job constructor, which is
            // deprecated in Hadoop 2.x.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    

    Notice that the package name is org.apache.hadoop.examples and the class name is WordCount. You use these names when you submit the MapReduce job.

  3. Save the file.
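
Before building the full package, you can optionally confirm that the code compiles. This is a plain Maven sanity check, run from the wordcountjava directory:

    mvn clean compile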

Build the application

  1. Change to the wordcountjava directory, if you are not already there.

  2. Use the following command to build a JAR file containing the application:

    mvn clean package
    

    This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.

  3. Once the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.

    Note

    The wordcountjava-1.0-SNAPSHOT.jar file is an uberjar, which contains not only the WordCount job, but also the dependencies that the job requires at runtime.
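
    If you want to inspect what was packaged, the jar tool that ships with the JDK can list the archive contents. Because the Hadoop dependencies use <scope>provided</scope>, you should see the WordCount classes but not the bulk of the Hadoop classes:

    jar tf target/wordcountjava-1.0-SNAPSHOT.jar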

Upload the jar

Use the following command to upload the jar file to the HDInsight head node:

scp target/wordcountjava-1.0-SNAPSHOT.jar USERNAME@CLUSTERNAME-ssh.azurehdinsight.cn:

Replace USERNAME with the SSH user name for the cluster, and replace CLUSTERNAME with the HDInsight cluster name.

This command copies the file from the local system to the head node. For more information, see Use SSH with HDInsight.
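
To confirm the upload succeeded, you can list the file over SSH. This sketch assumes the same USERNAME and CLUSTERNAME placeholders as above:

    ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.cn "ls -lh wordcountjava-1.0-SNAPSHOT.jar"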

Run the MapReduce job on Hadoop

  1. Connect to HDInsight using SSH. For more information, see Use SSH with HDInsight.

  2. From the SSH session, use the following command to run the MapReduce application:

    yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
    

    This command starts the WordCount MapReduce application. The input file is /example/data/gutenberg/davinci.txt, and the output directory is /example/data/wordcountout. Both the input file and the output are stored in the default storage for the cluster. (A cleanup sketch for re-running the job appears after the sample output below.)

  3. Once the job completes, use the following command to view the results:

    hdfs dfs -cat /example/data/wordcountout/*
    

    You should receive a list of words and counts, with values similar to the following text:

     zeal    1
     zelus   1
     zenith  2
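
    Two optional follow-ups, sketched here with standard HDFS shell commands: the output is not sorted by count, so you can pipe it through sort to find the most frequent words, and because the job fails if its output directory already exists, you must remove /example/data/wordcountout before re-running the job.

    # Show the ten most frequent words (the count is the second column).
    hdfs dfs -cat /example/data/wordcountout/* | sort -k2 -rn | head -n 10

    # Remove the output directory before re-running the job.
    hdfs dfs -rm -r /example/data/wordcountout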
    

Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.

For more information, see the Java Developer Center.