Develop Java MapReduce programs for Apache Hadoop on HDInsight

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

Prerequisites

Configure the development environment

The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.

From a command prompt, enter the following commands to create a working environment:

IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI

Create a Maven project

  1. Enter the following command to create a Maven project named wordcountjava:

    mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    

    This command creates a directory with the name specified by the artifactID parameter (wordcountjava in this example). This directory contains the following items:

    • pom.xml - The Project Object Model (POM) that contains information and configuration details used to build the project.
    • src\main\java\org\apache\hadoop\examples: Contains your application code.
    • src\test\java\org\apache\hadoop\examples: Contains tests for your application.
  2. Remove the generated example code. Delete the generated test and application files AppTest.java and App.java by entering the following commands:

    cd wordcountjava
    DEL src\main\java\org\apache\hadoop\examples\App.java
    DEL src\test\java\org\apache\hadoop\examples\AppTest.java
    

Update the Project Object Model

For a full reference of the pom.xml file, see https://maven.apache.org/pom.html. Open pom.xml by entering the following command:

notepad pom.xml

Add dependencies

pom.xml<dependencies> 节中添加以下文本:In pom.xml, add the following text in the <dependencies> section:

 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-mapreduce-examples</artifactId>
     <version>2.7.3</version>
     <scope>provided</scope>
 </dependency>
 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-mapreduce-client-common</artifactId>
     <version>2.7.3</version>
     <scope>provided</scope>
 </dependency>
 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-common</artifactId>
     <version>2.7.3</version>
     <scope>provided</scope>
 </dependency>
This defines required libraries (listed within `<artifactId>`) with a specific version (listed within `<version>`). At compile time, these dependencies are downloaded from the default Maven repository. You can use the [Maven repository search](https://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-mapreduce-examples%7C2.5.1%7Cjar) to view more.

The `<scope>provided</scope>` tells Maven that these dependencies should not be packaged with the application, as they are provided by the HDInsight cluster at run-time.

> [!IMPORTANT]
> The version used should match the version of Hadoop present on your cluster. For more information on versions, see the [HDInsight component versioning](../hdinsight-component-versioning.md) document.
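
If you're not sure which Hadoop version your cluster provides, one way to check from Java is the VersionInfo utility in hadoop-common. This is a minimal sketch, separate from the WordCount application; the class name PrintHadoopVersion is illustrative:

```java
import org.apache.hadoop.util.VersionInfo;

public class PrintHadoopVersion {
    public static void main(String[] args) {
        // Prints the version of the Hadoop libraries on the classpath.
        // Run on a cluster head node, this reports the version the cluster
        // provides at run time, which the <version> values in pom.xml should match.
        System.out.println(VersionInfo.getVersion());
    }
}
```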

Build configuration

Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.

Add the following code to the pom.xml file, and then save and close the file. This text must be inside the `<project>...</project>` tags in the file, for example, between `</dependencies>` and `</project>`.

 <build>
     <plugins>
         <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-shade-plugin</artifactId>
             <version>2.3</version>
             <configuration>
                 <transformers>
                     <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                     </transformer>
                 </transformers>
             </configuration>
             <executions>
                 <execution>
                     <phase>package</phase>
                     <goals>
                         <goal>shade</goal>
                     </goals>
                 </execution>
             </executions>
         </plugin>
         <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-compiler-plugin</artifactId>
             <version>3.6.1</version>
             <configuration>
                 <source>1.8</source>
                 <target>1.8</target>
             </configuration>
         </plugin>
     </plugins>
 </build>

This section configures the Apache Maven Compiler Plugin and the Apache Maven Shade Plugin. The compiler plug-in is used to compile the application. The shade plug-in prevents license duplication in the JAR package that is built by Maven; duplicated license files cause a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the ApacheLicenseResourceTransformer implementation prevents this error.

The maven-shade-plugin also produces an uber jar that contains all the dependencies required by the application.

Save the pom.xml file.

Create the MapReduce application

  1. Enter the following command to create and open a new file, WordCount.java. Select Yes at the prompt to create a new file.

    notepad src\main\java\org\apache\hadoop\examples\WordCount.java
    
  2. Copy and paste the Java code below into the new file, and then close the file.

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        // Mapper: tokenizes each input line and emits (word, 1) for every token.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text,IntWritable,Text,IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context
                               ) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            // Job.getInstance replaces the deprecated Job(Configuration, String) constructor.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    

    Notice that the package name is org.apache.hadoop.examples and the class name is WordCount. You use these names when you submit the MapReduce job. If you want to smoke-test the job before packaging it, see the optional local-runner sketch that follows this list.
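
Optionally, you can smoke-test the job with Hadoop's local job runner, which executes the entire MapReduce pipeline in a single JVM against the local file system. This is a minimal sketch under the assumption that the Hadoop dependencies are available on the classpath when you run it; the class name LocalWordCountTest and the file paths are illustrative:

```java
package org.apache.hadoop.examples;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCountTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run in-process instead of on a YARN cluster, against the local file system.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "word count (local test)");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("input.txt"));   // any local text file
        FileOutputFormat.setOutputPath(job, new Path("local-out")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```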

Build and package the application

From the wordcountjava directory, use the following command to build a JAR file that contains the application:

mvn clean package

This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.

Once the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.

> [!NOTE]
> The wordcountjava-1.0-SNAPSHOT.jar file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.
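
To confirm what the shade plugin packed into the uber jar, you can list its entries with the standard java.util.jar API. This is a minimal sketch; the jar path assumes you run it from the wordcountjava directory, and the class name ListJarEntries is illustrative:

```java
import java.util.jar.JarFile;

public class ListJarEntries {
    public static void main(String[] args) throws Exception {
        try (JarFile jar = new JarFile("target/wordcountjava-1.0-SNAPSHOT.jar")) {
            // Print the bundled class files. Dependencies marked "provided"
            // in pom.xml are supplied by the cluster and should not appear here.
            jar.stream()
               .map(entry -> entry.getName())
               .filter(name -> name.endsWith(".class"))
               .forEach(System.out::println);
        }
    }
}
```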

Upload the JAR and run jobs (SSH)

The following steps use scp to copy the JAR to the primary head node of your Apache Hadoop on HDInsight cluster. The ssh command is then used to connect to the cluster and run the example directly on the head node.

  1. Upload the jar to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

    scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:
    
  2. Connect to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    
  3. From the SSH session, use the following command to run the MapReduce application:

    yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
    

    This command starts the WordCount MapReduce application. The input file is /example/data/gutenberg/davinci.txt, and the output directory is /example/data/wordcountout. Both the input file and the output are stored in the default storage for the cluster. The job fails if the output directory already exists.

  4. Once the job completes, use the following command to view the results:

    hdfs dfs -cat /example/data/wordcountout/*
    

    You should receive a list of words and counts, with values similar to the following text (a programmatic way to read the same output is sketched after this list):

    zeal    1
    zelus   1
    zenith  2
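
If you'd rather consume the output programmatically, for example from a follow-up job, the same files can be read through the Hadoop FileSystem API. This is a minimal sketch, assuming it runs on the cluster so that the default file system resolves to the cluster's default storage; the class name ReadWordCountOutput is illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Reducer output files are named part-r-00000, part-r-00001, and so on.
        for (FileStatus status : fs.listStatus(new Path("/example/data/wordcountout"))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue; // skip the _SUCCESS marker
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // e.g. "zenith\t2"
                }
            }
        }
    }
}
```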
    

Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.

For more information, see also the Java Developer Center.