Use MapReduce with Apache Hadoop on HDInsight with SSH

Learn how to submit MapReduce jobs to HDInsight from a Secure Shell (SSH) connection.

Note

If you are already familiar with Linux-based Apache Hadoop servers but are new to HDInsight, see Linux-based HDInsight tips.

Prerequisites

An Apache Hadoop cluster on HDInsight. See Create Apache Hadoop clusters using the Azure portal.

Use Hadoop commands

  1. Use the ssh command to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. After you are connected to the HDInsight cluster, use the following command to start a MapReduce job:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/data/WordCountOutput
    

    This command starts the wordcount class, which is contained in the hadoop-mapreduce-examples.jar file. It uses the /example/data/gutenberg/davinci.txt document as input and stores the output at /example/data/WordCountOutput.

    Note

    For more information about this MapReduce job and the example data, see Use MapReduce in Apache Hadoop on HDInsight.

  3. The job emits details as it runs, and it returns information similar to the following text when it completes:

    File Input Format Counters
    Bytes Read=1395666
    File Output Format Counters
    Bytes Written=337623
    
  4. When the job completes, use the following command to list the output files:

    hdfs dfs -ls /example/data/WordCountOutput
    

    This command displays two files, _SUCCESS and part-r-00000. The part-r-00000 file contains the output for this job.

    Note

    Some MapReduce jobs may split the results across multiple part-r-##### files. If so, the ##### suffix indicates the order of the files.

  5. To view the output, use the following command:

    hdfs dfs -cat /example/data/WordCountOutput/part-r-00000
    

    This command displays a list of the words contained in the wasb://example/data/gutenberg/davinci.txt file and the number of times each word occurred. The following text is an example of the data contained in the file:

    wreathed        3
    wreathing       1
    wreaths         1
    wrecked         3
    wrenching       1
    wretched        6
    wriggling       1
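
The raw output is ordered by word, so the most frequent words are not easy to spot. As a minimal sketch, the helper below ranks wordcount output by count. The function name top_words is hypothetical (it is not part of Hadoop or HDInsight), and the sketch assumes each line holds a word followed by whitespace and a count, as in the sample above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Hadoop or HDInsight.
# Reads wordcount output ("word<whitespace>count" per line) on stdin and
# prints the N most frequent words, highest count first (default N = 10).
top_words() {
    local n="${1:-10}"
    # Sort by the second field numerically in descending order, break ties
    # alphabetically on the word, then keep only the first N lines.
    # LC_ALL=C keeps the tie-break order independent of the user's locale.
    LC_ALL=C sort -k2,2nr -k1,1 | head -n "$n"
}
```

From the SSH session, the output of the job could then be piped through it, for example `hdfs dfs -cat '/example/data/WordCountOutput/part-r-*' | top_words 5`. The quoted glob is passed to HDFS, so the same command also covers jobs that split their results across several part-r-##### files.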
    

Next steps

As you can see, Hadoop commands provide an easy way to run MapReduce jobs on an HDInsight cluster and then view the job output. For information about other ways you can work with Hadoop on HDInsight: