Run the MapReduce examples included in HDInsight

Learn how to run the MapReduce examples included with Apache Hadoop on HDInsight.


The MapReduce examples

The samples are located on the HDInsight cluster at /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar. Source code for these samples is included on the cluster at /usr/hdp/current/hadoop-client/src/hadoop-mapreduce-project/hadoop-mapreduce-examples.

The following samples are contained in this archive:

Sample                  Description
aggregatewordcount      Counts the words in the input files.
aggregatewordhist       Computes the histogram of the words in the input files.
bbp                     Uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount                 Counts the pageview logs stored in a database.
distbbp                 Uses a BBP-type formula to compute exact bits of Pi.
grep                    Counts the matches of a regex in the input.
join                    Performs a join over sorted, equally partitioned datasets.
multifilewc             Counts words from several files.
pentomino               Tile-laying program that finds solutions to pentomino problems.
pi                      Estimates Pi using a quasi-Monte Carlo method.
randomtextwriter        Writes 10 GB of random textual data per node.
randomwriter            Writes 10 GB of random data per node.
secondarysort           Defines a secondary sort to the reduce phase.
sort                    Sorts the data written by the random writer.
sudoku                  A Sudoku solver.
teragen                 Generates data for the terasort.
terasort                Runs the terasort.
teravalidate            Checks the results of the terasort.
wordcount               Counts the words in the input files.
wordmean                Computes the average length of the words in the input files.
wordmedian              Computes the median length of the words in the input files.
wordstandarddeviation   Computes the standard deviation of the length of the words in the input files.

Run the wordcount example

  1. Connect to HDInsight using SSH. Replace CLUSTER with the name of your cluster, and then enter the following command:

    ssh sshuser@CLUSTER-ssh.azurehdinsight.cn
  2. From the SSH session, use the following command to list the samples:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar

    This command generates the list of samples from the previous section of this document.

  3. Use the following command to get help on a specific sample. In this case, the wordcount sample:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount

    You receive the following message:

    Usage: wordcount <in> [<in>...] <out>

    This message indicates that you can provide several input paths for the source documents. The final path is where the output (the count of words in the source documents) is stored.

  4. Use the following command to count all words in the Notebooks of Leonardo Da Vinci, which are provided as sample data with your cluster:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/data/davinciwordcount

    Input for this job is read from /example/data/gutenberg/davinci.txt. Output for this example is stored in /example/data/davinciwordcount. Both paths are located on the default storage for the cluster, not the local file system.


    As noted in the help for the wordcount sample, you can also specify multiple input files. For example, yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/data/gutenberg/ulysses.txt /example/data/twowordcount counts the words in both davinci.txt and ulysses.txt.

  5. Once the job completes, use the following command to view the output:

    hdfs dfs -cat /example/data/davinciwordcount/*

    This command concatenates all the output files produced by the job and displays them to the console. The output is similar to the following text:

    zum     1
    zur     1
    zwanzig 1
    zweite  1

    Each line represents a word and how many times it occurred in the input data.
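The wordcount logic itself is simple: the map phase emits a count of 1 for every word, and the reduce phase sums the counts per word. A minimal local sketch in Python (an illustration of the idea, not the Hadoop Java implementation) looks like this:

```python
from collections import Counter

def word_count(lines):
    # Map phase: emit (word, 1) for every whitespace-separated word.
    # Reduce phase: sum the counts per word (Counter does both here).
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

result = word_count(["the quick brown fox", "the lazy dog"])
print(result["the"])  # "the" appears twice in the input
```

The distributed job does the same thing, but the map and reduce phases run on separate cluster nodes and the shuffle groups each word's counts before they are summed.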

The Sudoku example

Sudoku is a logic puzzle made up of nine 3x3 grids. Some cells in the grid contain numbers, while others are blank; the goal of the puzzle is to solve for the blank cells. The input for this sample should be a file in the following format:

  • Nine rows of nine columns
  • Each column can contain either a number or ? (which indicates a blank cell)
  • Cells are separated by a space

Sudoku puzzles must be constructed in a certain way; a number can't repeat within a column or row. There's a properly constructed example on the HDInsight cluster. It is located at /usr/hdp/*/hadoop/src/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/dancing/puzzle1.dta and contains the following text:

8 5 ? 3 9 ? ? ? ?
? ? 2 ? ? ? ? ? ?
? ? 6 ? 1 ? ? ? 2
? ? 4 ? ? 3 ? 5 9
? ? 8 9 ? 1 4 ? ?
3 2 ? 4 ? ? 8 ? ?
9 ? ? ? 8 ? 5 ? ?
? ? ? ? ? ? 2 ? ?
? ? ? ? 4 5 ? 7 8
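A file in this format can be parsed and sanity-checked with a short Python sketch (the helper names here are hypothetical; the actual sample parses and solves the puzzle in Java):

```python
def parse_puzzle(text):
    # Parse a 9x9 grid: digits or '?' for blanks, cells separated by spaces.
    grid = [line.split() for line in text.strip().splitlines()]
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        raise ValueError("expected nine rows of nine columns")
    return grid

def is_well_formed(grid):
    # Per the construction rule above: a filled number must not repeat in
    # any row or column. (A full Sudoku check also covers the 3x3 boxes.)
    for i in range(9):
        row = [c for c in grid[i] if c != "?"]
        col = [grid[r][i] for r in range(9) if grid[r][i] != "?"]
        if len(row) != len(set(row)) or len(col) != len(set(col)):
            return False
    return True

puzzle = """\
8 5 ? 3 9 ? ? ? ?
? ? 2 ? ? ? ? ? ?
? ? 6 ? 1 ? ? ? 2
? ? 4 ? ? 3 ? 5 9
? ? 8 9 ? 1 4 ? ?
3 2 ? 4 ? ? 8 ? ?
9 ? ? ? 8 ? 5 ? ?
? ? ? ? ? ? 2 ? ?
? ? ? ? 4 5 ? 7 8"""
print(is_well_formed(parse_puzzle(puzzle)))  # True
```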

To run this example problem through the Sudoku example, use the following command:

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar sudoku /usr/hdp/*/hadoop/src/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/dancing/puzzle1.dta

The results appear similar to the following text:

8 5 1 3 9 2 6 4 7
4 3 2 6 7 8 1 9 5
7 9 6 5 1 4 3 8 2
6 1 4 8 2 3 7 5 9
5 7 8 9 6 1 4 2 3
3 2 9 4 5 7 8 1 6
9 4 7 2 8 6 5 3 1
1 8 5 7 3 9 2 6 4
2 6 3 1 4 5 9 7 8

Pi (π) example

The pi sample uses a statistical (quasi-Monte Carlo) method to estimate the value of pi. Points are placed at random in a unit square that also contains a circle. The probability that a point falls within the circle is equal to the area of the circle, pi/4. The value of pi can then be estimated as 4R, where R is the ratio of the number of points inside the circle to the total number of points within the square. The larger the sample of points used, the better the estimate.
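The method can be sketched in a few lines of Python. This is a local, pseudo-random illustration of the same idea, not the distributed Hadoop job (which, being quasi-Monte Carlo, draws its points from a low-discrepancy sequence rather than a pseudo-random generator):

```python
import random

def estimate_pi(num_points, seed=1):
    # Scatter points in the unit square; the fraction landing inside the
    # quarter circle (x^2 + y^2 <= 1) approaches pi/4 as points increase.
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

print(estimate_pi(100_000))  # close to 3.14; more points tighten the estimate
```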

Use the following command to run this sample. It uses 16 maps with 10,000,000 samples each to estimate the value of pi:

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 16 10000000

The value returned by this command is similar to 3.14159155000000000000. For reference, the first 10 decimal places of pi are 3.1415926535.

10 GB GraySort example

GraySort is a benchmark sort. The metric is the sort rate (TB/minute) achieved while sorting large amounts of data, usually a 100 TB minimum.

This sample uses a modest 10 GB of data so that it can be run relatively quickly. It uses the MapReduce applications developed by Owen O'Malley and Arun Murthy, which won the annual general-purpose ("daytona") terabyte sort benchmark in 2009 with a rate of 0.578 TB/min (100 TB in 173 minutes). For more information on this and other sorting benchmarks, see the Sortbenchmark site.

This sample uses three sets of MapReduce programs:

  • TeraGen: A MapReduce program that generates rows of data to sort

  • TeraSort: Samples the input data and uses MapReduce to sort the data into a total order

    TeraSort is a standard MapReduce sort, except for a custom partitioner. The partitioner uses a sorted list of N-1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i-1] <= key < sample[i] are sent to reduce i. This partitioner guarantees that the outputs of reduce i are all less than the outputs of reduce i+1.
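    The partitioning rule can be sketched as follows. This is a simplified in-memory illustration of the idea only; TeraSort's actual Java partitioner uses a trie over the sampled keys for speed:

    ```python
    import bisect

    def make_partitioner(sample_keys):
        # sample_keys: a sorted list of N-1 sampled keys for N reduce tasks.
        def partition(key):
            # All keys with sample[i-1] <= key < sample[i] go to reduce i;
            # bisect_right puts a key equal to sample[i-1] into range i.
            return bisect.bisect_right(sample_keys, key)
        return partition

    partition = make_partitioner(["g", "p"])  # splits keys across 3 reduces
    print(partition("apple"), partition("hat"), partition("zebra"))  # 0 1 2
    ```

    Because every key routed to reduce i sorts before every key routed to reduce i+1, concatenating the reduce outputs in order yields a globally sorted result.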

  • TeraValidate: A MapReduce program that validates that the output is globally sorted

    It creates one map per file in the output directory, and each map ensures that the keys within its file are in sorted order. The map function also generates records of the first and last keys of each file. The reduce function ensures that the first key of file i is greater than the last key of file i-1. Any problems are reported as output of the reduce phase, with the keys that are out of order.
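    The validation idea can be sketched like this (a simplified single-process illustration with hypothetical data shapes, not the actual Java implementation):

    ```python
    def validate_sorted(files):
        # files: one non-empty list of keys per output file, in file order.
        boundaries = []
        for keys in files:
            # "Map" step: verify each file is internally sorted and record
            # its first and last keys.
            for a, b in zip(keys, keys[1:]):
                if b < a:
                    return False  # out-of-order key within a file
            boundaries.append((keys[0], keys[-1]))
        # "Reduce" step: the first key of file i must not precede the last
        # key of file i-1.
        for (_, prev_last), (cur_first, _) in zip(boundaries, boundaries[1:]):
            if cur_first < prev_last:
                return False
        return True

    print(validate_sorted([["a", "b"], ["c", "d"]]))  # True
    print(validate_sorted([["a", "c"], ["b", "d"]]))  # False: "b" < "c"
    ```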

Use the following steps to generate data, sort it, and then validate the output:

  1. Generate 10 GB of data, which is stored to the HDInsight cluster's default storage at /example/data/10GB-sort-input:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=50 100000000 /example/data/10GB-sort-input

    The -Dmapred.map.tasks parameter tells Hadoop how many map tasks to use for this job. The final two parameters instruct the job to create 10 GB of data and to store it at /example/data/10GB-sort-input.

  2. Use the following command to sort the data:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25 /example/data/10GB-sort-input /example/data/10GB-sort-output

    The -Dmapred.reduce.tasks parameter tells Hadoop how many reduce tasks to use for the job. The final two parameters are the input and output locations for the data.

  3. Use the following command to validate the data generated by the sort:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate -Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25 /example/data/10GB-sort-output /example/data/10GB-sort-validate

Next steps

From this article, you learned how to run the samples included with Linux-based HDInsight clusters. For tutorials about using Pig, Hive, and MapReduce with HDInsight, see the following topics: