What is Apache Hadoop in Azure HDInsight?

Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop ecosystem includes related software and utilities, including Apache Hive, Apache HBase, Spark, Kafka, and many others.

Azure HDInsight is a fully managed, full-spectrum, open-source analytics service in the cloud for enterprises. The Apache Hadoop cluster type in Azure HDInsight allows you to use HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.

To see available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

What is MapReduce?

Apache Hadoop MapReduce is a software framework for writing jobs that process vast amounts of data. Input data is split into independent chunks. Each chunk is processed in parallel across the nodes in your cluster. A MapReduce job consists of two functions:

  • Mapper: Consumes input data, analyzes it (usually with filter and sorting operations), and emits tuples (key-value pairs)

  • Reducer: Consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data

A basic word count MapReduce job example is illustrated in the following diagram:

(Diagram: HDI.WordCountDiagram — word count MapReduce job flow)

The output of this job is a count of how many times each word occurred in the text.

  • The mapper takes each line from the input text as an input and breaks it into words. Each time a word occurs, the mapper emits a key/value pair consisting of the word followed by a 1. The output is sorted before it is sent to the reducer.
  • The reducer sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.
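The mapper, sort, and reducer steps described above can be sketched outside Hadoop as plain Python functions. This is an illustrative simulation of the programming model, not Hadoop API code; the function names are made up for the example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) key/value pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the individual counts for a single word.
    return (word, sum(counts))

def word_count(lines):
    # Map phase: emit (word, 1) pairs for every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: sort pairs by key so equal words are adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one summed pair per distinct word.
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(word_count(["the quick brown fox", "the lazy dog"]))
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

On a real cluster, Hadoop runs the map and reduce phases on different nodes and performs the sort (shuffle) between them; the in-process `sort` call here stands in for that step.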

MapReduce can be implemented in various languages. Java is the most common implementation, and it is used for demonstration purposes in this document.

Development languages

Languages or frameworks that are based on Java and the Java Virtual Machine can be run directly as MapReduce jobs. The example used in this document is a Java MapReduce application. Non-Java languages, such as C#, Python, or standalone executables, must use Hadoop streaming.

Hadoop streaming communicates with the mapper and reducer over STDIN and STDOUT. The mapper and reducer read data one line at a time from STDIN and write the output to STDOUT. Each line read or emitted by the mapper and reducer must be in the format of a key/value pair, delimited by a tab character:

[key]\t[value]

For more information, see Hadoop Streaming.

For examples of using Hadoop streaming with HDInsight, see the following document:

Next steps