Apache Hadoop architecture in HDInsight

Apache Hadoop includes two core components: the Apache Hadoop Distributed File System (HDFS), which provides storage, and Apache Hadoop Yet Another Resource Negotiator (YARN), which provides processing. With storage and processing capabilities, a cluster can run MapReduce programs to perform the desired data processing.

Note

An HDFS is not typically deployed within the HDInsight cluster to provide storage. Instead, Hadoop components use an HDFS-compatible interface layer. The actual storage capability is provided by either Azure Storage or Azure Data Lake Storage. MapReduce jobs executing on the HDInsight cluster run as if an HDFS were present, so they require no changes to support their storage needs. In Hadoop on HDInsight, storage is outsourced, but YARN processing remains a core component. For more information, see Introduction to Azure HDInsight.

This article introduces YARN and how it coordinates the execution of applications on HDInsight.

Apache Hadoop YARN basics

YARN governs and orchestrates data processing in Hadoop. YARN has two core services that run as processes on nodes in the cluster:

  • ResourceManager
  • NodeManager

The ResourceManager grants cluster compute resources to applications such as MapReduce jobs. It grants these resources as containers, where each container consists of an allocation of CPU cores and RAM. If you combined all the resources available in a cluster and then distributed the cores and memory in blocks, each block of resources would be a container. Each node in the cluster has capacity for a certain number of containers, so the cluster has a fixed limit on the number of containers available. The allotment of resources in a container is configurable.
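The container arithmetic described above can be illustrated with a minimal sketch. The node and container sizes below are hypothetical values chosen for illustration, and the calculation is a simplification: it only shows that the scarcer resource bounds the number of containers per node, and does not model the additional rules a real YARN scheduler applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """A bundle of CPU cores and memory (MB); used for both nodes and containers."""
    vcores: int
    memory_mb: int


def containers_per_node(node: Resources, container: Resources) -> int:
    """Return how many containers of the given size fit on one node.

    Whichever resource runs out first (cores or memory) sets the limit.
    """
    if container.vcores <= 0 or container.memory_mb <= 0:
        raise ValueError("container size must be positive")
    return min(node.vcores // container.vcores,
               node.memory_mb // container.memory_mb)


def cluster_container_limit(nodes, container: Resources) -> int:
    """Return the fixed upper bound on containers across all nodes."""
    return sum(containers_per_node(n, container) for n in nodes)


# Hypothetical cluster: four worker nodes with 8 cores and 32 GB each.
workers = [Resources(vcores=8, memory_mb=32768)] * 4
small = Resources(vcores=1, memory_mb=2048)    # limited by cores
large = Resources(vcores=2, memory_mb=12288)   # limited by memory

print(cluster_container_limit(workers, small))  # 32
print(cluster_container_limit(workers, large))  # 8
```

Changing the container size changes the cluster-wide limit, which is why the allotment of resources per container is a tuning decision.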

When a MapReduce application runs on a cluster, the ResourceManager provides the application with the containers in which to execute. The ResourceManager tracks the status of running applications and the available cluster capacity, and tracks applications as they complete and release their resources.

The ResourceManager also runs a web server process that provides a web user interface to monitor the status of applications.
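Application status can also be inspected programmatically. The sketch below assumes a JSON response shaped like the one returned by the ResourceManager REST API's `/ws/v1/cluster/apps` resource, and counts applications by state. The sample payload is hand-written for illustration and does not come from a real cluster; fetching the response (URL, authentication) is left out because it depends on the cluster setup.

```python
import json
from collections import Counter

# Illustrative payload only: three made-up applications.
SAMPLE_RESPONSE = """
{
  "apps": {
    "app": [
      {"id": "application_0000000000001_0001", "name": "wordcount", "state": "FINISHED"},
      {"id": "application_0000000000001_0002", "name": "wordcount", "state": "FINISHED"},
      {"id": "application_0000000000001_0003", "name": "sort", "state": "RUNNING"}
    ]
  }
}
"""


def summarize_apps(payload: str) -> Counter:
    """Count applications per state in an apps-list response body."""
    doc = json.loads(payload)
    # Treat a missing or null "apps" value as "no applications".
    apps = (doc.get("apps") or {}).get("app") or []
    return Counter(app["state"] for app in apps)


print(summarize_apps(SAMPLE_RESPONSE))
```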

When a user submits a MapReduce application to run on the cluster, the application is submitted to the ResourceManager. In turn, the ResourceManager allocates a container on one of the available NodeManager nodes. The NodeManager nodes are where the application actually executes. The first container allocated runs a special application called the ApplicationMaster. The ApplicationMaster is responsible for acquiring the resources, in the form of subsequent containers, needed to run the submitted application. The ApplicationMaster examines the stages of the application, such as the map stage and the reduce stage, and factors in how much data needs to be processed. The ApplicationMaster then requests (negotiates) the resources from the ResourceManager on behalf of the application. The ResourceManager in turn grants resources from the NodeManagers in the cluster to the ApplicationMaster for it to use in executing the application.

The NodeManagers run the tasks that make up the application, then report their progress and status back to the ApplicationMaster. The ApplicationMaster in turn reports the status of the application back to the ResourceManager. The ResourceManager returns any results to the client.
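The submission flow can be sketched as a toy simulation. This is not the YARN API: the class names mirror the roles described above, but every method, the placement rule (most free capacity first), and the sequential task execution are assumptions made purely to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeManager:
    """A worker node with a fixed number of free container slots."""
    name: str
    free_containers: int


class ResourceManager:
    """Grants and reclaims containers on the NodeManager nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self) -> Optional[NodeManager]:
        # Hypothetical placement rule: pick the node with the most free slots.
        node = max(self.nodes, key=lambda n: n.free_containers)
        if node.free_containers == 0:
            return None
        node.free_containers -= 1
        return node

    def release(self, node: NodeManager) -> None:
        node.free_containers += 1


def run_application(rm: ResourceManager, tasks):
    """Run tasks one at a time, with the first container hosting the ApplicationMaster."""
    am_node = rm.allocate()
    if am_node is None:
        raise RuntimeError("no capacity for the ApplicationMaster")
    reports = []
    for task in tasks:
        node = rm.allocate()  # ApplicationMaster negotiates a container
        if node is None:
            rm.release(am_node)
            raise RuntimeError(f"no capacity for task {task}")
        reports.append(f"{task} finished on {node.name}")  # status reported back
        rm.release(node)  # container released when the task completes
    rm.release(am_node)
    return {"application_master": am_node.name,
            "task_reports": reports,
            "state": "FINISHED"}


nodes = [NodeManager("wn0", 2), NodeManager("wn1", 1)]
result = run_application(ResourceManager(nodes), ["map-0", "map-1", "reduce-0"])
print(result["state"], len(result["task_reports"]))
```

In a real cluster, tasks run in parallel across many containers; the sequential loop here only makes the order of allocation, reporting, and release easy to follow.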

YARN on HDInsight

All HDInsight cluster types deploy YARN. The ResourceManager is deployed for high availability with a primary and a secondary instance, which run on the first and second head nodes within the cluster, respectively. Only one instance of the ResourceManager is active at a time. The NodeManager instances run across the available worker nodes in the cluster.


Soft delete

To undelete a file from your storage account, see:

Azure Storage

Azure Data Lake Storage Gen2

Known issues with Azure Data Lake Storage Gen2

Trash purging

The fs.trash.interval property under HDFS > Advanced core-site should remain at its default value of 0 because you shouldn't store any data on the local file system. This value doesn't affect remote storage accounts (WASB, ADLS Gen1, ABFS).
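As a quick check, the setting can be read from a Hadoop configuration file. The sketch below assumes the standard Hadoop XML layout (`configuration` / `property` / `name` / `value`); the embedded fragment is a hand-written example, not a dump from a real cluster, and `get_property` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written example fragment in the usual core-site.xml layout.
CORE_SITE = """<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>0</value>
  </property>
</configuration>"""


def get_property(xml_text: str, name: str, default=None):
    """Return the value of a named property, or the default if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default


# fs.trash.interval is expressed in minutes; 0 disables the trash feature.
interval = int(get_property(CORE_SITE, "fs.trash.interval", default="0"))
print("trash disabled" if interval == 0 else f"trash kept for {interval} minutes")
```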

Next steps