Monitor cluster performance in Azure HDInsight

Monitoring the health and performance of an HDInsight cluster is essential for maintaining optimal performance and resource utilization. Monitoring can also help you detect and address cluster configuration errors and user code issues.

The following sections describe how to monitor and optimize the load on your clusters and Apache Hadoop YARN queues, and how to detect storage throttling issues.

Monitor cluster load

Hadoop clusters deliver the best performance when the load on the cluster is evenly distributed across all the nodes. Even distribution lets processing tasks run without being constrained by RAM, CPU, or disk resources on individual nodes.

To get a high-level look at the nodes of your cluster and their load, sign in to the Ambari Web UI, then select the Hosts tab. Your hosts are listed by their fully qualified domain names. Each host's operating status is shown by a colored health indicator:

Color: Description
Red: At least one master component on the host is down. Hover to see a tooltip that lists the affected components.
Orange: At least one secondary component on the host is down. Hover to see a tooltip that lists the affected components.
Yellow: Ambari Server has not received a heartbeat from the host for more than 3 minutes.
Green: Normal running state.
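The indicator rules in the table above can be sketched as a small function. This is only an illustration, not Ambari's implementation: the input parameters are hypothetical, and the precedence (red before orange before yellow) is an assumption, since the source does not say which color wins when several conditions hold at once.

```python
from datetime import datetime, timedelta

# Threshold from the table: more than 3 minutes without a heartbeat.
HEARTBEAT_TIMEOUT = timedelta(minutes=3)


def host_indicator(master_down, secondary_down, last_heartbeat, now):
    """Return the indicator color for one host.

    master_down / secondary_down are counts of stopped components
    (hypothetical inputs). The precedence red > orange > yellow > green
    is an assumption made for this sketch.
    """
    if master_down > 0:
        return "red"
    if secondary_down > 0:
        return "orange"
    if now - last_heartbeat > HEARTBEAT_TIMEOUT:
        return "yellow"
    return "green"


now = datetime(2024, 1, 1, 12, 0, 0)
print(host_indicator(0, 0, now - timedelta(minutes=5), now))
```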

You'll also see columns showing the number of cores and amount of RAM for each host, along with the disk usage and load average.
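One rough way to judge whether load is evenly distributed is to compare each host's load average with the cluster mean. The sketch below assumes you have copied the load averages from the Hosts tab into a dictionary. The host names are hypothetical, and the 1.5x cutoff is an arbitrary assumption, not an Ambari or HDInsight setting.

```python
from statistics import mean, pstdev


def load_imbalance(load_by_host, hot_factor=1.5):
    """Return (coefficient of variation, hosts above hot_factor x mean load)."""
    loads = list(load_by_host.values())
    if not loads:
        return 0.0, []
    avg = mean(loads)
    if avg == 0:
        return 0.0, []
    cv = pstdev(loads) / avg  # 0 means perfectly even load
    hot = sorted(h for h, v in load_by_host.items() if v > hot_factor * avg)
    return cv, hot


# Hypothetical host names and load averages, entered by hand.
sample = {
    "wn0-example": 1.0,
    "wn1-example": 1.0,
    "wn2-example": 1.0,
    "wn3-example": 5.0,
}
cv, hot = load_imbalance(sample)
print(f"cv={cv:.2f} hot={hot}")
```

A coefficient of variation near zero suggests even load. A large value, or a non-empty list of hot hosts, suggests that some nodes are doing most of the work.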


Select any of the host names for a detailed look at the components running on that host and their metrics. The metrics are shown as a selectable timeline of CPU usage, load, disk usage, memory usage, network usage, and number of processes.


For details on setting alerts and viewing metrics, see Manage HDInsight clusters by using the Apache Ambari Web UI.

YARN queue configuration

Hadoop has various services running across its distributed platform. YARN (Yet Another Resource Negotiator) coordinates these services and allocates cluster resources to ensure that any load is evenly distributed across the cluster.

YARN divides the two responsibilities of the JobTracker, resource management and job scheduling/monitoring, into two daemons: a global Resource Manager and a per-application ApplicationMaster (AM).

The Resource Manager is a pure scheduler, and solely arbitrates available resources between all competing applications. The Resource Manager ensures that all resources are always in use, optimizing for various constraints such as SLAs, capacity guarantees, and so forth. The ApplicationMaster negotiates resources from the Resource Manager, and works with the NodeManager(s) to execute and monitor the containers and their resource consumption.

When multiple tenants share a large cluster, there is competition for the cluster's resources. The CapacityScheduler is a pluggable scheduler that assists in resource sharing by queueing up requests. The CapacityScheduler also supports hierarchical queues to ensure that resources are shared between the sub-queues of an organization before other applications' queues are allowed to use free resources.
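To make the idea of hierarchical queues concrete, the sketch below checks a CapacityScheduler configuration expressed as percentage capacities, where the child queues under each parent are expected to sum to 100. The property names follow the `yarn.scheduler.capacity.<queue-path>` pattern used in capacity-scheduler.xml. The queue names and values are hypothetical, and the check assumes every capacity is given as a percentage.

```python
PREFIX = "yarn.scheduler.capacity."


def check_capacities(props, parent="root"):
    """Return the parent queue paths whose children do not sum to 100."""
    children = props.get(f"{PREFIX}{parent}.queues", "")
    names = [c.strip() for c in children.split(",") if c.strip()]
    if not names:
        return []  # leaf queue, nothing to check
    total = sum(float(props[f"{PREFIX}{parent}.{n}.capacity"]) for n in names)
    bad = [] if abs(total - 100.0) < 1e-6 else [parent]
    for n in names:
        bad.extend(check_capacities(props, f"{parent}.{n}"))
    return bad


# Hypothetical layout: two departments, one of which has two sub-queues.
props = {
    "yarn.scheduler.capacity.root.queues": "engineering,marketing",
    "yarn.scheduler.capacity.root.engineering.capacity": "70",
    "yarn.scheduler.capacity.root.marketing.capacity": "30",
    "yarn.scheduler.capacity.root.engineering.queues": "etl,adhoc",
    "yarn.scheduler.capacity.root.engineering.etl.capacity": "60",
    "yarn.scheduler.capacity.root.engineering.adhoc.capacity": "30",
}
print(check_capacities(props))
```

In this sample the two sub-queues under `root.engineering` add up to 90, so that parent is reported.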

YARN lets you allocate resources to these queues and shows you whether all of your available resources are assigned. To view information about your queues, sign in to the Ambari Web UI, then select YARN Queue Manager from the top menu.

[Image: YARN Queue Manager]

The YARN Queue Manager page shows a list of your queues on the left, along with the percentage of capacity assigned to each.

[Image: YARN Queue Manager details page]

For a more detailed look at your queues, select the YARN service from the list on the left of the Ambari dashboard. Then, from the Quick Links dropdown menu, select Resource Manager UI under your active node.

[Image: Resource Manager UI menu link]

In the Resource Manager UI, select Scheduler from the left-hand menu. A list of your queues appears under Application Queues. Here you can see the capacity used for each of your queues, how well the jobs are distributed between them, and whether any jobs are resource-constrained.
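The same queue information can also be read programmatically. The sketch below assumes you have already fetched and parsed the JSON returned by the Resource Manager REST API's cluster scheduler endpoint (`/ws/v1/cluster/scheduler`), and walks it to find queues whose used capacity is at or above a threshold. The field names (`schedulerInfo`, `queues`, `queueName`, `usedCapacity`) reflect the CapacityScheduler response shape as I understand it; verify them against your Hadoop version. The sample data and the 90 percent threshold are hypothetical.

```python
def constrained_queues(scheduler_response, threshold=90.0):
    """Return [(queue name, usedCapacity)] for queues at or above threshold."""
    found = []

    def walk(queue):
        used = float(queue.get("usedCapacity", 0.0))
        if used >= threshold:
            found.append((queue["queueName"], used))
        # Parent queues nest their children under "queues" -> "queue".
        for child in (queue.get("queues") or {}).get("queue", []):
            walk(child)

    info = scheduler_response["scheduler"]["schedulerInfo"]
    for top_level in (info.get("queues") or {}).get("queue", []):
        walk(top_level)
    return found


# Hypothetical, heavily trimmed response for illustration only.
response = {
    "scheduler": {
        "schedulerInfo": {
            "type": "capacityScheduler",
            "queueName": "root",
            "queues": {
                "queue": [
                    {"queueName": "default", "capacity": 50.0, "usedCapacity": 97.5},
                    {"queueName": "thriftsvr", "capacity": 50.0, "usedCapacity": 10.0},
                ]
            },
        }
    }
}
print(constrained_queues(response))
```

A queue that stays near or above its capacity while others sit idle is a candidate for rebalancing in the YARN Queue Manager.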

[Image: Resource Manager UI menu]

Storage throttling

A cluster's performance bottleneck can occur at the storage level. This type of bottleneck is most often caused by blocking input/output (IO) operations, which happen when your running tasks send more IO than the storage service can handle. This blocking creates a queue of IO requests that wait until the current IOs are processed. The blocking results from storage throttling, which isn't a physical limit but a limit that the storage service imposes through a service level agreement (SLA). This limit ensures that no single client or tenant can monopolize the service. The SLA limits the number of IOs per second (IOPS) for Azure Storage. For details, see Scalability and performance targets for standard storage accounts.
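A back-of-the-envelope calculation shows how quickly a backlog builds once tasks send more IO than the limit allows. The sketch below assumes a constant offered rate and a fixed cap. All numbers are hypothetical placeholders, so look up the real target for your storage account in the scalability article mentioned above.

```python
def backlog_after(seconds, offered_iops, limit_iops):
    """Requests still waiting after `seconds` at a constant offered rate."""
    excess = max(0, offered_iops - limit_iops)  # only the overflow queues up
    return excess * seconds


# Hypothetical workload: 40 tasks issuing 600 IO/s each, against a
# placeholder cap of 20,000 IO/s (an assumption, not a quoted limit).
offered = 40 * 600
print(backlog_after(10, offered, 20_000))
```

When the offered rate stays under the cap, the backlog is zero and tasks aren't blocked. Above the cap, the queue grows linearly for as long as the overload lasts.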

If you're using Azure Storage, see Monitor, diagnose, and troubleshoot Microsoft Azure Storage for information on monitoring storage-related issues, including throttling.

Next steps

Visit the following links for more information about troubleshooting and monitoring your clusters: