Cluster Resource Manager architecture overview

The Service Fabric Cluster Resource Manager is a central service that runs in the cluster. It manages the desired state of the services in the cluster, particularly with respect to resource consumption and any placement rules.

To manage the resources in your cluster, the Service Fabric Cluster Resource Manager must have several pieces of information:

  • Which services currently exist
  • Each service's current (or default) resource consumption
  • The remaining cluster capacity
  • The capacity of the nodes in the cluster
  • The amount of resources consumed on each node
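
As a rough illustration of how these pieces of information relate, here is a minimal Python sketch. It is not the Service Fabric API: the `Node` class, the `remaining_capacity` helper, and the "MemoryInMb" metric name are hypothetical, chosen only to show that remaining cluster capacity can be derived from node capacities and per-node consumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical model of a node: capacity and consumption per metric."""
    name: str
    capacity: Dict[str, int]
    consumed: Dict[str, int] = field(default_factory=dict)


def remaining_capacity(nodes: List[Node], metric: str) -> int:
    """Sum of (capacity - consumed) for one metric across all nodes."""
    return sum(
        n.capacity.get(metric, 0) - n.consumed.get(metric, 0) for n in nodes
    )


# Illustrative values only.
nodes = [
    Node("Node1", {"MemoryInMb": 4096}, {"MemoryInMb": 1024}),
    Node("Node2", {"MemoryInMb": 8192}, {"MemoryInMb": 2048}),
]
print(remaining_capacity(nodes, "MemoryInMb"))  # prints 9216
```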

The resource consumption of a given service can change over time, and services usually care about more than one type of resource. Across different services, both physical and logical resources may be measured. Services may track physical metrics like memory and disk consumption. More commonly, services may care about logical metrics, such as "WorkQueueDepth" or "TotalRequests". Both logical and physical metrics can be used in the same cluster. Metrics can be shared across many services or be specific to a particular service.

Other considerations

The owners and operators of the cluster can be different from the service and application authors, or at a minimum are the same people wearing different hats. When you develop your application, you know a few things about what it requires. You have an estimate of the resources it will consume and how different services should be deployed. For example, the web tier needs to run on nodes exposed to the Internet, while the database services should not. As another example, the web services are probably constrained by CPU and network, while the data tier services care more about memory and disk consumption. However, the person who handles a live-site incident for that service in production, or who manages an upgrade to the service, has a different job to do and requires different tools.

Both the cluster and services are dynamic:

  • The number of nodes in the cluster can grow and shrink
  • Nodes of different sizes and types can come and go
  • Services can be created, removed, and change their desired resource allocations and placement rules
  • Upgrades or other management operations can roll through the cluster at the application or infrastructure levels
  • Failures can happen at any time

Cluster Resource Manager components and data flow

The Cluster Resource Manager has to track the requirements of each service and the consumption of resources by each service object within those services. The Cluster Resource Manager has two conceptual parts: agents that run on each node and a fault-tolerant service. The agents on each node track load reports from services, aggregate them, and periodically report them. The Cluster Resource Manager service aggregates all the information from the local agents and reacts based on its current configuration.
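
The two-level aggregation described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the actual Service Fabric implementation: the `NodeAgent` and `ResourceManagerService` classes and their methods are hypothetical, and the sketch assumes that an agent keeps only the latest value each service reported for a metric and sums those values per node.

```python
from collections import defaultdict


class NodeAgent:
    """Hypothetical per-node agent that collects load reports from services."""

    def __init__(self, node_name):
        self.node_name = node_name
        self._latest = {}  # (service, metric) -> most recent reported load

    def report_load(self, service, metric, value):
        # Assumption: a newer report replaces the older one for that service.
        self._latest[(service, metric)] = value

    def aggregate(self):
        """Total load per metric on this node, as sent in a periodic report."""
        totals = defaultdict(int)
        for (_service, metric), value in self._latest.items():
            totals[metric] += value
        return dict(totals)


class ResourceManagerService:
    """Hypothetical central service that aggregates reports from all agents."""

    def __init__(self):
        self.node_loads = {}  # node name -> {metric: total load}

    def receive(self, node_name, node_report):
        self.node_loads[node_name] = node_report

    def cluster_load(self, metric):
        return sum(r.get(metric, 0) for r in self.node_loads.values())


agent1 = NodeAgent("Node1")
agent1.report_load("ServiceA", "WorkQueueDepth", 10)
agent1.report_load("ServiceB", "WorkQueueDepth", 5)
agent1.report_load("ServiceA", "WorkQueueDepth", 12)  # replaces the earlier 10

agent2 = NodeAgent("Node2")
agent2.report_load("ServiceC", "WorkQueueDepth", 8)

manager = ResourceManagerService()
for agent in (agent1, agent2):
    manager.receive(agent.node_name, agent.aggregate())

print(manager.cluster_load("WorkQueueDepth"))  # prints 25
```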

Let's look at the following diagram:

[Diagram: Resource Balancer architecture]

During runtime, there are many changes that could happen. For example, let's say the amount of resources some services consume changes, some services fail, and some nodes join and leave the cluster. All the changes on a node are aggregated and periodically sent to the Cluster Resource Manager service (1, 2), where they are aggregated again, analyzed, and stored. Every few seconds, the service looks at the changes and determines whether any actions are necessary (3). For example, it could notice that some empty nodes have been added to the cluster. As a result, it decides to move some services to those nodes. The Cluster Resource Manager could also notice that a particular node is overloaded, or that certain services have failed or been deleted, freeing up resources elsewhere.

Let's look at the following diagram and see what happens next. Let's say that the Cluster Resource Manager determines that changes are necessary. It coordinates with other system services (in particular the Failover Manager) to make the necessary changes. Then the necessary commands are sent to the appropriate nodes (4). For example, let's say the Cluster Resource Manager noticed that Node5 was overloaded, and so decided to move service B from Node5 to Node4. At the end of the reconfiguration (5), the cluster looks like this:
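
To make the Node5 example concrete, here is a small Python sketch of one way such a decision could be reached. It is a naive greedy heuristic for illustration only and is not the algorithm the Cluster Resource Manager actually uses; the `plan_move` function, the single-metric model, and all load and capacity values are hypothetical.

```python
def plan_move(capacity, placement, metric_load):
    """Return (service, source, target) for one move, or None.

    capacity:    node name -> capacity for a single metric
    placement:   node name -> list of service names on that node
    metric_load: service name -> load for that metric
    """
    def node_load(node):
        return sum(metric_load[s] for s in placement[node])

    overloaded = [n for n in capacity if node_load(n) > capacity[n]]
    if not overloaded:
        return None
    source = overloaded[0]
    # Naive choice: move the heaviest service off the overloaded node.
    service = max(placement[source], key=lambda s: metric_load[s])
    # A target must be able to absorb the service without becoming overloaded.
    candidates = [
        n for n in capacity
        if n != source and node_load(n) + metric_load[service] <= capacity[n]
    ]
    if not candidates:
        return None
    # Prefer the candidate with the most free capacity.
    target = max(candidates, key=lambda n: capacity[n] - node_load(n))
    return service, source, target


# Illustrative values only: Node5 carries 120 units against a capacity of 100.
capacity = {"Node4": 100, "Node5": 100}
placement = {"Node4": ["A"], "Node5": ["B", "C"]}
metric_load = {"A": 20, "B": 70, "C": 50}

print(plan_move(capacity, placement, metric_load))  # prints ('B', 'Node5', 'Node4')
```

In the real system, the resulting move would be coordinated with the Failover Manager rather than applied directly, as described above.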

[Diagram: Resource Balancer architecture]

Next steps

  • The Cluster Resource Manager has many options for describing the cluster. To find out more about them, check out this article on describing a Service Fabric cluster
  • The Cluster Resource Manager's primary duties are rebalancing the cluster and enforcing placement rules. For more information on configuring these behaviors, see balancing your Service Fabric cluster