Cluster Resource Manager architecture overview

The Service Fabric Cluster Resource Manager is a central service that runs in the cluster. It manages the desired state of the services in the cluster, particularly with respect to resource consumption and any placement rules.

To manage the resources in your cluster, the Service Fabric Cluster Resource Manager must have several pieces of information:

  • Which services currently exist
  • Each service's current (or default) resource consumption
  • The remaining cluster capacity
  • The capacity of the nodes in the cluster
  • The amount of resources consumed on each node
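As a rough illustration of how these pieces of information relate, the following Python sketch derives the remaining cluster capacity from per-node capacity and per-node consumption. The `Node` class, the `MemoryInMb` metric name, and all numbers are hypothetical and are not part of the Service Fabric API; this is a minimal model, not how the Cluster Resource Manager is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of one node: capacity and consumed load per metric."""
    name: str
    capacity: dict[str, int]
    used: dict[str, int] = field(default_factory=dict)

    def remaining(self, metric: str) -> int:
        # Capacity left on this node for the given metric.
        return self.capacity.get(metric, 0) - self.used.get(metric, 0)


def remaining_cluster_capacity(nodes: list[Node], metric: str) -> int:
    # Remaining cluster capacity is the sum of what is left on each node.
    return sum(node.remaining(metric) for node in nodes)


# Made-up example data for illustration only.
nodes = [
    Node("Node1", {"MemoryInMb": 4096}, {"MemoryInMb": 1024}),
    Node("Node2", {"MemoryInMb": 4096}, {"MemoryInMb": 3072}),
]
print(remaining_cluster_capacity(nodes, "MemoryInMb"))
```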

The resource consumption of a given service can change over time, and services usually care about more than one type of resource. Across different services, there may be both real physical and logical resources being measured. Services may track physical metrics like memory and disk consumption. More commonly, services may care about logical metrics, such as "WorkQueueDepth" or "TotalRequests". Both logical and physical metrics can be used in the same cluster. Metrics can be shared across many services or be specific to a particular service.
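To make the distinction concrete, here is a small Python sketch in which some metrics are shared across services and others are specific to one service. The service names, the `MemoryInMb` and `DiskInMb` metric names, and the load values are assumptions made up for this example; only "WorkQueueDepth" and "TotalRequests" come from the text above.

```python
from collections import defaultdict

# Hypothetical load reports: service name -> {metric name -> reported load}.
service_loads = {
    "WebFrontend": {"TotalRequests": 1200, "MemoryInMb": 512},
    "OrderQueue": {"WorkQueueDepth": 40, "MemoryInMb": 256},
    "Database": {"MemoryInMb": 2048, "DiskInMb": 10240},
}


def total_load_per_metric(loads):
    """Sum the reported load for each metric across all services."""
    totals = defaultdict(int)
    for metrics in loads.values():
        for metric, value in metrics.items():
            totals[metric] += value
    return dict(totals)


totals = total_load_per_metric(service_loads)
# "MemoryInMb" is shared by all three services; "WorkQueueDepth" is
# reported by only one of them.
print(totals)
```

Physical and logical metrics are handled identically here: both are just named numbers that services report.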

Other considerations

The owners and operators of the cluster can be different from the service and application authors, or at a minimum are the same people wearing different hats. When you develop your application, you know a few things about what it requires. You have an estimate of the resources it will consume and how different services should be deployed. For example, the web tier needs to run on nodes exposed to the Internet, while the database services should not. As another example, the web services are probably constrained by CPU and network, while the data tier services care more about memory and disk consumption. However, the person handling a live-site incident for that service in production, or managing an upgrade to the service, has a different job to do and requires different tools.

Both the cluster and services are dynamic:

  • The number of nodes in the cluster can grow and shrink
  • Nodes of different sizes and types can come and go
  • Services can be created, removed, and change their desired resource allocations and placement rules
  • Upgrades or other management operations can roll through the cluster at the application or infrastructure levels
  • Failures can happen at any time

Cluster Resource Manager components and data flow

The Cluster Resource Manager has to track the requirements of each service and the consumption of resources by each service object within those services. The Cluster Resource Manager has two conceptual parts: agents that run on each node and a fault-tolerant service. The agents on each node track load reports from services, aggregate them, and periodically report them. The Cluster Resource Manager service aggregates all the information from the local agents and reacts based on its current configuration.
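The two-part split can be sketched in Python as follows. The `NodeAgent` and `ResourceManagerService` classes and their methods are hypothetical names invented for this illustration, and the aggregation rule (keep only the most recent load per service and metric) is an assumption; the actual components may aggregate differently.

```python
class NodeAgent:
    """Hypothetical per-node agent that buffers load reports from services."""

    def __init__(self, node_name):
        self.node_name = node_name
        self._latest = {}  # (service, metric) -> most recent reported load

    def report_load(self, service, metric, value):
        # Assumed aggregation rule: a newer report replaces the older one.
        self._latest[(service, metric)] = value

    def flush(self):
        """Return the batched report for this node and clear the buffer."""
        batch = {"node": self.node_name, "loads": dict(self._latest)}
        self._latest.clear()
        return batch


class ResourceManagerService:
    """Hypothetical central service that merges batches from all agents."""

    def __init__(self):
        self.loads = {}  # node -> {(service, metric) -> load}

    def receive(self, batch):
        self.loads.setdefault(batch["node"], {}).update(batch["loads"])

    def node_total(self, node, metric):
        # Total load for one metric on one node, across all services.
        node_loads = self.loads.get(node, {})
        return sum(v for (_, m), v in node_loads.items() if m == metric)


agent = NodeAgent("Node5")
agent.report_load("ServiceA", "MemoryInMb", 100)
agent.report_load("ServiceA", "MemoryInMb", 150)  # replaces the earlier report
agent.report_load("ServiceB", "MemoryInMb", 300)

manager = ResourceManagerService()
manager.receive(agent.flush())  # in practice this would happen periodically
print(manager.node_total("Node5", "MemoryInMb"))
```

Batching on the node keeps the central service from having to process every individual load report.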

Let's look at the following diagram:

Figure: Resource Balancer architecture

During runtime, there are many changes that could happen. For example, let's say the amount of resources some services consume changes, some services fail, and some nodes join and leave the cluster. All the changes on a node are aggregated and periodically sent to the Cluster Resource Manager service (1, 2), where they are aggregated again, analyzed, and stored. Every few seconds, that service looks at the changes and determines if any actions are necessary (3). For example, it could notice that some empty nodes have been added to the cluster. As a result, it decides to move some services to those nodes. The Cluster Resource Manager could also notice that a particular node is overloaded, or that certain services have failed or been deleted, freeing up resources elsewhere.
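The kind of check performed in step (3) might look like the following Python sketch. The function names, the single-metric view, and the capacity and load numbers are all assumptions for illustration; the real service evaluates many more conditions.

```python
def find_overloaded(capacity, load):
    """Nodes whose current load exceeds their capacity."""
    return sorted(n for n in capacity if load.get(n, 0) > capacity[n])


def find_empty(capacity, load):
    """Nodes that currently carry no load, for example newly added nodes."""
    return sorted(n for n in capacity if load.get(n, 0) == 0)


# Hypothetical snapshot of one metric across the cluster.
capacity = {"Node1": 100, "Node2": 100, "Node3": 100, "Node4": 100, "Node5": 100}
load = {"Node1": 60, "Node2": 60, "Node3": 60, "Node5": 130}  # Node4 just joined

overloaded = find_overloaded(capacity, load)
empty = find_empty(capacity, load)
needs_action = bool(overloaded or empty)
print(overloaded, empty, needs_action)
```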

Let's look at the following diagram and see what happens next. Let's say that the Cluster Resource Manager determines that changes are necessary. It coordinates with other system services (in particular the Failover Manager) to make the necessary changes. Then the necessary commands are sent to the appropriate nodes (4). For example, let's say the Cluster Resource Manager noticed that Node5 was overloaded, and so decided to move service B from Node5 to Node4. At the end of the reconfiguration (5), the cluster looks like this:
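The Node5 to Node4 example can be mimicked with a small greedy sketch in Python. The selection rule used here (move the smallest service that resolves the overload to the least-loaded node that can fit it) is an assumption chosen for simplicity, not the algorithm the Cluster Resource Manager uses, and the sketch ignores the coordination with the Failover Manager. Service names other than B, and all loads, are made up.

```python
def node_load(placement, node):
    return sum(placement[node].values())


def plan_move(placement, capacity, source):
    """Pick one service on `source` and a target node that can absorb it."""
    # Try services from smallest to largest load.
    for service, load in sorted(placement[source].items(), key=lambda kv: kv[1]):
        if node_load(placement, source) - load > capacity:
            continue  # moving this service alone would not fix the overload
        targets = [
            n for n in placement
            if n != source and node_load(placement, n) + load <= capacity
        ]
        if targets:
            target = min(targets, key=lambda n: node_load(placement, n))
            return service, source, target
    return None  # no single move resolves the overload


def apply_move(placement, move):
    service, source, target = move
    placement[target][service] = placement[source].pop(service)


# Hypothetical placement: node -> {service -> load}, with capacity 100 per node.
placement = {
    "Node4": {"A": 30},
    "Node5": {"B": 50, "C": 70},  # total 120, over capacity
}
move = plan_move(placement, capacity=100, source="Node5")
if move is not None:
    apply_move(placement, move)
print(move, placement)
```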

Figure: Resource Balancer architecture

Next steps

  • The Cluster Resource Manager has many options for describing the cluster. To find out more about them, check out this article on describing a Service Fabric cluster
  • The Cluster Resource Manager's primary duties are rebalancing the cluster and enforcing placement rules. For more information on configuring these behaviors, see balancing your Service Fabric cluster