Introducing the Service Fabric Cluster Resource Manager

Traditionally, managing IT systems or online services meant dedicating specific physical or virtual machines to those specific services or systems. Services were architected as tiers. There would be a "web" tier and a "data" or "storage" tier. Applications would have a messaging tier where requests flowed in and out, as well as a set of machines dedicated to caching. Each tier or type of workload had specific machines dedicated to it: the database got a couple of machines, the web servers a few. If a particular type of workload caused the machines it was on to run too hot, you added more machines with that same configuration to that tier. However, not all workloads could be scaled out so easily; with the data tier in particular, you would typically replace machines with larger machines. Easy. If a machine failed, that part of the overall application ran at lower capacity until the machine could be restored. Still fairly easy (if not necessarily fun).

Now, however, the world of service and software architecture has changed. It's more common for applications to adopt a scale-out design. Building applications with containers or microservices (or both) is common. While you may still have only a few machines, they're no longer running just a single instance of a workload. They may even be running multiple different workloads at the same time. You now have dozens of different types of services (none consuming a full machine's worth of resources), and perhaps hundreds of different instances of those services. Each named instance has one or more instances or replicas for High Availability (HA). Depending on the sizes of those workloads, and how busy they are, you may find yourself with hundreds or thousands of machines.

Suddenly, managing your environment is not as simple as managing a few machines dedicated to single types of workloads. Your servers are virtual and no longer have names (you have switched mindsets from pets to cattle, after all). Configuration is less about the machines and more about the services themselves. Hardware that is dedicated to a single instance of a workload is largely a thing of the past. Services themselves have become small distributed systems that span multiple smaller pieces of commodity hardware.

Because your app is no longer a series of monoliths spread across several tiers, you now have many more combinations to deal with. Who decides what types of workloads can run on which hardware, or how many? Which workloads work well on the same hardware, and which conflict? When a machine goes down, how do you know what was running on it? Who is in charge of making sure that workload starts running again? Do you wait for the (virtual?) machine to come back, or do your workloads automatically fail over to other machines and keep running? Is human intervention required? What about upgrades in this environment?

As developers and operators working in this environment, we're going to want help managing this complexity. A hiring binge and trying to hide the complexity with people is probably not the right answer, so what do we do?

Introducing orchestrators

An "Orchestrator" is the general term for a piece of software that helps administrators manage these types of environments. Orchestrators are the components that take in requests like "I would like five copies of this service running in my environment." They try to make the environment match the desired state, no matter what happens.

Orchestrators (not humans) are what take action when a machine fails or a workload terminates for some unexpected reason. Most orchestrators do more than just deal with failure. They also manage new deployments, handle upgrades, and deal with resource consumption and governance. All orchestrators are fundamentally about maintaining some desired state of configuration in the environment. You want to be able to tell an orchestrator what you want and have it do the heavy lifting. Aurora on top of Mesos, Docker Datacenter/Docker Swarm, Kubernetes, and Service Fabric are all examples of orchestrators. These orchestrators are being actively developed to meet the needs of real workloads in production environments.
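The desired-state idea described above can be sketched as a small reconciliation step. This is only an illustrative sketch, not how Service Fabric or any other orchestrator is implemented: the function name `reconcile`, the action tuples, and the "least busy node first" placement rule are all assumptions made for the example.

```python
from collections import Counter


def reconcile(desired, running, healthy_nodes):
    """Return the actions that would move `running` toward `desired`.

    desired:       dict of service name -> wanted number of copies
    running:       list of (service, node) pairs currently placed
    healthy_nodes: list of node names that are currently up
    All names here are hypothetical, chosen for illustration only.
    """
    healthy = set(healthy_nodes)
    # Copies that were on failed nodes no longer count toward the desired state.
    alive = [(svc, node) for svc, node in running if node in healthy]
    counts = Counter(svc for svc, _ in alive)
    load = {node: 0 for node in healthy_nodes}
    for _, node in alive:
        load[node] += 1

    actions = []
    for svc in sorted(desired):
        missing = desired[svc] - counts[svc]
        # Too few copies: start new ones on the least busy healthy node.
        for _ in range(max(missing, 0)):
            if not load:
                break  # no healthy node left to place on
            node = min(load, key=lambda n: (load[n], n))
            actions.append(("start", svc, node))
            load[node] += 1
        # Too many copies: stop the surplus.
        if missing < 0:
            extras = [node for s, node in alive if s == svc][:-missing]
            for node in extras:
                actions.append(("stop", svc, node))
                load[node] -= 1
    return actions


# "I would like five copies of this service", while node n3 is down.
print(reconcile({"web": 5},
                [("web", "n1"), ("web", "n2"), ("web", "n3")],
                ["n1", "n2"]))
```

A real orchestrator would run a step like this continuously, so that a failure observed at any moment results in corrective actions without a human noticing first.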

Orchestration as a service

The Cluster Resource Manager is the system component that handles orchestration in Service Fabric. The Cluster Resource Manager's job is broken down into three parts:

  1. Enforcing rules
  2. Optimizing your environment
  3. Helping with other processes

What it isn't

In traditional N-tier applications, there was always a load balancer. Usually this was a Network Load Balancer (NLB) or an Application Load Balancer (ALB), depending on where it sat in the networking stack. Some load balancers are hardware-based, like F5's BigIP offering; others are software-based, such as Microsoft's NLB. In other environments, you might see something like HAProxy, nginx, Istio, or Envoy in this role. In these architectures, the job of load balancing is to ensure that stateless workloads receive (roughly) the same amount of work. Strategies for balancing load varied. Some balancers would send each different call to a different server. Others provided session pinning/stickiness. More advanced balancers use actual load estimation or reporting to route a call based on its expected cost and the current machine load.
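The three routing strategies mentioned above can be illustrated with a minimal sketch. It does not reproduce the behavior of any particular load balancer product; the names `RoundRobin`, `sticky`, and `least_loaded`, and the use of a hash for session pinning, are assumptions made for the example.

```python
import hashlib
from itertools import cycle


class RoundRobin:
    """Send each successive call to the next server in turn."""

    def __init__(self, servers):
        self._next = cycle(servers)

    def route(self):
        return next(self._next)


def sticky(servers, session_id):
    """Session pinning: the same session id always maps to the same server."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def least_loaded(load_by_server, expected_cost):
    """Route to the server with the lowest reported load, then account
    for the expected cost of the call that was just routed there."""
    server = min(load_by_server, key=lambda s: (load_by_server[s], s))
    load_by_server[server] += expected_cost
    return server


servers = ["a", "b", "c"]
rr = RoundRobin(servers)
print([rr.route() for _ in range(4)])
print(sticky(servers, "session-42") == sticky(servers, "session-42"))
print(least_loaded({"a": 5, "b": 2}, expected_cost=4))
```

In all three cases the servers stay where they are and only the traffic is steered, which is the property the rest of this section contrasts with the Cluster Resource Manager.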

Network balancers or message routers tried to ensure that the web/worker tier remained roughly balanced. Strategies for balancing the data tier were different and depended on the data storage mechanism. Balancing the data tier relied on data sharding, caching, managed views, stored procedures, and other store-specific mechanisms.

While some of these strategies are interesting, the Service Fabric Cluster Resource Manager is not anything like a network load balancer or a cache. A network load balancer balances frontends by spreading traffic across them. The Service Fabric Cluster Resource Manager has a different strategy. Fundamentally, Service Fabric moves services to where they make the most sense, expecting traffic or load to follow. For example, it might move services to nodes that are currently cold because the services that are there are not doing much work. The nodes may also be cold because the services that were present were deleted or moved elsewhere. As another example, the Cluster Resource Manager could move a service away from a machine. Perhaps the machine is about to be upgraded, or is overloaded due to a spike in consumption by the services running on it. Alternatively, the service's resource requirements may have increased, so that there are no longer sufficient resources on this machine to continue running it.

Because the Cluster Resource Manager is responsible for moving services around, it contains a different feature set from what you would find in a network load balancer. Network load balancers deliver network traffic to where services already are, even if that location is not ideal for running the service itself. The Service Fabric Cluster Resource Manager employs fundamentally different strategies for ensuring that the resources in the cluster are efficiently utilized.
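To make the contrast concrete, the following sketch moves a service rather than routing traffic: it finds an overloaded node and proposes relocating one of its services to the coldest node with room. This is a toy heuristic under stated assumptions, not the Cluster Resource Manager's actual algorithm; `suggest_move`, the single abstract "load unit" metric, and the per-node capacity map are all hypothetical.

```python
def suggest_move(placement, service_load, capacity):
    """Propose one (service, source, target) move, or None if no node is overloaded.

    placement:    dict of service -> node it currently runs on
    service_load: dict of service -> load units it consumes
    capacity:     dict of node -> load units the node can hold
    All names and the single-metric model are assumptions for illustration.
    """
    load = {node: 0 for node in capacity}
    for svc, node in placement.items():
        load[node] += service_load[svc]

    overloaded = [n for n in capacity if load[n] > capacity[n]]
    if not overloaded:
        return None

    # Start with the node that exceeds its capacity by the most.
    source = max(overloaded, key=lambda n: (load[n] - capacity[n], n))
    # Try the heaviest services first; prefer the coldest target nodes.
    candidates = sorted((s for s, n in placement.items() if n == source),
                        key=lambda s: (-service_load[s], s))
    targets = sorted((n for n in capacity if n != source),
                     key=lambda n: (load[n], n))
    for svc in candidates:
        for target in targets:
            if load[target] + service_load[svc] <= capacity[target]:
                return (svc, source, target)
    return None


print(suggest_move(
    placement={"a": "n1", "b": "n1", "c": "n2"},
    service_load={"a": 6, "b": 5, "c": 2},
    capacity={"n1": 10, "n2": 10, "n3": 10},
))
```

Once the service has moved, traffic is expected to follow it to the new node, which is the reverse of a load balancer steering traffic toward fixed servers.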

Next steps

  • For information on the architecture and information flow within the Cluster Resource Manager, check out this article
  • The Cluster Resource Manager has many options for describing the cluster. To find out more about them, check out this article on describing a Service Fabric cluster
  • For more information on configuring services, see Learn about configuring Services
  • Metrics are how the Service Fabric Cluster Resource Manager manages consumption and capacity in the cluster. To learn more about metrics and how to configure them, check out this article
  • The Cluster Resource Manager works with Service Fabric's management capabilities. To find out more about that integration, read this article
  • To find out how the Cluster Resource Manager manages and balances load in the cluster, check out the article on balancing load