Scaling Azure Service Fabric clusters

A Service Fabric cluster is a network-connected set of virtual or physical machines into which your microservices are deployed and managed. A machine or VM that's part of a cluster is called a node. Clusters can contain thousands of nodes. After creating a Service Fabric cluster, you can scale it horizontally (change the number of nodes) or vertically (change the resources of the nodes). You can scale the cluster at any time, even when workloads are running on it. As the cluster scales, your applications automatically scale as well.

Why scale the cluster? Application demands change over time. You may need to increase cluster resources to meet increased application workload or network traffic, or decrease cluster resources when demand drops.

Scaling in and out, or horizontal scaling

Horizontal scaling changes the number of nodes in the cluster. Once new nodes join the cluster, the Cluster Resource Manager moves services to them, which reduces load on the existing nodes. You can also decrease the number of nodes if the cluster's resources are not being used efficiently. As nodes leave the cluster, services move off those nodes and load increases on the remaining nodes. Reducing the number of nodes in a cluster running in Azure can save you money, since you pay for the number of VMs you use and not for the workload on those VMs.

  • Advantages: Infinite scale, in theory. If your application is designed for scalability, you can enable limitless growth by adding more nodes. The tooling in cloud environments makes it easy to add or remove nodes, so you can adjust capacity readily and pay only for the resources you use.
  • Disadvantages: Applications must be designed for scalability. Application databases and persistence may require additional architectural work to scale as well. Reliable collections in Service Fabric stateful services, however, make it much easier to scale your application data.

Virtual machine scale sets are an Azure compute resource that you can use to deploy and manage a collection of virtual machines as a set. Every node type that is defined in an Azure cluster is set up as a separate scale set. Each node type can then be scaled in or out independently, have different sets of ports open, and have different capacity metrics.

When scaling an Azure cluster, keep the following guidelines in mind:

  • Primary node types running production workloads should always have five or more nodes.
  • Non-primary node types running stateful production workloads should always have five or more nodes.
  • Non-primary node types running stateless production workloads should always have two or more nodes.
  • Any node type with a durability level of Gold or Silver should always have five or more nodes.
  • Do not remove random VM instances/nodes from a node type; always use the virtual machine scale set scale-down feature. Deleting random VM instances can adversely affect the system's ability to load balance properly.
  • If using autoscale rules, set the rules so that scaling in (removing VM instances) is done one node at a time. Scaling in by more than one instance at a time is not safe.
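
The node-count guidelines above can be expressed as a small pre-flight check to run before any scale-in. The Python sketch below is illustrative only: `NodeType`, `minimum_node_count`, and `can_scale_in` are hypothetical names introduced here, not part of any Azure or Service Fabric SDK, and the sketch assumes every node type runs production workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """Hypothetical description of a node type; not an Azure SDK type."""
    name: str
    is_primary: bool
    runs_stateful: bool
    durability: str  # e.g. "Silver" or "Gold"; any other value is treated as lower durability


def minimum_node_count(node_type: NodeType) -> int:
    """Return the smallest production node count the guidelines allow."""
    if node_type.is_primary:
        return 5
    if node_type.durability in ("Silver", "Gold"):
        return 5
    if node_type.runs_stateful:
        return 5
    # Non-primary node type running only stateless workloads.
    return 2


def can_scale_in(node_type: NodeType, current_count: int) -> bool:
    """True if removing one node keeps the node type at or above its minimum."""
    return current_count - 1 >= minimum_node_count(node_type)
```

The check removes only one node per call, which mirrors the guideline that scale-in should happen one node at a time.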

Since the Service Fabric node types in your cluster are backed by virtual machine scale sets, you can set up autoscale rules or manually scale each node type/virtual machine scale set.

Programmatic scaling

In many scenarios, scaling a cluster manually or with autoscale rules is a good solution. For more advanced scenarios, though, these approaches may not be the right fit. Potential drawbacks include:

  • Manual scaling requires you to log in and explicitly request scaling operations. If scaling operations are required frequently or at unpredictable times, this approach may not be a good solution.
  • When autoscale rules remove an instance from a virtual machine scale set, they do not automatically remove knowledge of that node from the associated Service Fabric cluster unless the node type has a durability level of Silver or Gold. Because autoscale rules work at the scale set level (rather than at the Service Fabric level), they can remove Service Fabric nodes without shutting them down gracefully. This abrupt node removal leaves "ghost" Service Fabric node state behind after scale-in operations. An individual (or a service) would need to periodically clean up removed node state in the Service Fabric cluster.
  • A node type with a durability level of Gold or Silver automatically cleans up removed nodes, so no additional cleanup is needed.
  • Although autoscale rules support many metrics, it is still a limited set. If your scenario calls for scaling based on a metric not covered in that set, autoscale rules may not be a good option.

How you should approach Service Fabric scaling depends on your scenario. If scaling is uncommon, the ability to add or remove nodes manually is probably sufficient. For more complex scenarios, autoscale rules and SDKs that expose the ability to scale programmatically offer powerful alternatives.

Azure APIs allow applications to work programmatically with virtual machine scale sets and Service Fabric clusters. If the existing autoscale options don't work for your scenario, these APIs make it possible to implement custom scaling logic.

One approach to implementing this "home-made" autoscaling functionality is to add a new stateless service to the Service Fabric application to manage scaling operations. Creating your own scaling service provides the highest degree of control and customizability over your application's scaling behavior. This can be useful for scenarios requiring precise control over when or how an application scales in or out. However, this control comes at the cost of code complexity. Using this approach means that you need to own the scaling code, which is non-trivial. Within the service's RunAsync method, a set of triggers can determine whether scaling is required (including checking parameters such as maximum cluster size and scaling cooldowns).
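
The trigger logic described above (load thresholds, a maximum cluster size, and a cooldown between operations) can be sketched independently of any SDK. The following Python example is a minimal sketch under stated assumptions: `ScalingPolicy`, `decide_node_delta`, and all threshold values are hypothetical, and a real scaling service would obtain the node count and load from the cluster rather than from function arguments.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScalingPolicy:
    """Hypothetical policy values; tune these for your own workload."""
    min_nodes: int = 5             # floor taken from the node-count guidelines
    max_nodes: int = 20            # assumed maximum cluster size
    scale_out_above: float = 0.75  # average load fraction that triggers scale-out
    scale_in_below: float = 0.25   # average load fraction that triggers scale-in
    cooldown: timedelta = timedelta(minutes=10)


def decide_node_delta(
    policy: ScalingPolicy,
    node_count: int,
    average_load: float,
    last_scaled_at: Optional[datetime],
    now: datetime,
) -> int:
    """Return +1 to add a node, -1 to remove one, or 0 to do nothing."""
    # Respect the cooldown so that consecutive operations do not overlap.
    if last_scaled_at is not None and now - last_scaled_at < policy.cooldown:
        return 0
    if average_load > policy.scale_out_above and node_count < policy.max_nodes:
        return 1
    # Scale in by a single node at a time, and never below the minimum.
    if average_load < policy.scale_in_below and node_count > policy.min_nodes:
        return -1
    return 0
```

A scaling service could call a function like this on each iteration of its main loop and apply the resulting delta to the scale set. Keeping the decision separate from the Azure calls makes it straightforward to unit test.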

The API used for virtual machine scale set interactions (both to check the current number of virtual machine instances and to modify it) is the fluent Azure Management Compute library. The fluent compute library provides an easy-to-use API for interacting with virtual machine scale sets. To interact with the Service Fabric cluster itself, use System.Fabric.FabricClient.

The scaling code doesn't need to run as a service in the cluster being scaled, though. Both IAzure and FabricClient can connect to their associated Azure resources remotely, so the scaling service could just as easily be a console application or a Windows service running outside the Service Fabric application.

Given the limitations of manual scaling and autoscale rules described above, you may wish to implement more customized automatic scaling models.

Scaling up and down, or vertical scaling

Vertical scaling changes the resources (CPU, memory, or storage) of nodes in the cluster.

  • Advantages: Software and application architecture stays the same.
  • Disadvantages: Finite scale, since there is a limit to how much you can increase resources on individual nodes. Downtime, because you need to take physical or virtual machines offline in order to add or remove resources.

Virtual machine scale sets are an Azure compute resource that you can use to deploy and manage a collection of virtual machines as a set. Every node type that is defined in an Azure cluster is set up as a separate scale set. Each node type can then be managed separately. Scaling a node type up or down involves changing the SKU of the virtual machine instances in the scale set.

Warning

We recommend that you do not change the VM SKU of a scale set/node type unless it is running at Silver durability or greater. Changing the VM SKU size is a data-destructive, in-place infrastructure operation. Without some ability to delay or monitor this change, the operation can cause data loss for stateful services or other unforeseen operational issues, even for stateless workloads.

When scaling an Azure cluster, keep the following guideline in mind:

  • If scaling down a primary node type, never scale it down further than the reliability tier allows.

The process of scaling a node type up or down differs depending on whether it is a non-primary or a primary node type.

Scaling non-primary node types

Create a new node type with the resources you need. Update the placement constraints of running services to include the new node type. Gradually (one instance at a time), reduce the instance count of the old node type to zero so that the reliability of the cluster is not affected. Services will gradually migrate to the new node type as the old node type is decommissioned.
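
The procedure above can be written out as an explicit, ordered plan, which makes the "one instance at a time" requirement easy to review before anything is executed. The Python sketch below only generates human-readable steps; `plan_node_type_migration` and the node type names are hypothetical, and no Azure API is called.

```python
from typing import List


def plan_node_type_migration(old_name: str, new_name: str, old_count: int) -> List[str]:
    """Build the ordered steps for replacing a non-primary node type."""
    if old_count < 0:
        raise ValueError("old_count must be non-negative")
    steps = [
        f"Create node type '{new_name}' with the required resources",
        f"Update service placement constraints to include '{new_name}'",
    ]
    # Drain the old node type one instance at a time, down to zero.
    for remaining in range(old_count - 1, -1, -1):
        steps.append(f"Reduce '{old_name}' instance count to {remaining}")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_node_type_migration("nt-old", "nt-new", 3), start=1):
        print(f"{number}. {step}")
```

In practice each reduction step should wait until the cluster reports healthy before the next one starts; that check is omitted here.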

Scaling the primary node type

We recommend that you do not change the VM SKU of the primary node type. If you need more cluster capacity, we recommend adding more instances.

If that is not possible, you can create a new cluster and restore application state (if applicable) from your old cluster. You do not need to restore any system service state; it is recreated when you deploy your applications to the new cluster. If you were running only stateless applications on your cluster, you simply deploy your applications to the new cluster and have nothing to restore. If you decide to go the unsupported route and change the VM SKU, modify the virtual machine scale set model definition to reflect the new SKU. If your cluster has only one node type, make sure that all your stateful applications respond to all service replica lifecycle events (such as a replica stuck in build) in a timely fashion and that your service replica rebuild duration is less than five minutes (for the Silver durability level).

Next steps