Service Fabric cluster capacity planning considerations

Cluster capacity planning is important for every Service Fabric production environment. Key considerations include:

  • Initial number and properties of cluster node types

  • Durability level of each node type, which determines Service Fabric VM privileges within Azure infrastructure

  • Reliability level of the cluster, which determines the stability of Service Fabric system services and overall cluster function

This article walks you through the significant decision points for each of these areas.

Initial number and properties of cluster node types

A node type defines the size, number, and properties for a set of nodes (virtual machines) in the cluster. Every node type that is defined in a Service Fabric cluster maps to a virtual machine scale set.

Because each node type is a distinct scale set, it can be scaled up or down independently, have different sets of ports open, and have different capacity metrics. For more information about the relationship between node types and virtual machine scale sets, see Service Fabric cluster node types.

Each cluster requires one primary node type, which runs critical system services that provide Service Fabric platform capabilities. Although it's possible to also use primary node types to run your applications, it's recommended to dedicate them solely to running system services.

Non-primary node types can be used to define application roles (such as front-end and back-end services) and to physically isolate services within a cluster. Service Fabric clusters can have zero or more non-primary node types.

The primary node type is configured using the isPrimary attribute under the node type definition in the Azure Resource Manager deployment template. See the NodeTypeDescription object for the full list of node type properties. For example usage, open any AzureDeploy.json file in Service Fabric cluster samples and use Find on Page to search for the nodeTypes object.
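
As a rough illustration of the isPrimary attribute, the sketch below checks that a nodeTypes array declares exactly one primary node type. The node type names and the sample array are hypothetical; only the property names (name, isPrimary, vmInstanceCount) are meant to mirror the template.

```python
def primary_node_types(node_types):
    """Return the names of the node types flagged with isPrimary."""
    return [nt["name"] for nt in node_types if nt.get("isPrimary", False)]


# Hypothetical nodeTypes array, shaped like the one in a deployment template.
node_types = [
    {"name": "nt-system", "isPrimary": True, "vmInstanceCount": 5},
    {"name": "nt-frontend", "isPrimary": False, "vmInstanceCount": 5},
    {"name": "nt-backend", "isPrimary": False, "vmInstanceCount": 5},
]

primaries = primary_node_types(node_types)
if len(primaries) != 1:
    # The text above states that each cluster requires one primary node type.
    raise ValueError(f"expected exactly one primary node type, found {primaries}")
print(primaries[0])
```

A check like this could run against the parsed template before deployment, so that a missing or duplicated primary node type is caught early.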

Node type planning considerations

The number of initial node types depends on the purpose of your cluster and the applications and services running on it. Consider the following questions:

  • Does your application have multiple services, and do any of them need to be public or internet facing?

    Typical applications contain a front-end gateway service that receives input from a client, and one or more back-end services that communicate with the front-end services, with separate networking between the front-end and back-end services. These cases typically require three node types: one primary node type, and two non-primary node types (one each for the front-end and back-end services).

  • Do the services that make up your application have different infrastructure needs such as greater RAM or higher CPU cycles?

    Often, front-end services can run on smaller VMs (VM sizes like D2) that have ports open to the internet. Computationally intensive back-end services might need to run on larger VMs (with VM sizes like D4, D6, D15) that are not internet-facing. Defining different node types for these services allows you to make more efficient and secure use of the underlying Service Fabric VMs, and enables the services to scale independently. For more on estimating the amount of resources you'll need, see Capacity planning for Service Fabric applications.

  • Will any of your application services need to scale out beyond 100 nodes?

    A single node type can't reliably scale beyond 100 nodes per virtual machine scale set for Service Fabric applications. Running more than 100 nodes requires additional virtual machine scale sets (and therefore additional node types).
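
The 100-node limit translates into simple arithmetic when sizing a large service. The sketch below is only an illustration; the function name is hypothetical, and it assumes nodes are packed as densely as the limit allows.

```python
MAX_NODES_PER_NODE_TYPE = 100  # per-scale-set limit stated in the text above


def node_types_needed(total_nodes):
    """Return the minimum number of node types (scale sets) for total_nodes."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    # Ceiling division using integers only.
    return -(-total_nodes // MAX_NODES_PER_NODE_TYPE)


print(node_types_needed(250))
```

In practice you might choose more node types than this minimum, for example to leave headroom for later scale-out within each scale set.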

When determining the number and properties of node types for the initial creation of your cluster, keep in mind that you can always add, modify, or remove (non-primary) node types once your cluster is deployed. Primary node types can also be modified in running clusters (though such operations require a great deal of planning and caution in production environments).

A further consideration for your node type properties is durability level, which determines the privileges a node type's VMs have within Azure infrastructure. Use the size of the VMs you choose for your cluster and the instance count you assign to individual node types to help determine the appropriate durability tier for each of your node types, as described next.

Durability characteristics of the cluster

The durability level designates the privileges your Service Fabric VMs have with the underlying Azure infrastructure. This privilege allows Service Fabric to pause any VM-level infrastructure request (such as reboot, reimage, or migration) that impacts the quorum requirements for Service Fabric system services and your stateful services.

Important

Durability level is set per node type. If none is specified, the Bronze tier is used; however, Bronze doesn't provide automatic OS upgrades. Silver or Gold durability is recommended for production workloads.

The table below lists Service Fabric durability tiers, their requirements, and affordances.

Durability tier | Required minimum number of VMs | Supported VM sizes | Updates you make to your virtual machine scale set | Updates and maintenance initiated by Azure
Gold | 5 | Full-node sizes dedicated to a single customer (for example, DS15_v2, D15_v2) | Can be delayed until approved by the Service Fabric cluster | Can be paused for 2 hours per upgrade domain to allow additional time for replicas to recover from earlier failures
Silver | 5 | VMs of single core or above with at least 50 GB of local SSD | Can be delayed until approved by the Service Fabric cluster | Cannot be delayed for any significant period of time
Bronze | 1 | VMs with at least 50 GB of local SSD | Will not be delayed by the Service Fabric cluster | Cannot be delayed for any significant period of time
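
The minimum VM counts in the table lend themselves to a quick pre-deployment check. The following sketch is illustrative only: the function name and the sample node type are hypothetical, and it checks nothing beyond the per-tier minimum count.

```python
# Minimum VM counts per durability tier, copied from the table above.
MIN_VMS_BY_DURABILITY = {"Bronze": 1, "Silver": 5, "Gold": 5}


def durability_problems(node_type):
    """Return a list of human-readable problems for one node type."""
    # Bronze is used when no durability level is specified.
    level = node_type.get("durabilityLevel", "Bronze")
    count = node_type["vmInstanceCount"]
    if level not in MIN_VMS_BY_DURABILITY:
        return [f"unknown durability level: {level}"]
    minimum = MIN_VMS_BY_DURABILITY[level]
    if count < minimum:
        return [f"{level} durability needs at least {minimum} VMs, found {count}"]
    return []


# Hypothetical node type that is too small for its tier.
print(durability_problems(
    {"name": "nt-backend", "durabilityLevel": "Silver", "vmInstanceCount": 3}
))
```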

Warning

With Bronze durability, automatic OS image upgrade isn't available. While Patch Orchestration Application (intended only for non-Azure hosted clusters) is not recommended for Silver or greater durability levels, it is your only option to automate Windows updates with respect to Service Fabric upgrade domains.

Important

Regardless of durability level, running a Deallocation operation on a virtual machine scale set will destroy the cluster.

Bronze

Node types running with Bronze durability obtain no privileges. This means that infrastructure jobs that impact your stateful workloads won't be stopped or delayed. Use Bronze durability for node types that only run stateless workloads. For production workloads, running Silver or above is recommended.

Silver and Gold

Use Silver or Gold durability for all node types that host stateful services you expect to scale in frequently, and where you are willing to accept delayed deployment operations and reduced capacity in exchange for a simpler process. Scale-out scenarios should not affect your choice of durability tier.

Advantages

  • Reduces the number of required steps for scale-in operations (node deactivation and Remove-ServiceFabricNodeState are called automatically).
  • Reduces the risk of data loss due to in-place VM size change operations and Azure infrastructure operations.

Disadvantages

  • Deployments to virtual machine scale sets and other related Azure resources can time out, be delayed, or be blocked entirely by problems in your cluster or at the infrastructure level.
  • Increases the number of replica lifecycle events (for example, primary swaps) due to automated node deactivations during Azure infrastructure operations.
  • Takes nodes out of service for periods of time while Azure platform software updates or hardware maintenance activities are occurring. You may see nodes with status Disabling/Disabled during these activities. This reduces the capacity of your cluster temporarily, but should not impact the availability of your cluster or applications.

Best practices for Silver and Gold durability node types

Follow these recommendations for managing node types with Silver or Gold durability:

  • Keep your cluster and applications healthy at all times, and make sure that applications respond to all service replica lifecycle events (such as a replica build that is stuck) in a timely fashion.
  • Adopt safer ways to make a VM size change (scale up/down). Changing the VM size of a virtual machine scale set requires careful planning and caution. For details, see Scale up a Service Fabric node type.
  • Maintain a minimum count of five nodes for any virtual machine scale set that has a durability level of Gold or Silver enabled. Your cluster will enter an error state if you scale in below this threshold, and you'll need to manually clean up state (Remove-ServiceFabricNodeState) for the removed nodes.
  • Each virtual machine scale set with durability level Silver or Gold must map to its own node type in the Service Fabric cluster. Mapping multiple virtual machine scale sets to a single node type will prevent coordination between the Service Fabric cluster and the Azure infrastructure from working properly.
  • Do not delete random VM instances; always use the virtual machine scale set scale-in feature. Deleting random VM instances can create imbalances in the spread of VM instances across upgrade domains and fault domains. This imbalance could adversely affect the system's ability to properly load balance among the service instances and service replicas.
  • If using Autoscale, set the rules such that scale-in operations (removing VM instances) are done only one node at a time. Scaling in by more than one instance at a time is not safe.
  • If deleting or deallocating VMs on the primary node type, never reduce the count of allocated VMs below what the reliability tier requires. These operations will be blocked indefinitely in a scale set with a durability level of Silver or Gold.
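
Two of the recommendations above (scale in one node at a time, and never drop a Silver or Gold scale set below five nodes) can be combined into a small planning helper. This is a sketch under those two rules only; the function name is hypothetical, and it does not call any Azure API.

```python
MIN_NODES_SILVER_GOLD = 5  # minimum count recommended in the list above


def scale_in_steps(current, target, durability):
    """Return the successive instance counts for a one-node-at-a-time scale-in."""
    if target > current:
        raise ValueError("target must not exceed the current instance count")
    if durability in ("Silver", "Gold") and target < MIN_NODES_SILVER_GOLD:
        raise ValueError(
            f"{durability} scale sets should keep at least {MIN_NODES_SILVER_GOLD} nodes"
        )
    # One entry per step; each step removes exactly one VM instance.
    return list(range(current - 1, target - 1, -1))


print(scale_in_steps(8, 5, "Silver"))
```

Each returned count would correspond to one scale-in operation, with the cluster allowed to settle between steps.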

Changing durability levels

Within certain constraints, node type durability level can be adjusted:

  • Node types with durability levels of Silver or Gold can't be downgraded to Bronze.
  • Upgrading from Bronze to Silver or Gold can take a few hours.
  • When changing durability level, be sure to update it in both the Service Fabric extension configuration in your virtual machine scale set resource and in the node type definition in your Service Fabric cluster resource. These values must match.
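
The constraints above can be expressed as a short validation routine. The sketch below is an illustration; the function and parameter names are hypothetical, and the two "new" arguments stand for the value in the cluster resource's node type definition and the value in the scale set's Service Fabric extension configuration.

```python
def durability_change_problems(current, new_cluster_level, new_extension_level):
    """Return a list of problems with a proposed durability level change."""
    problems = []
    # The two places where the level is configured must agree.
    if new_cluster_level != new_extension_level:
        problems.append(
            "cluster resource and scale set extension must use the same level"
        )
    # Silver and Gold cannot be downgraded to Bronze.
    if current in ("Silver", "Gold") and "Bronze" in (new_cluster_level, new_extension_level):
        problems.append(f"{current} cannot be downgraded to Bronze")
    return problems


print(durability_change_problems("Bronze", "Silver", "Silver"))
print(durability_change_problems("Gold", "Bronze", "Bronze"))
```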

Another consideration in capacity planning is the reliability level for your cluster, which determines the stability of system services and your overall cluster, as described in the next section.

Reliability characteristics of the cluster

The cluster reliability level determines the number of system service replicas running on the primary node type of the cluster. The more replicas, the more reliable the system services (and therefore the cluster as a whole).

Important

Reliability level is set at the cluster level and determines the minimum number of nodes of the primary node type. Production workloads require a reliability level of Silver (greater than or equal to five nodes) or above.

The reliability tier can take the following values:

  • Platinum - System services run with a target replica set count of nine
  • Gold - System services run with a target replica set count of seven
  • Silver - System services run with a target replica set count of five
  • Bronze - System services run with a target replica set count of three

The following table gives the recommended reliability tier for a given number of nodes. The number of seed nodes is also set to the minimum number of nodes for a reliability tier.

Number of nodes | Reliability tier
1 | Do not specify the reliabilityLevel parameter: the system calculates it.
3 | Bronze
5 or 6 | Silver
7 or 8 | Gold
9 and up | Platinum
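
The recommendation table can be read as a lookup from node count to tier. The sketch below encodes it; the function name is hypothetical. The table does not list 2 or 4 nodes, so as an assumption of this sketch, 4 nodes falls back to Bronze (the highest tier whose replica count fits) and 2 nodes is rejected.

```python
def recommended_reliability(node_count):
    """Return the recommended reliability tier, or None to omit reliabilityLevel."""
    if node_count == 1:
        return None  # one-node cluster: leave reliabilityLevel out entirely
    if node_count >= 9:
        return "Platinum"
    if node_count >= 7:
        return "Gold"
    if node_count >= 5:
        return "Silver"
    if node_count >= 3:
        return "Bronze"  # covers 3 (from the table) and 4 (assumption)
    raise ValueError(f"no recommendation for {node_count} nodes")


for count in (1, 3, 6, 8, 12):
    print(count, recommended_reliability(count))
```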

When you increase or decrease the size of your cluster (the sum of VM instances in all node types), consider updating the reliability of your cluster from one tier to another. Doing this triggers the cluster upgrades needed to change the system services replica set count. Wait for the upgrade in progress to complete before making any other changes to the cluster, like adding nodes. You can monitor the progress of the upgrade in Service Fabric Explorer or by running Get-ServiceFabricClusterUpgrade.

Capacity planning for reliability

The capacity needs of your cluster will be determined by your specific workload and reliability requirements. This section provides general guidance to help you get started with capacity planning.

Virtual machine sizing

For production workloads, the recommended VM size (SKU) is Standard D2_V2 (or equivalent) with a minimum of 50 GB of local SSD, 2 cores, and 4 GiB of memory. A minimum of 50 GB of local SSD is recommended; however, some workloads (such as those running Windows containers) will require larger disks. When choosing other VM sizes for production workloads, keep in mind the following constraints:

  • Partial core VM sizes like Standard A0 are not supported.
  • A-series VM sizes are not supported for performance reasons.

Primary node type

Production workloads on Azure require a minimum of five primary nodes (VM instances) and a reliability tier of Silver. It's recommended to dedicate the cluster primary node type to system services, and use placement constraints to deploy your application to secondary node types.

Test workloads in Azure can run a minimum of one or three primary nodes. To configure a one-node cluster, be sure that the reliabilityLevel setting is completely omitted in your Resource Manager template (specifying an empty string value for reliabilityLevel is not sufficient). If you set up the one-node cluster with the Azure portal, this configuration is done automatically.

Warning

One-node clusters run with a special configuration without reliability and where scale-out is not supported.

Non-primary node types

The minimum number of nodes for a non-primary node type depends on the specific durability level of the node type. You should plan the number of nodes (and durability level) based on the number of replicas of applications or services that you want to run for the node type, and depending on whether the workload is stateful or stateless. Keep in mind you can increase or decrease the number of VMs in a node type anytime after you have deployed the cluster.

Stateful workloads

For stateful production workloads using Service Fabric reliable collections or reliable Actors, a minimum and target replica count of five is recommended. With this, in steady state you end up with a replica (from a replica set) in each fault domain and upgrade domain. In general, use the reliability level you set for system services as a guide for the replica count you use for your stateful services.

Stateless workloads

For stateless production workloads, the minimum supported non-primary node type size is three to maintain quorum; however, a node type size of five is recommended.

Next steps

Before configuring your cluster, review the Not Allowed cluster upgrade policies to avoid having to recreate your cluster later due to otherwise unchangeable system configuration settings.

For more on cluster planning, see: