Throttling the Service Fabric Cluster Resource Manager

Even if you've configured the Cluster Resource Manager correctly, the cluster can get disrupted. For example, there could be simultaneous node and fault domain failures - what would happen if that occurred during an upgrade? The Cluster Resource Manager always tries to fix everything, and in doing so it consumes the cluster's resources as it reorganizes and repairs the cluster. Throttles provide a backstop so that the cluster can use its resources to stabilize - the nodes come back, the network partitions heal, corrected bits get deployed.

To help with these sorts of situations, the Service Fabric Cluster Resource Manager includes several throttles. These throttles are all fairly large hammers. Generally, they shouldn't be changed without careful planning and testing.

If you change the Cluster Resource Manager's throttles, tune them to your expected actual load. You may decide that you need some throttles in place, even if it means the cluster takes longer to stabilize in some situations. Testing is required to determine the correct values. Throttles need to be high enough to let the cluster respond to changes in a reasonable amount of time, and low enough to actually prevent excessive resource consumption.

Most of the time when we've seen customers use throttles, it has been because they were already in a resource-constrained environment. Examples include limited network bandwidth for individual nodes, or disks that can't build many stateful replicas in parallel because of throughput limitations. Without throttles, operations could overwhelm these resources, causing operations to fail or be slow. In these situations customers used throttles knowing that they were extending the time it would take the cluster to reach a stable state. They also understood that they could end up running at lower overall reliability while throttled.

Configuring the throttles

Service Fabric has two mechanisms for throttling the number of replica movements. The default mechanism, which existed before Service Fabric 5.7, represents throttling as an absolute number of moves allowed. This approach does not suit clusters of every size. In particular, for large clusters the default value can be too small, significantly slowing down balancing even when it is necessary, while having no effect in smaller clusters. This prior mechanism has been superseded by percentage-based throttling, which scales better in dynamic clusters where the number of services and nodes changes regularly.

The throttles are based on a percentage of the number of replicas in the cluster. For example, percentage-based throttles make it possible to express the rule: "do not move more than 10% of replicas in a 10 minute interval".

The configuration settings for percentage-based throttling are:

  • GlobalMovementThrottleThresholdPercentage - Maximum number of movements allowed in the cluster at any time, expressed as a percentage of the total number of replicas in the cluster. 0 indicates no limit. The default value is 0. If both this setting and GlobalMovementThrottleThreshold are specified, then the more conservative limit is used.
  • GlobalMovementThrottleThresholdPercentageForPlacement - Maximum number of movements allowed during the placement phase, expressed as a percentage of the total number of replicas in the cluster. 0 indicates no limit. The default value is 0. If both this setting and GlobalMovementThrottleThresholdForPlacement are specified, then the more conservative limit is used.
  • GlobalMovementThrottleThresholdPercentageForBalancing - Maximum number of movements allowed during the balancing phase, expressed as a percentage of the total number of replicas in the cluster. 0 indicates no limit. The default value is 0. If both this setting and GlobalMovementThrottleThresholdForBalancing are specified, then the more conservative limit is used.

When specifying the throttle percentage, express it as a fraction: 5% is written as 0.05. The interval over which these throttles are counted is the GlobalMovementThrottleCountingInterval, which is specified in seconds.

<Section Name="PlacementAndLoadBalancing">
     <Parameter Name="GlobalMovementThrottleThresholdPercentage" Value="0" />
     <Parameter Name="GlobalMovementThrottleThresholdPercentageForPlacement" Value="0" />
     <Parameter Name="GlobalMovementThrottleThresholdPercentageForBalancing" Value="0" />
     <Parameter Name="GlobalMovementThrottleCountingInterval" Value="600" />
</Section>

The same settings via ClusterConfig.json for standalone deployments, or Template.json for Azure-hosted clusters:

"fabricSettings": [
  {
    "name": "PlacementAndLoadBalancing",
    "parameters": [
      {
          "name": "GlobalMovementThrottleThresholdPercentage",
          "value": "0.0"
      },
      {
          "name": "GlobalMovementThrottleThresholdPercentageForPlacement",
          "value": "0.0"
      },
      {
          "name": "GlobalMovementThrottleThresholdPercentageForBalancing",
          "value": "0.0"
      },
      {
          "name": "GlobalMovementThrottleCountingInterval",
          "value": "600"
      }
    ]
  }
]
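As an illustration, the rule quoted earlier ("do not move more than 10% of replicas in a 10 minute interval") could be written as the following fragment. This is a minimal sketch: it reuses only the section and parameter names listed above, and the values are illustrative, not recommendations.

```xml
<Section Name="PlacementAndLoadBalancing">
     <!-- 10% of all replicas, written as a fraction (illustrative value) -->
     <Parameter Name="GlobalMovementThrottleThresholdPercentage" Value="0.1" />
     <!-- Counting interval in seconds: 600 s = 10 minutes -->
     <Parameter Name="GlobalMovementThrottleCountingInterval" Value="600" />
</Section>
```

The per-phase percentage settings are omitted here, so they keep their default of 0 (no limit). If GlobalMovementThrottleThreshold is also specified, the more conservative of the two limits applies.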

Default count based throttles

This information is provided in case you have older clusters, or still retain these configurations in clusters that have since been upgraded. In general, we recommend replacing these throttles with the percentage-based throttles above. Because percentage-based throttling is disabled by default, the count-based throttles remain the default throttles for a cluster until they are disabled and replaced with the percentage-based throttles.

  • GlobalMovementThrottleThreshold - This setting controls the total number of movements in the cluster over a period of time. The period is specified in seconds as the GlobalMovementThrottleCountingInterval. The default value for GlobalMovementThrottleThreshold is 1000, and the default value for GlobalMovementThrottleCountingInterval is 600.
  • MovementPerPartitionThrottleThreshold - This setting controls the total number of movements for any service partition over a period of time. The period is specified in seconds as the MovementPerPartitionThrottleCountingInterval. The default value for MovementPerPartitionThrottleThreshold is 50, and the default value for MovementPerPartitionThrottleCountingInterval is 600.

The configuration for these throttles follows the same pattern as the percentage-based throttling.
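A sketch of what that pattern might look like, using the default values listed above. It assumes these parameters live in the same PlacementAndLoadBalancing section as the percentage-based settings, which the text implies but does not show explicitly.

```xml
<Section Name="PlacementAndLoadBalancing">
     <!-- Cluster-wide: at most 1000 movements per 600-second interval (defaults) -->
     <Parameter Name="GlobalMovementThrottleThreshold" Value="1000" />
     <Parameter Name="GlobalMovementThrottleCountingInterval" Value="600" />
     <!-- Per partition: at most 50 movements per 600-second interval (defaults) -->
     <Parameter Name="MovementPerPartitionThrottleThreshold" Value="50" />
     <Parameter Name="MovementPerPartitionThrottleCountingInterval" Value="600" />
</Section>
```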

Next steps

  • To find out how the Cluster Resource Manager manages and balances load in the cluster, check out the article on balancing load.
  • The Cluster Resource Manager has many options for describing the cluster. To find out more about them, check out this article on describing a Service Fabric cluster.