Defragmentation of metrics and load in Service Fabric

The Service Fabric Cluster Resource Manager's default strategy for managing load metrics in the cluster is to distribute the load. Ensuring that nodes are evenly utilized avoids hot and cold spots that lead to both contention and wasted resources. Distributing workloads across the cluster is also the safest approach for surviving failures, since it ensures that a single failure doesn't take out a large percentage of a given workload.

The Service Fabric Cluster Resource Manager also supports a different strategy for managing load: defragmentation. With defragmentation, the utilization of a metric is consolidated instead of being distributed across the cluster. Consolidation is an inversion of the default balancing strategy: instead of minimizing the average standard deviation of metric load, the Cluster Resource Manager tries to increase it.
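To make the "inversion" concrete, here is a minimal Python sketch that compares the standard deviation of one metric's load for a balanced layout and a consolidated layout. The five-node cluster and the load numbers are hypothetical, and this is only an illustration of the idea, not the scoring that the Cluster Resource Manager actually performs.

```python
from statistics import pstdev

# Hypothetical per-node loads for a single metric on a five-node cluster.
# Both layouts carry the same total load (100); only the placement differs.
balanced = [20, 20, 20, 20, 20]
defragmented = [50, 50, 0, 0, 0]


def load_spread(node_loads):
    """Population standard deviation of a metric's load across nodes."""
    return pstdev(node_loads)


# Balancing drives this value down; defragmentation drives it up.
print("balanced:", load_spread(balanced))
print("defragmented:", load_spread(defragmented))
```

The consolidated layout has the larger deviation because the load sits on two nodes while the other three stay empty.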

When to use defragmentation

Distributing load in the cluster consumes some of the resources on each node. Some workloads create services that are exceptionally large and consume most or all of a node. In these cases, when a large workload is created, there may not be enough space on any single node to run it. Large workloads aren't a problem in Service Fabric; in these cases the Cluster Resource Manager determines that it needs to reorganize the cluster to make room for the large workload. In the meantime, however, that workload has to wait to be scheduled in the cluster.

If there are many services and a lot of state to move around, it could take a long time for the large workload to be placed in the cluster. This is more likely if other workloads in the cluster are also large and so take longer to reorganize. The Service Fabric team measured creation times in simulations of this scenario. We found that creating large services took much longer once cluster utilization rose above roughly 30% to 50%. To handle this scenario, we introduced defragmentation as a balancing strategy. We found that for large workloads, especially ones where creation time was important, defragmentation really helped those new workloads get scheduled in the cluster.

You can configure defragmentation metrics to have the Cluster Resource Manager proactively try to condense the load of the services onto fewer nodes. This helps ensure that there is almost always room for large services without reorganizing the cluster. Not having to reorganize the cluster allows large workloads to be created quickly.

Most people don't need defragmentation. Services are usually small, so it's not hard to find room for them in the cluster. When reorganization is possible, it goes quickly, again because most services are small and can be moved quickly and in parallel. However, if you have large services and need them created quickly, then the defragmentation strategy is for you. We'll discuss the tradeoffs of using defragmentation next.

Defragmentation tradeoffs

Defragmentation can increase the impact of failures, since more services are running on the nodes that fail. Defragmentation can also increase costs, since resources in the cluster must be held in reserve, waiting for the creation of large workloads.

The following diagram gives a visual representation of two clusters, one that is defragmented and one that is not.

![Comparing Balanced and Defragmented Clusters][Image1]

In the balanced cluster, consider the number of movements that would be necessary to place one of the largest service objects. In the defragmented cluster, the large workload could be placed on nodes four or five without having to wait for any other services to move.
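The comparison can be sketched in a few lines of Python. The node capacity, per-node loads, and size of the large service below are hypothetical numbers chosen for illustration (they are not read from the diagram), and `nodes_with_room` is a made-up helper, not a Service Fabric API.

```python
def nodes_with_room(node_loads, node_capacity, required):
    """Return zero-based indexes of nodes whose free capacity can hold `required`."""
    return [
        index
        for index, used in enumerate(node_loads)
        if node_capacity - used >= required
    ]


# Assumed values: five nodes, each with capacity 100, total load 300 in both layouts.
capacity = 100
balanced = [60, 60, 60, 60, 60]
defragmented = [100, 100, 100, 0, 0]
large_service = 80

# Balanced: no node has 80 free, so other services must move first.
print(nodes_with_room(balanced, capacity, large_service))
# Defragmented: the fourth and fifth nodes (indexes 3 and 4) can take it immediately.
print(nodes_with_room(defragmented, capacity, large_service))
```

In the balanced layout every node has some free space, but no single node has enough, which is exactly the situation that forces a reorganization before placement.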

Defragmentation pros and cons

So what are those other conceptual tradeoffs? Here's a quick table of things to think about:

| Defragmentation Pros | Defragmentation Cons |
| --- | --- |
| Allows faster creation of large services | Concentrates load onto fewer nodes, increasing contention |
| Enables lower data movement during creation | Failures can impact more services and cause more churn |
| Allows rich description of requirements and reclamation of space | More complex overall Resource Management configuration |

You can mix defragmented and normal metrics in the same cluster. The Cluster Resource Manager tries to consolidate the defragmentation metrics as much as possible while spreading out the others. The results of mixing defragmentation and balancing strategies depend on several factors, including:

  • The number of balancing metrics vs. the number of defragmentation metrics
  • Whether any service uses both types of metrics
  • The metric weights
  • The current metric loads

Experimentation is required to determine the exact configuration necessary. We recommend thorough measurement of your workloads before you enable defragmentation metrics in production. This is especially true when mixing defragmentation and balancing metrics within the same service.

Configuring defragmentation metrics

Configuring defragmentation metrics is a global decision in the cluster, and individual metrics can be selected for defragmentation. The following config snippets show how to configure metrics for defragmentation. In this case, "Metric1" is configured as a defragmentation metric, while "Metric2" continues to be balanced normally.

ClusterManifest.xml:

<Section Name="DefragmentationMetrics">
    <Parameter Name="Metric1" Value="true" />
    <Parameter Name="Metric2" Value="false" />
</Section>

Via ClusterConfig.json for standalone deployments or Template.json for Azure-hosted clusters:

"fabricSettings": [
  {
    "name": "DefragmentationMetrics",
    "parameters": [
      {
          "name": "Metric1",
          "value": "true"
      },
      {
          "name": "Metric2",
          "value": "false"
      }
    ]
  }
]
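If you maintain these settings from a script, a small helper can build the same section from a mapping of metric names to flags. The Python sketch below is a hypothetical convenience, not part of any Service Fabric tooling; the function name `defragmentation_section` is made up, and it only reproduces the JSON shape shown above.

```python
import json


def defragmentation_section(metrics):
    """Build a DefragmentationMetrics section from a {metric_name: bool} mapping.

    Hypothetical helper: it mirrors the JSON shape shown above and nothing more.
    """
    return {
        "name": "DefragmentationMetrics",
        "parameters": [
            # The config expects the strings "true" / "false", not JSON booleans.
            {"name": name, "value": "true" if enabled else "false"}
            for name, enabled in metrics.items()
        ],
    }


settings = {
    "fabricSettings": [
        defragmentation_section({"Metric1": True, "Metric2": False})
    ]
}
print(json.dumps(settings, indent=2))
```

The printed `fabricSettings` array would still need to be merged into the rest of your ClusterConfig.json or Template.json by hand or by your own deployment tooling.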

Next steps

  • The Cluster Resource Manager has many options for describing the cluster. To find out more about them, check out this article on describing a Service Fabric cluster.
  • Metrics are how the Service Fabric Cluster Resource Manager manages consumption and capacity in the cluster. To learn more about metrics and how to configure them, check out this article.