Managing resource consumption and load in Service Fabric with metrics

Metrics are the resources that your services care about and that the nodes in the cluster provide. A metric is anything that you want to manage in order to improve or monitor the performance of your services. For example, you might watch memory consumption to know whether your service is overloaded. Another use is to figure out whether the service could move to a node where memory is less constrained in order to get better performance.

Memory, disk, and CPU usage are examples of metrics. These are physical metrics: they correspond to physical resources on the node that need to be managed. Metrics can also be (and commonly are) logical metrics, such as "MyWorkQueueDepth", "MessagesToProcess", or "TotalRecords". Logical metrics are application-defined and correspond only indirectly to some physical resource consumption. They are common because it can be hard to measure and report consumption of physical resources on a per-service basis. The complexity of measuring and reporting your own physical metrics is also why Service Fabric provides some default metrics.

Default metrics

Let's say that you want to get started writing and deploying your service. At this point you don't know what physical or logical resources it consumes. That's fine! The Service Fabric Cluster Resource Manager uses some default metrics when no other metrics are specified. They are:

  • PrimaryCount - count of Primary replicas on the node
  • ReplicaCount - count of total stateful replicas on the node
  • Count - count of all service objects (stateless and stateful) on the node

Metric        Stateless Instance Load   Stateful Secondary Load   Stateful Primary Load   Weight
PrimaryCount  0                         0                         1                       High
ReplicaCount  0                         1                         1                       Medium
Count         1                         1                         1                       Low
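To make the table concrete, here is a minimal Python sketch that sums the default metric loads per node for a given placement. The per-role loads come from the table; the `placement` input format and the function name are hypothetical and exist only for illustration. This is not how the Cluster Resource Manager is implemented.

```python
from collections import defaultdict

# Default loads per role, copied from the table above.
DEFAULT_LOADS = {
    "primary":   {"PrimaryCount": 1, "ReplicaCount": 1, "Count": 1},
    "secondary": {"PrimaryCount": 0, "ReplicaCount": 1, "Count": 1},
    "stateless": {"PrimaryCount": 0, "ReplicaCount": 0, "Count": 1},
}

def node_totals(placement):
    """Sum default metric loads per node.

    placement: list of (node_name, role) tuples (hypothetical input format).
    """
    totals = defaultdict(lambda: {"PrimaryCount": 0, "ReplicaCount": 0, "Count": 0})
    for node, role in placement:
        for metric, load in DEFAULT_LOADS[role].items():
            totals[node][metric] += load
    return {node: dict(metrics) for node, metrics in totals.items()}

# Illustrative placement: N1 hosts one primary, one secondary, one stateless instance.
placement = [("N1", "primary"), ("N1", "secondary"), ("N1", "stateless"), ("N2", "secondary")]
totals = node_totals(placement)
print(totals)
```

Note that a busy primary and an idle primary both add exactly 1 to PrimaryCount, which is why the default metrics cannot distinguish between them.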

For basic workloads, the default metrics provide a decent distribution of work in the cluster. In the following example, let's see what happens when we create two services and rely on the default metrics for balancing. The first service is a stateful service with three partitions and a target replica set size of three. The second service is a stateless service with one partition and an instance count of three.

Here's what you get:


Some things to note:

  • Primary replicas for the stateful service are distributed across several nodes
  • Replicas for the same partition are on different nodes
  • The total number of primaries and secondaries is distributed across the cluster
  • The total number of service objects is evenly allocated across the nodes


The default metrics work great as a start. However, they will only carry you so far. For example: what's the likelihood that the partitioning scheme you picked results in perfectly even utilization by all partitions? What's the chance that the load for a given service is constant over time, or even just the same across multiple partitions right now?

You could run with just the default metrics. However, doing so usually means that your cluster utilization is lower and more uneven than you'd like. This is because the default metrics aren't adaptive and presume everything is equivalent. For example, a Primary that is busy and one that is idle both contribute "1" to the PrimaryCount metric. In the worst case, using only the default metrics can also result in overscheduled nodes and, in turn, performance issues. If you want to get the most out of your cluster and avoid performance issues, you need to use custom metrics and dynamic load reporting.

Custom metrics

Metrics are configured on a per-named-service-instance basis when you're creating the service.

Any metric has some properties that describe it: a name, a weight, and a default load.

  • Metric Name: The name of the metric. The metric name is a unique identifier for the metric within the cluster from the Resource Manager's perspective.
  • Weight: Metric weight defines how important this metric is relative to the other metrics for this service.
  • Default Load: The default load is represented differently depending on whether the service is stateless or stateful.
    • For stateless services, each metric has a single property named DefaultLoad
    • For stateful services you define:
      • PrimaryDefaultLoad: The default amount of this metric this service consumes when it is a Primary
      • SecondaryDefaultLoad: The default amount of this metric this service consumes when it is a Secondary


If you define custom metrics and you also want to use the default metrics, you need to explicitly add the default metrics back and define weights and values for them. This is because you must define the relationship between the default metrics and your custom metrics. For example, maybe you care about ConnectionCount or WorkQueueDepth more than Primary distribution. By default the weight of the PrimaryCount metric is High, so you want to reduce it to Medium when you add your other metrics to ensure they take precedence.

Defining metrics for your service - an example

Let's say you want the following configuration:

  • Your service reports a metric named "ConnectionCount"
  • You also want to use the default metrics
  • You've done some measurements and know that a Primary replica of that service normally takes up 20 units of "ConnectionCount"
  • Secondaries use 5 units of "ConnectionCount"
  • You know that "ConnectionCount" is the most important metric in terms of managing the performance of this particular service
  • You still want Primary replicas balanced. Balancing primary replicas is generally a good idea no matter what. It helps prevent the loss of a single node or fault domain from taking out a majority of primary replicas along with it.
  • Otherwise, the default metrics are fine

Here's the code that you would write to create a service with that metric configuration:


StatefulServiceDescription serviceDescription = new StatefulServiceDescription();
StatefulServiceLoadMetricDescription connectionMetric = new StatefulServiceLoadMetricDescription();
connectionMetric.Name = "ConnectionCount";
connectionMetric.PrimaryDefaultLoad = 20;
connectionMetric.SecondaryDefaultLoad = 5;
connectionMetric.Weight = ServiceLoadMetricWeight.High;

StatefulServiceLoadMetricDescription primaryCountMetric = new StatefulServiceLoadMetricDescription();
primaryCountMetric.Name = "PrimaryCount";
primaryCountMetric.PrimaryDefaultLoad = 1;
primaryCountMetric.SecondaryDefaultLoad = 0;
primaryCountMetric.Weight = ServiceLoadMetricWeight.Medium;

StatefulServiceLoadMetricDescription replicaCountMetric = new StatefulServiceLoadMetricDescription();
replicaCountMetric.Name = "ReplicaCount";
replicaCountMetric.PrimaryDefaultLoad = 1;
replicaCountMetric.SecondaryDefaultLoad = 1;
replicaCountMetric.Weight = ServiceLoadMetricWeight.Low;

StatefulServiceLoadMetricDescription totalCountMetric = new StatefulServiceLoadMetricDescription();
totalCountMetric.Name = "Count";
totalCountMetric.PrimaryDefaultLoad = 1;
totalCountMetric.SecondaryDefaultLoad = 1;
totalCountMetric.Weight = ServiceLoadMetricWeight.Low;


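// Attach every metric to the service description before creating the service.
serviceDescription.Metrics.Add(connectionMetric);
serviceDescription.Metrics.Add(primaryCountMetric);
serviceDescription.Metrics.Add(replicaCountMetric);
serviceDescription.Metrics.Add(totalCountMetric);

await fabricClient.ServiceManager.CreateServiceAsync(serviceDescription);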


New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceTypeName -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -Metric @("ConnectionCount,High,20,5","PrimaryCount,Medium,1,0","ReplicaCount,Low,1,1","Count,Low,1,1")


The above examples and the rest of this document describe managing metrics on a per-named-service basis. It is also possible to define metrics for your services at the service type level, by specifying them in your service manifests. Defining type-level metrics is not recommended for several reasons. The first reason is that metric names are frequently environment-specific. Unless there is a firm contract in place, you cannot be sure that the metric "Cores" in one environment isn't "MiliCores" or "CoReS" in others. If your metrics are defined in your manifest, you need to create new manifests per environment. This usually leads to a proliferation of manifests with only minor differences, which makes them harder to manage.

Metric loads are commonly assigned on a per-named-service-instance basis. For example, let's say you create one instance of the service for CustomerA, who plans to use it only lightly. Let's also say you create another for CustomerB, who has a larger workload. In this case, you'd probably want to tweak the default loads for those services. If you have metrics and loads defined via manifests and you want to support this scenario, it requires different application and service types for each customer. The values defined at service creation time override those defined in the manifest, so you could use that to set the specific defaults. However, doing that causes the values declared in the manifests to not match those the service actually runs with, which can lead to confusion.

As a reminder: if you just want to use the default metrics, you don't need to touch the metrics collection at all or do anything special when creating your service. The default metrics get used automatically when no others are defined.

Now, let's go through each of these settings in more detail and talk about the behavior that it influences.


The whole point of defining metrics is to represent some load. Load is how much of a given metric is consumed by some service instance or replica on a given node. Load can be configured at almost any point. For example:

  • Load can be defined when a service is created. This type of load configuration is called default load.
  • The metric information for a service, including default loads, can be updated after the service is created. This metric update is done by updating the service.
  • The loads for a given partition can be reset to the default values for that service. This metric update is called resetting partition load.
  • Load can be reported on a per-service-object basis, dynamically during runtime. This metric update is called reporting load.
  • Load for a partition's replicas or instances can also be updated by reporting load values through a Fabric API call. This metric update is called reporting load for a partition.

All of these strategies can be used within the same service over its lifetime.

Default load

Default load is how much of the metric each service object (stateless instance or stateful replica) of this service consumes. The Cluster Resource Manager uses this number for the load of the service object until it receives other information, such as a dynamic load report. For simpler services, the default load is a static definition: it is never updated and is used for the lifetime of the service. Default loads work great for simple capacity planning scenarios where certain amounts of resources are dedicated to different workloads and do not change.


For more information on capacity management and defining capacities for the nodes in your cluster, please see this article.

The Cluster Resource Manager allows stateful services to specify a different default load for their Primaries and Secondaries. Stateless services can only specify one value, which applies to all instances. For stateful services, the default loads for Primary and Secondary replicas are typically different because replicas do different kinds of work in each role. For example, Primaries usually serve both reads and writes and handle most of the computational burden, while Secondaries do not. Usually the default load for a primary replica is higher than the default load for secondary replicas. The real numbers should depend on your own measurements.

Dynamic load

Let's say that you've been running your service for a while. With some monitoring, you've noticed that:

  1. Some partitions or instances of a given service consume more resources than others
  2. Some services have load that varies over time.

There are many things that could cause these types of load fluctuations. For example, different services or partitions are associated with different customers with different requirements. Load could also change because the amount of work the service does varies over the course of the day. Regardless of the reason, there's usually no single number that you can use for the default load. This is especially true if you want to get the most utilization out of the cluster. Any value you pick for default load is wrong some of the time. Incorrect default loads result in the Cluster Resource Manager either over-allocating or under-allocating resources. As a result, you have nodes that are over-utilized or under-utilized even though the Cluster Resource Manager thinks the cluster is balanced. Default loads are still good since they provide some information for initial placement, but they're not a complete story for real workloads. To accurately capture changing resource requirements, the Cluster Resource Manager allows each service object to update its own load during runtime. This is called dynamic load reporting.

Dynamic load reports allow replicas or instances to adjust their allocation/reported load of metrics over their lifetime. A service replica or instance that is cold and not doing any work would usually report that it is using low amounts of a given metric. A busy replica or instance would report that it is using more.

Reporting load per replica or instance allows the Cluster Resource Manager to reorganize the individual service objects in the cluster. Reorganizing the services helps ensure that they get the resources they require. Busy services effectively get to "reclaim" resources from other replicas or instances that are currently cold or doing less work.

Within Reliable Services, the code to report load dynamically looks like this:


this.Partition.ReportLoad(new List<LoadMetric> { new LoadMetric("CurrentConnectionCount", 1234), new LoadMetric("metric1", 42) });

A service can report on any of the metrics defined for it at creation time. If a service reports load for a metric that it is not configured to use, Service Fabric ignores that report. If other, valid metrics are reported at the same time, those reports are accepted. This means service code can measure and report every metric it knows how to, and operators can specify the metric configuration to use without having to change the service code.
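The filtering behavior described above can be modeled with a short Python sketch. It is only an illustration of the rule "reports for unconfigured metrics are ignored, valid ones are kept"; the function name and the dictionary-based report format are hypothetical and are not part of any Service Fabric API.

```python
def accept_reports(configured_metrics, reports):
    """Keep only the reports whose metric name is configured for the service."""
    return {name: value for name, value in reports.items() if name in configured_metrics}

# The service was created with these metrics (illustrative configuration).
configured = {"CurrentConnectionCount", "PrimaryCount"}

# The service code reports two metrics; "metric1" is not configured.
reports = {"CurrentConnectionCount": 1234, "metric1": 42}

accepted = accept_reports(configured, reports)
print(accepted)
```

In this model, enabling "metric1" later is purely a configuration change: the reporting code stays the same and the report simply starts being accepted.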

Reporting load for a partition

The previous section describes how service replicas or instances report load themselves. There is an additional option to report load dynamically with FabricClient. When reporting load for a partition, you may report for multiple partitions at once.

Those reports are used in exactly the same way as load reports that come from the replicas or instances themselves. Reported values remain valid until new load values are reported, either by the replica or instance or by reporting a new load value for the partition.

With this API, there are multiple ways to update load in the cluster:

  • A stateful service partition can update its primary replica load.
  • Both stateless and stateful services can update the load of all their secondary replicas or instances.
  • Both stateless and stateful services can update the load of a specific replica or instance on a node.

It is also possible to combine any of those updates per partition at the same time.

Loads for multiple partitions can be updated with a single API call, in which case the output contains a response per partition. If a partition update cannot be applied for any reason, the update for that partition is skipped and a corresponding error code is returned for the targeted partition:

  • PartitionNotFound - The specified partition ID doesn't exist.
  • ReconfigurationPending - The partition is currently reconfiguring.
  • InvalidForStatelessServices - An attempt was made to change the load of a primary replica for a partition belonging to a stateless service.
  • ReplicaDoesNotExist - The secondary replica or instance does not exist on the specified node.
  • InvalidOperation - Can occur in two cases: updating load for a partition that belongs to the System application, or updating predicted load when it is not enabled.

If any of those errors are returned, you can correct the input for the affected partition and retry the update for that partition.


Guid partitionId = Guid.Parse("53df3d7f-5471-403b-b736-bde6ad584f42");
string metricName0 = "CustomMetricName0";

// New load for the primary replica of the partition.
List<MetricLoadDescription> newPrimaryReplicaLoads = new List<MetricLoadDescription>()
{
    new MetricLoadDescription(metricName0, 100)
};

// New load for the secondary replica hosted on one specific node.
string nodeName0 = "NodeName0";
List<MetricLoadDescription> newSpecificSecondaryReplicaLoads = new List<MetricLoadDescription>()
{
    new MetricLoadDescription(metricName0, 200)
};

// this.Timeout and cancellationToken are assumed to be supplied by the surrounding code.
OperationResult<UpdatePartitionLoadResultList> updatePartitionLoadResults =
    await this.FabricClient.UpdatePartitionLoadAsync(
        new UpdatePartitionLoadQueryDescription
        {
            PartitionMetricLoadDescriptionList = new List<PartitionMetricLoadDescription>()
            {
                new PartitionMetricLoadDescription(
                    partitionId,
                    newPrimaryReplicaLoads,
                    new List<MetricLoadDescription>(),   // no update for all secondaries
                    new List<ReplicaMetricLoadDescription>()
                    {
                        new ReplicaMetricLoadDescription(nodeName0, newSpecificSecondaryReplicaLoads)
                    })
            }
        },
        this.Timeout,
        cancellationToken);

This example updates the last reported load for partition 53df3d7f-5471-403b-b736-bde6ad584f42. The primary replica load for the metric CustomMetricName0 is updated to 100. At the same time, the load for the same metric for the specific secondary replica located on node NodeName0 is updated to 200.

Updating a service's metric configuration

The list of metrics associated with the service, and the properties of those metrics, can be updated dynamically while the service is live. This allows for experimentation and flexibility. Some examples of when this is useful are:

  • disabling a metric with a buggy report for a particular service
  • reconfiguring the weights of metrics based on desired behavior
  • enabling a new metric only after the code has already been deployed and validated via other mechanisms
  • changing the default load for a service based on observed behavior and consumption

The main APIs for changing metric configuration are FabricClient.ServiceManagementClient.UpdateServiceAsync in C# and Update-ServiceFabricService in PowerShell. Whatever information you specify with these APIs replaces the existing metric information for the service immediately.

Mixing default load values and dynamic load reports

Default load and dynamic loads can be used for the same service. When a service uses both, the default load serves as an estimate until dynamic reports show up. Default load is good because it gives the Cluster Resource Manager something to work with: it allows the Cluster Resource Manager to place the service objects in good locations when they are created. If no default load information is provided, placement of services is effectively random. When load reports arrive later, the initial random placement is often wrong and the Cluster Resource Manager has to move services.
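The "default until a report arrives" rule can be sketched in a few lines of Python. This is a simplified model under the assumption that the most recent report, when one exists, fully replaces the default; the function and variable names are hypothetical.

```python
def effective_load(default_load, reported_load=None):
    """Load the resource manager would use for one service object in this model.

    Falls back to the default load until a dynamic report exists.
    A reported value of 0 is still a valid report and replaces the default.
    """
    return default_load if reported_load is None else reported_load

# A primary replica created with a default load of 21 for some metric.
before_report = effective_load(default_load=21)
after_report = effective_load(default_load=21, reported_load=30)
idle_report = effective_load(default_load=21, reported_load=0)
print(before_report, after_report, idle_report)
```

The explicit `is None` check matters: testing `if reported_load` instead would treat a legitimate report of zero as "no report" and silently fall back to the default.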

Let's take our previous example and see what happens when we add some custom metrics and dynamic load reporting. In this example, we use "MemoryInMb" as an example metric.


Memory is one of the system metrics that Service Fabric can resource govern, and reporting it yourself is typically difficult. We don't actually expect you to report on Memory consumption; Memory is used here as an aid to learning about the capabilities of the Cluster Resource Manager.

Let's presume that we initially created the stateful service with the following command:


New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceTypeName -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -Metric @("MemoryInMb,High,21,11","PrimaryCount,Medium,1,0","ReplicaCount,Low,1,1","Count,Low,1,1")

As a reminder, this syntax is ("MetricName, MetricWeight, PrimaryDefaultLoad, SecondaryDefaultLoad").
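To illustrate how the four comma-separated fields map onto metric properties, here is a small Python parser for strings in that shape. It is a sketch for reading or validating such strings in your own tooling, not part of the PowerShell module; `MetricSpec` and `parse_metric` are hypothetical names, and integer loads are an assumption based on the examples in this article.

```python
from typing import NamedTuple

class MetricSpec(NamedTuple):
    name: str
    weight: str
    primary_default_load: int
    secondary_default_load: int

# The four weight levels named in this article.
VALID_WEIGHTS = {"Zero", "Low", "Medium", "High"}

def parse_metric(spec):
    """Parse 'MetricName,MetricWeight,PrimaryDefaultLoad,SecondaryDefaultLoad'."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(parts)}: {spec!r}")
    name, weight, primary, secondary = parts
    if weight not in VALID_WEIGHTS:
        raise ValueError(f"unknown weight {weight!r} in {spec!r}")
    return MetricSpec(name, weight, int(primary), int(secondary))

specs = [parse_metric(s) for s in ("MemoryInMb,High,21,11", "PrimaryCount,Medium,1,0")]
for spec in specs:
    print(spec)
```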

Let's see what one possible cluster layout could look like:


Some things that are worth noting:

  • Secondary replicas within a partition can each have their own load
  • Overall the metrics look balanced. For Memory, the ratio between the maximum and minimum load is 1.75 (the node with the most load is N3, the least is N2, and 28/16 = 1.75).
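The ratio quoted above is simply the maximum node load divided by the minimum node load. A minimal Python sketch, using only the two node totals stated in the text (the loads of the other nodes are not given here, so they are left out):

```python
def balance_ratio(node_loads):
    """Ratio between the most and least loaded node for one metric.

    Raises ZeroDivisionError if the least loaded node has a load of 0.
    """
    loads = list(node_loads.values())
    return max(loads) / min(loads)

# Memory totals from the text: N3 carries the most load, N2 the least.
memory_by_node = {"N2": 16, "N3": 28}
ratio = balance_ratio(memory_by_node)
print(ratio)
```

Adding nodes whose load lies between 16 and 28 would not change the result, because only the extremes enter the ratio.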

There are some things that we still need to explain:

  • What determines whether a ratio of 1.75 is reasonable or not? How does the Cluster Resource Manager know if that's good enough or if there is more work to do?
  • When does balancing happen?
  • What does it mean that Memory was weighted "High"?

Metric weights

Tracking the same metrics across different services is important. That global view is what allows the Cluster Resource Manager to track consumption in the cluster, balance consumption across nodes, and ensure that nodes don't go over capacity. However, services may have different views as to the importance of the same metric. Also, in a cluster with many metrics and lots of services, perfectly balanced solutions may not exist for all metrics. How should the Cluster Resource Manager handle these situations?

Metric weights allow the Cluster Resource Manager to decide how to balance the cluster when there's no perfect answer. Metric weights also allow the Cluster Resource Manager to balance specific services differently. Metrics can have four different weight levels: Zero, Low, Medium, and High. A metric with a weight of Zero contributes nothing when considering whether things are balanced or not. However, its load does still contribute to capacity management. Metrics with Zero weight are still useful and are frequently used as a part of service behavior and performance monitoring. This article provides more information on the use of metrics for monitoring and diagnostics of your services.

The real impact of different metric weights in the cluster is that the Cluster Resource Manager generates different solutions. Metric weights tell the Cluster Resource Manager that certain metrics are more important than others. When there's no perfect solution, the Cluster Resource Manager can prefer solutions that balance the higher-weighted metrics better. If a service considers a particular metric unimportant, it may find its use of that metric imbalanced. This allows another service to get an even distribution of some metric that is important to it.

Let's look at an example of some load reports and how different metric weights result in different allocations in the cluster. In this example, we see that switching the relative weight of the metrics causes the Cluster Resource Manager to create different arrangements of services.


In this example, there are four different services, all reporting different values for two different metrics, MetricA and MetricB. In one case, all the services define MetricA as the most important one (Weight = High) and MetricB as unimportant (Weight = Low). As a result, we see that the Cluster Resource Manager places the services so that MetricA is better balanced than MetricB. "Better balanced" means that MetricA has a lower standard deviation than MetricB. In the second case, we reverse the metric weights. As a result, the Cluster Resource Manager swaps services A and B to come up with an allocation where MetricB is better balanced than MetricA.
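"Better balanced" in the sense used above can be checked by comparing the standard deviation of per-node totals for each metric. The Python sketch below uses made-up node totals purely for illustration; they are not the values from the example, and the comparison is a simplification of whatever scoring the Cluster Resource Manager actually performs.

```python
from statistics import pstdev

# Hypothetical per-node totals for each metric (not taken from the article).
metric_a = {"N1": 10, "N2": 11, "N3": 9}
metric_b = {"N1": 4, "N2": 15, "N3": 8}

def spread(node_loads):
    """Population standard deviation of the per-node loads for one metric."""
    return pstdev(list(node_loads.values()))

better_balanced = "MetricA" if spread(metric_a) < spread(metric_b) else "MetricB"
print(better_balanced, round(spread(metric_a), 3), round(spread(metric_b), 3))
```

With MetricA weighted High, an arrangement like this one is the kind the text describes: MetricA is spread tightly across nodes while MetricB is allowed to vary more.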


Metric weights determine how the Cluster Resource Manager should balance, but not when balancing should happen. For more information on balancing, check out this article.

Global metric weights

Let's say ServiceA defines MetricA as weight High, and ServiceB sets the weight for MetricA to Low or Zero. What's the actual weight that ends up getting used?

There are multiple weights that are tracked for every metric. The first weight is the one defined for the metric when the service is created. The other weight is a global weight, which is computed automatically. The Cluster Resource Manager uses both of these weights when scoring solutions. Taking both weights into account is important: it allows the Cluster Resource Manager to balance each service according to its own priorities, and also to ensure that the cluster as a whole is allocated correctly.

What would happen if the Cluster Resource Manager didn't care about both global and local balance? Well, it's easy to construct solutions that are globally balanced but that result in poor resource balance for individual services. In the following example, let's look at a service configured with just the default metrics, and see what happens when only global balance is considered:


In the top example, based only on global balance, the cluster as a whole is indeed balanced. All nodes have the same count of primaries and the same number of total replicas. However, if you look at the actual impact of this allocation, it's not so good: the loss of any node impacts a particular workload disproportionately, because it takes out all of its primaries. For example, if the first node fails, the three primaries for the three different partitions of the Circle service would all be lost. In contrast, the Triangle and Hexagon services would each have their partitions lose only a replica. This causes no disruption, other than having to recover the down replica.

In the bottom example, the Cluster Resource Manager has distributed the replicas based on both the global and per-service balance. When calculating the score of the solution, it gives most of the weight to the global solution and a (configurable) portion to individual services. Global balance for a metric is calculated based on the average of the metric weights from each service. Each service is balanced according to its own defined metric weights. This ensures that the services are balanced within themselves according to their own needs. As a result, if the same first node fails, the failure is distributed across all partitions of all services. The impact to each is the same.
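The statement that the global weight is based on the average of the per-service weights can be sketched as follows. The numeric values assigned to Zero, Low, Medium, and High below are hypothetical and chosen only so that an average can be computed; this article does not specify the scale the Cluster Resource Manager uses internally.

```python
# Hypothetical numeric scale for the four weight levels (an assumption for illustration).
WEIGHT_VALUES = {"Zero": 0.0, "Low": 1.0, "Medium": 2.0, "High": 3.0}

def global_weight(service_weights):
    """Average of the weights that each service assigns to one metric."""
    values = [WEIGHT_VALUES[level] for level in service_weights.values()]
    return sum(values) / len(values)

# ServiceA treats MetricA as High, ServiceB treats it as Low.
weights_for_metric_a = {"ServiceA": "High", "ServiceB": "Low"}
metric_a_global = global_weight(weights_for_metric_a)
print(metric_a_global)
```

In this model each service still keeps its own weight for per-service balance, while the averaged value is what applies to the metric across the whole cluster.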

Next steps

  • For more information on configuring services, Learn about configuring Services(service-fabric-cluster-resource-manager-configure-services.md)
  • Defining Defragmentation Metrics is one way to consolidate load on nodes instead of spreading it out. To learn how to configure defragmentation, refer to this article
  • To find out about how the Cluster Resource Manager manages and balances load in the cluster, check out the article on balancing load
  • Start from the beginning and get an Introduction to the Service Fabric Cluster Resource Manager
  • Movement Cost is one way of signaling to the Cluster Resource Manager that certain services are more expensive to move than others. To learn more about movement cost, refer to this article