Describe a Service Fabric cluster by using Cluster Resource Manager

The Cluster Resource Manager feature of Azure Service Fabric provides several mechanisms for describing a cluster:

  • Fault domains
  • Upgrade domains
  • Node properties
  • Node capacities

During runtime, Cluster Resource Manager uses this information to ensure high availability of the services running in the cluster. While enforcing these important rules, it also tries to optimize resource consumption within the cluster.

Fault domains

A fault domain is any area of coordinated failure. A single machine is a fault domain. It can fail on its own for various reasons, from power supply failures to drive failures to bad NIC firmware.

Machines connected to the same Ethernet switch are in the same fault domain. So are machines that share a single source of power or that are in a single physical location.

Because it's natural for hardware faults to overlap, fault domains are inherently hierarchical. They're represented as URIs in Service Fabric.
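
For example, using the datacenter/rack hierarchy from the cluster manifest shown later in this article, two machines in different racks of the same datacenter would carry fault domain URIs like these:

    fd:/DC01/Rack01
    fd:/DC01/Rack02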

It's important that fault domains are set up correctly because Service Fabric uses this information to safely place services. Service Fabric doesn't want to place services such that the loss of a fault domain (caused by the failure of some component) causes a service to go down.

In the Azure environment, Service Fabric uses the fault domain information provided by the environment to correctly configure the nodes in the cluster on your behalf. For standalone instances of Service Fabric, fault domains are defined at the time that the cluster is set up.

Warning

It's important that the fault domain information provided to Service Fabric is accurate. For example, let's say that your Service Fabric cluster's nodes are running inside 10 virtual machines, running on 5 physical hosts. In this case, even though there are 10 virtual machines, there are only 5 different (top level) fault domains. Sharing the same physical host causes VMs to share the same root fault domain, because the VMs experience coordinated failure if their physical host fails.

Service Fabric expects the fault domain of a node not to change. Other mechanisms of ensuring high availability of the VMs, such as HA-VMs, might cause conflicts with Service Fabric. These mechanisms use transparent migration of VMs from one host to another. They don't reconfigure or notify the running code inside the VM. As such, they're not supported as environments for running Service Fabric clusters.

Service Fabric should be the only high-availability technology employed. Mechanisms like live VM migration and SANs are not necessary. If these mechanisms are used in conjunction with Service Fabric, they reduce application availability and reliability. The reason is that they introduce additional complexity, add centralized sources of failure, and use reliability and availability strategies that conflict with those in Service Fabric.

In the following graphic, we color all the entities that contribute to fault domains and list all the different fault domains that result. In this example, we have datacenters ("DC"), racks ("R"), and blades ("B"). If each blade holds more than one virtual machine, there might be another layer in the fault domain hierarchy.

(Image: Nodes organized by fault domains)

During runtime, Service Fabric Cluster Resource Manager considers the fault domains in the cluster and plans layouts. The stateful replicas or stateless instances for a service are distributed so they're in separate fault domains. Distributing the service across fault domains ensures that the availability of the service isn't compromised when a fault domain fails at any level of the hierarchy.

Cluster Resource Manager doesn't care how many layers there are in the fault domain hierarchy. It tries to ensure that the loss of any one portion of the hierarchy doesn't affect services running in it.

It's best if the same number of nodes is at each level of depth in the fault domain hierarchy. If the "tree" of fault domains is unbalanced in your cluster, it's harder for Cluster Resource Manager to figure out the best allocation of services. Imbalanced fault domain layouts mean that the loss of some domains affects the availability of services more than other domains. As a result, Cluster Resource Manager is torn between two goals:

  • It wants to use the machines in that "heavy" domain by placing services on them.
  • It wants to place services in other domains so that the loss of a domain doesn't cause problems.

What do imbalanced domains look like? The following diagram shows two different cluster layouts. In the first example, the nodes are distributed evenly across the fault domains. In the second example, one fault domain has many more nodes than the other fault domains.

(Image: Two different cluster layouts)

In Azure, the choice of which fault domain contains a node is managed for you. But depending on the number of nodes that you provision, you can still end up with fault domains that have more nodes in them than in others.

For example, say you have five fault domains in the cluster but provision seven nodes for a node type (NodeType). In this case, the first two fault domains end up with more nodes. If you continue to deploy more NodeType instances with only a couple of instances, the problem gets worse. For this reason, we recommend that the number of nodes in each node type is a multiple of the number of fault domains.
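
The arithmetic behind that recommendation is straightforward. Assuming the nodes are assigned to fault domains round-robin (which is what produces the "first two fault domains end up with more nodes" effect described above), a quick sketch of the spread looks like this:

using System;

// Round-robin spread of 7 nodes of one node type across 5 fault domains.
int nodes = 7, faultDomains = 5;
for (int fd = 0; fd < faultDomains; fd++)
{
    // The first (nodes % faultDomains) domains each get one extra node.
    int nodesInThisFd = nodes / faultDomains + (fd < nodes % faultDomains ? 1 : 0);
    Console.WriteLine($"FD{fd}: {nodesInThisFd} node(s)");  // FD0: 2, FD1: 2, FD2-FD4: 1
}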

Upgrade domains

Upgrade domains are another feature that helps Service Fabric Cluster Resource Manager understand the layout of the cluster. Upgrade domains define sets of nodes that are upgraded at the same time. Upgrade domains help Cluster Resource Manager understand and orchestrate management operations like upgrades.

Upgrade domains are a lot like fault domains, but with a couple of key differences. First, areas of coordinated hardware failures define fault domains. Upgrade domains, on the other hand, are defined by policy. You get to decide how many you want, instead of letting the environment dictate the number. You can have as many upgrade domains as you do nodes. Another difference between fault domains and upgrade domains is that upgrade domains are not hierarchical. Instead, they're more like a simple tag.

The following diagram shows three upgrade domains striped across three fault domains. It also shows one possible placement for three different replicas of a stateful service, where each ends up in different fault and upgrade domains. This placement allows the loss of a fault domain while in the middle of a service upgrade and still have one copy of the code and data.

(Image: Layout with fault domains and upgrade domains)

For example, if you have five upgrade domains, the nodes in each are handling roughly 20 percent of your traffic. If you need to take down an upgrade domain for an upgrade, that load usually needs to go somewhere. Because you have four remaining upgrade domains, each must have room for about 5 percent of the total traffic. More upgrade domains mean that you need less buffer on the nodes in the cluster.

Consider if you had 10 upgrade domains instead. In that case, each upgrade domain would be handling only about 10 percent of the total traffic. When an upgrade steps through the cluster, each domain would need to have room for only about 1.1 percent of the total traffic. More upgrade domains generally allow you to run your nodes at higher utilization, because you need less reserved capacity. The same is true for fault domains.
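
The headroom numbers above follow from spreading one domain's share of traffic across the remaining domains. A minimal sketch of that arithmetic, assuming traffic is split evenly across the upgrade domains:

using System;

// Extra headroom each remaining upgrade domain needs while one domain is down for an upgrade.
static double HeadroomPerDomain(int upgradeDomainCount) =>
    (1.0 / upgradeDomainCount) / (upgradeDomainCount - 1);

Console.WriteLine(HeadroomPerDomain(5));   // 0.05   -> about 5 percent of total traffic
Console.WriteLine(HeadroomPerDomain(10));  // ~0.011 -> about 1.1 percent of total traffic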

The downside of having many upgrade domains is that upgrades tend to take longer. Service Fabric waits a short period after an upgrade domain is completed and performs checks before starting to upgrade the next one. These delays enable detecting issues introduced by the upgrade before the upgrade proceeds. The tradeoff is acceptable because it prevents bad changes from affecting too much of the service at a time.

The presence of too few upgrade domains has many negative side effects. While each upgrade domain is down and being upgraded, a large portion of your overall capacity is unavailable. For example, if you have only three upgrade domains, you're taking down about one-third of your overall service or cluster capacity at a time. Having so much of your service down at once isn't desirable because you need enough capacity in the rest of your cluster to handle the workload. Maintaining that buffer means that during normal operation, those nodes are less loaded than they would be otherwise. This increases the cost of running your service.

There's no real limit to the total number of fault or upgrade domains in an environment, or constraints on how they overlap. But there are common patterns:

  • Fault domains and upgrade domains mapped 1:1
  • One upgrade domain per node (physical or virtual OS instance)
  • A "striped" or "matrix" model where the fault domains and upgrade domains form a matrix with machines usually running down the diagonals

(Image: Layouts of fault domains and upgrade domains)

There's no best answer for which layout to choose. Each has pros and cons. For example, the 1FD:1UD model is simple to set up. The model of one upgrade domain per node is most like what people are used to. During upgrades, each node is updated independently. This is similar to how small sets of machines were upgraded manually in the past.

The most common model is the FD/UD matrix, where the fault domains and upgrade domains form a table and nodes are placed starting along the diagonal. This is the model used by default in Service Fabric clusters in Azure. For clusters with many nodes, everything ends up looking like a dense matrix pattern.

Note

Service Fabric clusters hosted in Azure don't support changing the default strategy. Only standalone clusters offer that customization.

Fault and upgrade domain constraints and resulting behavior

Default approach

By default, Cluster Resource Manager keeps services balanced across fault and upgrade domains. This is modeled as a constraint. The constraint for fault and upgrade domains states: "For a given service partition, there should never be a difference greater than one in the number of service objects (stateless service instances or stateful service replicas) between any two domains on the same level of hierarchy."

Let's say that this constraint provides a "maximum difference" guarantee. The constraint for fault and upgrade domains prevents certain moves or arrangements that violate the rule.
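
As a minimal sketch (not Service Fabric source code), the "maximum difference" rule for one partition at one level of the domain hierarchy can be checked like this:

using System;
using System.Linq;

// Does a partition's replica spread satisfy the "maximum difference" rule?
static bool SatisfiesMaxDifference(int[] replicasPerDomain) =>
    replicasPerDomain.Length == 0 ||
    replicasPerDomain.Max() - replicasPerDomain.Min() <= 1;

// Per-fault-domain totals taken from the tables below:
Console.WriteLine(SatisfiesMaxDifference(new[] { 1, 1, 1, 1, 1 }));  // True  (valid layout)
Console.WriteLine(SatisfiesMaxDifference(new[] { 2, 0, 1, 1, 1 }));  // False (difference of 2)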

For example, let's say that we have a cluster with six nodes, configured with five fault domains and five upgrade domains.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
| --- | --- | --- | --- | --- | --- |
| UD0 | N1  |     |     |     |     |
| UD1 | N6  | N2  |     |     |     |
| UD2 |     |     | N3  |     |     |
| UD3 |     |     |     | N4  |     |
| UD4 |     |     |     |     | N5  |

Now let's say that we create a service with a TargetReplicaSetSize (or, for a stateless service, InstanceCount) value of five. The replicas land on N1-N5. In fact, N6 is never used no matter how many services like this you create. But why? Let's look at the difference between the current layout and what would happen if N6 is chosen.

Here's the layout we got and the total number of replicas per fault and upgrade domain:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
| ------- | --- | --- | --- | --- | --- | ------- |
| UD0     | R1  |     |     |     |     | 1       |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     |     |     | R3  |     |     | 1       |
| UD3     |     |     |     | R4  |     | 1       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

This layout is balanced in terms of nodes per fault domain and upgrade domain. It's also balanced in terms of the number of replicas per fault and upgrade domain. Each domain has the same number of nodes and the same number of replicas.

Now, let's look at what would happen if we'd used N6 instead of N2. How would the replicas be distributed then?

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
| ------- | --- | --- | --- | --- | --- | ------- |
| UD0     | R1  |     |     |     |     | 1       |
| UD1     | R5  |     |     |     |     | 1       |
| UD2     |     |     | R2  |     |     | 1       |
| UD3     |     |     |     | R3  |     | 1       |
| UD4     |     |     |     |     | R4  | 1       |
| FDTotal | 2   | 0   | 1   | 1   | 1   | -       |

This layout violates our definition of the "maximum difference" guarantee for the fault domain constraint. FD0 has two replicas, whereas FD1 has zero. The difference between FD0 and FD1 is a total of two, which is greater than the maximum difference of one. Because the constraint is violated, Cluster Resource Manager does not allow this arrangement. Similarly, if we picked N2 and N6 (instead of N1 and N2), we'd get:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
| ------- | --- | --- | --- | --- | --- | ------- |
| UD0     |     |     |     |     |     | 0       |
| UD1     | R5  | R1  |     |     |     | 2       |
| UD2     |     |     | R2  |     |     | 1       |
| UD3     |     |     |     | R3  |     | 1       |
| UD4     |     |     |     |     | R4  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

This layout is balanced in terms of fault domains. But now it's violating the upgrade domain constraint, because UD0 has zero replicas and UD1 has two. This layout is also invalid and won't be picked by Cluster Resource Manager.

This approach to the distribution of stateful replicas or stateless instances provides the best possible fault tolerance. If one domain goes down, the minimal number of replicas/instances is lost.

On the other hand, this approach can be too strict and not allow the cluster to utilize all resources. For certain cluster configurations, certain nodes can't be used. This can cause Service Fabric to not place your services, resulting in warning messages. In the previous example, some of the cluster nodes can't be used (N6 in the example). Even if you added nodes to that cluster (N7-N10), replicas/instances would be placed only on N1-N5 because of constraints on fault and upgrade domains.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
| --- | --- | --- | --- | --- | --- |
| UD0 | N1  |     |     |     | N10 |
| UD1 | N6  | N2  |     |     |     |
| UD2 | N7  |     | N3  |     |     |
| UD3 | N8  |     |     | N4  |     |
| UD4 | N9  |     |     |     | N5  |

Alternative approach

Cluster Resource Manager supports another version of the constraint for fault and upgrade domains. It allows placement while still guaranteeing a minimum level of safety. The alternative constraint can be stated as follows: "For a given service partition, replica distribution across domains should ensure that the partition does not suffer a quorum loss." Let's say that this constraint provides a "quorum safe" guarantee.

Note

For a stateful service, we define quorum loss as a situation in which a majority of the partition replicas are down at the same time. For example, if TargetReplicaSetSize is five, a set of any three replicas represents quorum. Similarly, if TargetReplicaSetSize is six, four replicas are necessary for quorum. In both cases, no more than two replicas can be down at the same time if the partition wants to continue functioning normally.

For a stateless service, there's no such thing as quorum loss. Stateless services continue to function normally even if a majority of instances go down at the same time. So, we'll focus on stateful services in the rest of this article.
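
A quick sketch of the quorum arithmetic in the note above: quorum is a simple majority of the target replica set size.

using System;

static int QuorumSize(int targetReplicaSetSize) => targetReplicaSetSize / 2 + 1;

Console.WriteLine(QuorumSize(5));  // 3 -> at most 2 replicas can be down at once
Console.WriteLine(QuorumSize(6));  // 4 -> at most 2 replicas can be down at once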

Let's go back to the previous example. With the "quorum safe" version of the constraint, all three layouts would be valid. Even if FD0 failed in the second layout or UD1 failed in the third layout, the partition would still have quorum. (A majority of the replicas would still be up.) With this version of the constraint, N6 can almost always be utilized.

The "quorum safe" approach provides more flexibility than the "maximum difference" approach. The reason is that it's easier to find replica distributions that are valid in almost any cluster topology. However, this approach can't guarantee the best fault tolerance characteristics, because some failures are worse than others.

In the worst case scenario, a majority of the replicas can be lost with the failure of one domain and one additional replica. For example, instead of three failures being required to lose quorum with five replicas or instances, you can now lose a majority with just two failures.

Adaptive approach

Because both approaches have strengths and weaknesses, we've introduced an adaptive approach that combines these two strategies.

Note

This is the default behavior starting with Service Fabric version 6.2.

The adaptive approach uses the "maximum difference" logic by default and switches to the "quorum safe" logic only when necessary. Cluster Resource Manager automatically figures out which strategy is necessary by looking at how the cluster and services are configured.

Cluster Resource Manager uses the "quorum based" logic for a service when both of these conditions are true (a short sketch of this check follows the list):

  • TargetReplicaSetSize for the service is evenly divisible by the number of fault domains and the number of upgrade domains.
  • The number of nodes is less than or equal to the number of fault domains multiplied by the number of upgrade domains.
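
A minimal sketch (not Service Fabric source code) of that decision, using the two conditions above:

using System;

// When does Cluster Resource Manager switch a service to the "quorum based" logic?
static bool UsesQuorumBasedLogic(int targetReplicaSetSize, int faultDomains, int upgradeDomains, int nodeCount) =>
    targetReplicaSetSize % faultDomains == 0 &&
    targetReplicaSetSize % upgradeDomains == 0 &&
    nodeCount <= faultDomains * upgradeDomains;

// The example below: TargetReplicaSetSize = 5, five fault domains, five upgrade domains, eight nodes.
Console.WriteLine(UsesQuorumBasedLogic(5, 5, 5, 8));  // True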

Bear in mind that Cluster Resource Manager will use this approach for both stateless and stateful services, even though quorum loss isn't relevant for stateless services.

Let's go back to the previous example and assume that a cluster now has eight nodes. The cluster is still configured with five fault domains and five upgrade domains, and the TargetReplicaSetSize value of a service hosted on that cluster remains five.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
| --- | --- | --- | --- | --- | --- |
| UD0 | N1  |     |     |     |     |
| UD1 | N6  | N2  |     |     |     |
| UD2 | N7  |     | N3  |     |     |
| UD3 | N8  |     |     | N4  |     |
| UD4 |     |     |     |     | N5  |

Because all necessary conditions are satisfied, Cluster Resource Manager will use the "quorum based" logic in distributing the service. This enables usage of N6-N8. One possible service distribution in this case might look like this:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
| ------- | --- | --- | --- | --- | --- | ------- |
| UD0     | R1  |     |     |     |     | 1       |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     | R3  |     | R4  |     |     | 2       |
| UD3     |     |     |     |     |     | 0       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 2   | 1   | 1   | 0   | 1   | -       |

If your service's TargetReplicaSetSize value is reduced to four (for example), Cluster Resource Manager will notice that change. It will resume using the "maximum difference" logic because TargetReplicaSetSize is no longer evenly divisible by the number of fault domains and upgrade domains. As a result, certain replica movements will occur to distribute the remaining four replicas on nodes N1-N5. That way, the "maximum difference" version of the fault domain and upgrade domain logic is not violated.

In the previous layout, if the TargetReplicaSetSize value is five and N1 is removed from the cluster, the number of upgrade domains becomes equal to four. Again, Cluster Resource Manager starts using "maximum difference" logic because the number of upgrade domains doesn't evenly divide the service's TargetReplicaSetSize value anymore. As a result, replica R1, when built again, has to land on N4 so that the constraint for the fault and upgrade domain is not violated.

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
| ------- | --- | --- | --- | --- | --- | ------- |
| UD0     | N/A | N/A | N/A | N/A | N/A | N/A     |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     | R3  |     | R4  |     |     | 2       |
| UD3     |     |     |     | R1  |     | 1       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

Configuring fault and upgrade domains

In Azure-hosted Service Fabric deployments, fault domains and upgrade domains are defined automatically. Service Fabric picks up and uses the environment information from Azure.

If you're creating your own cluster (or want to run a particular topology in development), you can provide the fault domain and upgrade domain information yourself. In this example, we define a nine-node local development cluster that spans three datacenters (each with three racks). This cluster also has three upgrade domains striped across those three datacenters. Here's an example of the configuration in ClusterManifest.xml:

  <Infrastructure>
    <!-- IsScaleMin indicates that this cluster runs on one box/one single server -->
    <WindowsServer IsScaleMin="true">
      <NodeList>
        <Node NodeName="Node01" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType01" FaultDomain="fd:/DC01/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node02" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType02" FaultDomain="fd:/DC01/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node03" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType03" FaultDomain="fd:/DC01/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
        <Node NodeName="Node04" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType04" FaultDomain="fd:/DC02/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node05" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType05" FaultDomain="fd:/DC02/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node06" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType06" FaultDomain="fd:/DC02/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
        <Node NodeName="Node07" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType07" FaultDomain="fd:/DC03/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node08" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType08" FaultDomain="fd:/DC03/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node09" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType09" FaultDomain="fd:/DC03/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
      </NodeList>
    </WindowsServer>
  </Infrastructure>

This example uses ClusterConfig.json for standalone deployments:

"nodes": [
  {
    "nodeName": "vm1",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm2",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm3",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD3"
  },
  {
    "nodeName": "vm4",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm5",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm6",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD3"
  },
  {
    "nodeName": "vm7",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm8",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm9",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD3"
  }
],

Note

When you're defining clusters via Azure Resource Manager, Azure assigns fault domains and upgrade domains. So the definition of your node types and virtual machine scale sets in your Azure Resource Manager template doesn't include information about fault domain or upgrade domain.

Node properties and placement constraints

Sometimes (in fact, most of the time) you'll want to ensure that certain workloads run only on certain types of nodes in the cluster. For example, some workloads might require GPUs or SSDs, and others might not.

A great example of targeting hardware to particular workloads is almost every n-tier architecture. Certain machines serve as the front end or API-serving side of the application and are exposed to the clients or the internet. Different machines, often with different hardware resources, handle the work of the compute or storage layers. These are usually not directly exposed to clients or the internet.

Service Fabric expects that in some cases, particular workloads might need to run on particular hardware configurations. For example:

  • An existing n-tier application has been "lifted and shifted" into a Service Fabric environment.
  • A workload must be run on specific hardware for performance, scale, or security isolation reasons.
  • A workload should be isolated from other workloads for policy or resource consumption reasons.

To support these sorts of configurations, Service Fabric includes tags that you can apply to nodes. These tags are called node properties. Placement constraints are the statements attached to individual services that select for one or more node properties. Placement constraints define where services should run. The set of constraints is extensible. Any key/value pair can work.

(Image: Different workloads in a cluster layout)

Built-in node properties

Service Fabric defines some default node properties that can be used automatically so you don't have to define them. The default properties defined at each node are NodeType and NodeName.

For example, you can write a placement constraint as "(NodeType == NodeType03)". NodeType is a commonly used property. It's useful because it corresponds 1:1 with a type of machine. Each type of machine corresponds to a type of workload in a traditional n-tier application.

(Image: Placement constraints and node properties)

Placement constraints and node property syntax

The value specified in the node property can be a string, Boolean, or signed long. The statement at the service is called a placement constraint because it constrains where the service can run in the cluster. The constraint can be any Boolean statement that operates on the node properties in the cluster. The valid selectors in these Boolean statements are:

  • Conditional checks for creating particular statements:

    Statement                     Syntax
    "equal to"                    "=="
    "not equal to"                "!="
    "greater than"                ">"
    "greater than or equal to"    ">="
    "less than"                   "<"
    "less than or equal to"       "<="
  • Boolean statements for grouping and logical operations:

    Statement                     Syntax
    "and"                         "&&"
    "or"                          "||"
    "not"                         "!"
    "group as single statement"   "()"

Here are some examples of basic constraint statements:

  • "Value >= 5"
  • "NodeColor != green"
  • "((OneProperty < 100) || ((AnotherProperty == false) && (OneProperty >= 100)))"

Only nodes where the overall placement constraint statement evaluates to "True" can have the service placed on them. Nodes that don't have a property defined don't match any placement constraint that contains that property.

Let's say that the following node properties were defined for a node type in ClusterManifest.xml:

    <NodeType Name="NodeType01">
      <PlacementProperties>
        <Property Name="HasSSD" Value="true"/>
        <Property Name="NodeColor" Value="green"/>
        <Property Name="SomeProperty" Value="5"/>
      </PlacementProperties>
    </NodeType>

The following example shows node properties defined via ClusterConfig.json for standalone deployments or Template.json for Azure-hosted clusters.

Note

In your Azure Resource Manager template, the node type is usually parameterized. It would look like "[parameters('vmNodeType1Name')]" rather than NodeType01.

"nodeTypes": [
    {
        "name": "NodeType01",
        "placementProperties": {
            "HasSSD": "true",
            "NodeColor": "green",
            "SomeProperty": "5"
        }
    }
],

You can create service placement constraints for a service as follows:

FabricClient fabricClient = new FabricClient();
StatefulServiceDescription serviceDescription = new StatefulServiceDescription();
serviceDescription.PlacementConstraints = "(HasSSD == true && SomeProperty >= 4)";
// Add other required ServiceDescription fields
//...
await fabricClient.ServiceManager.CreateServiceAsync(serviceDescription);
New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceType -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -PlacementConstraint "HasSSD == true && SomeProperty >= 4"

If all nodes of NodeType01 are valid, you can also select that node type with the constraint "(NodeType == NodeType01)".

A service's placement constraints can be updated dynamically during runtime. If you need to, you can move a service around in the cluster, add and remove requirements, and so on. Service Fabric ensures that the service stays up and available even when these types of changes are made.

StatefulServiceUpdateDescription updateDescription = new StatefulServiceUpdateDescription();
updateDescription.PlacementConstraints = "NodeType == NodeType01";
await fabricClient.ServiceManager.UpdateServiceAsync(new Uri("fabric:/app/service"), updateDescription);
Update-ServiceFabricService -Stateful -ServiceName $serviceName -PlacementConstraints "NodeType == NodeType01"

Placement constraints are specified for every named service instance. Updates always take the place of (overwrite) what was previously specified.

The cluster definition defines the properties on a node. Changing a node's properties requires an upgrade to the cluster configuration. Upgrading a node's properties requires each affected node to restart to report its new properties. Service Fabric manages these rolling upgrades.

Describing and managing cluster resources

One of the most important jobs of any orchestrator is to help manage resource consumption in the cluster. Managing cluster resources can mean a couple of different things.

First, there's ensuring that machines are not overloaded. This means making sure that machines aren't running more services than they can handle.

Second, there's balancing and optimization, which are critical to running services efficiently. Cost-effective or performance-sensitive service offerings can't allow some nodes to be hot while others are cold. Hot nodes lead to resource contention and poor performance. Cold nodes represent wasted resources and increased costs.

Service Fabric represents resources as metrics. Metrics are any logical or physical resource that you want to describe to Service Fabric. Examples of metrics are "WorkQueueDepth" or "MemoryInMb". For information about the physical resources that Service Fabric can govern on nodes, see Resource governance. For information on configuring custom metrics and their uses, see this article.

Metrics are different from placement constraints and node properties. Node properties are static descriptors of the nodes themselves. Metrics describe resources that nodes have and that services consume when they run on a node. A node property might be HasSSD and might be set to true or false. The amount of space available on that SSD and how much is consumed by services would be a metric like "DriveSpaceInMb".

Just like for placement constraints and node properties, Service Fabric Cluster Resource Manager doesn't understand what the names of the metrics mean. Metric names are just strings. It's a good practice to declare units as a part of the metric names that you create when they might be ambiguous.

Capacity

If you turned off all resource balancing, Service Fabric Cluster Resource Manager would still ensure that no node goes over its capacity. Managing capacity overruns is possible unless the cluster is too full or the workload is larger than any node. Capacity is another constraint that Cluster Resource Manager uses to understand how much of a resource a node has. Remaining capacity is also tracked for the cluster as a whole.

Both the capacity and the consumption at the service level are expressed in terms of metrics. For example, the metric might be "ClientConnections" and a node might have a capacity for "ClientConnections" of 32,768. Other nodes can have other limits. A service running on that node can say it's currently consuming 32,256 of the metric "ClientConnections".

During runtime, Cluster Resource Manager tracks remaining capacity in the cluster and on nodes. To track capacity, Cluster Resource Manager subtracts each service's usage from a node's capacity where the service runs. With this information, Cluster Resource Manager can figure out where to place or move replicas so that nodes don't go over capacity.

(Image: Cluster nodes and capacities)
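
The following code (C#, with the PowerShell equivalent on the last line) shows how a service declares the metrics it consumes, including default loads and a weight, when the service is created: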

StatefulServiceDescription serviceDescription = new StatefulServiceDescription();
ServiceLoadMetricDescription metric = new ServiceLoadMetricDescription();
metric.Name = "ClientConnections";
metric.PrimaryDefaultLoad = 1024;
metric.SecondaryDefaultLoad = 0;
metric.Weight = ServiceLoadMetricWeight.High;
serviceDescription.Metrics.Add(metric);
await fabricClient.ServiceManager.CreateServiceAsync(serviceDescription);
New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceTypeName -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -Metric @("ClientConnections,High,1024,0")

You can see capacities defined in the cluster manifest. Here's an example for ClusterManifest.xml:

    <NodeType Name="NodeType03">
      <Capacities>
        <Capacity Name="ClientConnections" Value="65536"/>
      </Capacities>
    </NodeType>

Here's an example of capacities defined via ClusterConfig.json for standalone deployments or Template.json for Azure-hosted clusters:

"nodeTypes": [
    {
        "name": "NodeType03",
        "capacities": {
            "ClientConnections": "65536",
        }
    }
],

A service's load often changes dynamically. Say that a replica's load of "ClientConnections" changed from 1,024 to 2,048. The node that it was running on then had a capacity of only 512 remaining for that metric. Now that replica or instance's placement is invalid, because there's not enough room on that node. Cluster Resource Manager has to get the node back below capacity. It reduces load on the node that's over capacity by moving one or more of the replicas or instances from that node to other nodes.
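
Dynamic load like this is typically reported by the service itself. Here's a minimal sketch, assuming a stateful Reliable Service (a class deriving from StatefulService) and the LoadMetric type from System.Fabric; the value 2,048 is just the illustrative figure from the paragraph above:

// Inside the service (for example, in RunAsync). Requires "using System.Fabric;" and
// "using System.Collections.Generic;" at the top of the file.
this.Partition.ReportLoad(new List<LoadMetric>
{
    // Report this replica's current consumption of the "ClientConnections" metric.
    new LoadMetric("ClientConnections", 2048)
});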

Cluster Resource Manager tries to minimize the cost of moving replicas. You can learn more about movement cost and about rebalancing strategies and rules.

Cluster capacity

How does the Service Fabric Cluster Resource Manager keep the overall cluster from being too full? With dynamic load, there's not a lot it can do. Services can have their load spike independently of actions that Cluster Resource Manager takes. As a result, your cluster with plenty of headroom today might be underpowered if there's a spike tomorrow.

Controls in Cluster Resource Manager help prevent problems. The first thing you can do is prevent the creation of new workloads that would cause the cluster to become full.

Let's say that you create a stateless service, and it has some load associated with it. The service cares about the "DiskSpaceInMb" metric. The service will consume five units of "DiskSpaceInMb" for every instance of the service. You want to create three instances of the service. That means you need 15 units of "DiskSpaceInMb" to be present in the cluster for you to even create these service instances.

Cluster Resource Manager continually calculates the capacity and consumption of each metric so it can determine the remaining capacity in the cluster. If there isn't enough space, Cluster Resource Manager rejects the call to create a service.

Because the requirement is only that 15 units will be available, you can allocate this space in many different ways. For example, there might be one remaining unit of capacity on 15 different nodes, or three remaining units of capacity on five different nodes. If Cluster Resource Manager can rearrange things so there are five units available on three nodes, it places the service. Rearranging the cluster is usually possible unless the cluster is almost full or the existing services can't be consolidated for some reason.

Buffered capacity

Buffered capacity is another feature of Cluster Resource Manager. It allows reservation of some portion of the overall node capacity. This capacity buffer is used only to place services during upgrades and node failures.

Buffered capacity is specified globally per metric for all nodes. The value that you pick for the reserved capacity is a function of the number of fault and upgrade domains that you have in the cluster. More fault and upgrade domains mean that you can pick a lower number for your buffered capacity. If you have more domains, you can expect smaller amounts of your cluster to be unavailable during upgrades and failures. Specifying buffered capacity makes sense only if you have also specified the node capacity for a metric.
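
The effect of the buffer is simple arithmetic: the buffered capacity is the capacity multiplied by (1 - buffer percentage), and only that buffered portion is available for normal placement. A quick sketch, using the 0.15 value for "SomeMetric" from the configuration examples below and an assumed node capacity of 100 (the capacity value is hypothetical):

using System;

double nodeCapacity = 100;        // hypothetical node capacity for "SomeMetric"
double bufferPercentage = 0.15;   // matches the "SomeMetric" value in the configuration below
double bufferedCapacity = nodeCapacity * (1 - bufferPercentage);
Console.WriteLine(bufferedCapacity);  // 85 units usable for normal placement; 15 held in reserve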

Here's an example of how to specify buffered capacity in ClusterManifest.xml:

        <Section Name="NodeBufferPercentage">
            <Parameter Name="SomeMetric" Value="0.15" />
            <Parameter Name="SomeOtherMetric" Value="0.20" />
        </Section>

Here's an example of how to specify buffered capacity via ClusterConfig.json for standalone deployments or Template.json for Azure-hosted clusters:

"fabricSettings": [
  {
    "name": "NodeBufferPercentage",
    "parameters": [
      {
          "name": "SomeMetric",
          "value": "0.15"
      },
      {
          "name": "SomeOtherMetric",
          "value": "0.20"
      }
    ]
  }
]

The creation of new services fails when the cluster is out of buffered capacity for a metric. Preventing the creation of new services to preserve the buffer ensures that upgrades and failures don't cause nodes to go over capacity. Buffered capacity is optional, but we recommend it in any cluster that defines a capacity for a metric.

Cluster Resource Manager exposes this load information. For each metric, this information includes:

  • The buffered capacity settings.
  • The total capacity.
  • The current consumption.
  • Whether each metric is considered balanced or not.
  • Statistics about the standard deviation.
  • The nodes that have the most and least load.

The following code shows an example of that output:

PS C:\Users\user> Get-ServiceFabricClusterLoadInformation
LastBalancingStartTimeUtc : 9/1/2016 12:54:59 AM
LastBalancingEndTimeUtc   : 9/1/2016 12:54:59 AM
LoadMetricInformation     :
                            LoadMetricName        : Metric1
                            IsBalancedBefore      : False
                            IsBalancedAfter       : False
                            DeviationBefore       : 0.192450089729875
                            DeviationAfter        : 0.192450089729875
                            BalancingThreshold    : 1
                            Action                : NoActionNeeded
                            ActivityThreshold     : 0
                            ClusterCapacity       : 189
                            ClusterLoad           : 45
                            ClusterRemainingCapacity : 144
                            NodeBufferPercentage  : 10
                            ClusterBufferedCapacity : 170
                            ClusterRemainingBufferedCapacity : 125
                            ClusterCapacityViolation : False
                            MinNodeLoadValue      : 0
                            MinNodeLoadNodeId     : 3ea71e8e01f4b0999b121abcbf27d74d
                            MaxNodeLoadValue      : 15
                            MaxNodeLoadNodeId     : 2cc648b6770be1bc9824fa995d5b68b1
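
The same information is available programmatically through FabricClient. A minimal sketch follows; the property names mirror the fields in the PowerShell output above and should be treated as assumptions to verify against the System.Fabric.Query reference documentation:

using System;
using System.Fabric;

var fabricClient = new FabricClient();
var loadInfo = await fabricClient.QueryManager.GetClusterLoadInformationAsync();
foreach (var metric in loadInfo.LoadMetricInformationList)
{
    // Print each metric's current load against the cluster-wide capacity.
    Console.WriteLine($"{metric.Name}: load {metric.ClusterLoad} of capacity {metric.ClusterCapacity} " +
                      $"({metric.ClusterRemainingCapacity} remaining)");
}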

Next steps