Describing a Service Fabric cluster

The Service Fabric Cluster Resource Manager provides several mechanisms for describing a cluster. During runtime, the Cluster Resource Manager uses this information to ensure high availability of the services running in the cluster. While enforcing these important rules, it also attempts to optimize the resource consumption within the cluster.

Key concepts

The Cluster Resource Manager supports several features that describe a cluster:

  • Fault domains
  • Upgrade domains
  • Node properties
  • Node capacities

Fault domains

A fault domain is any area of coordinated failure. A single machine is a fault domain, since it can fail on its own for various reasons, from power supply failures to drive failures to bad NIC firmware. Machines connected to the same Ethernet switch are in the same fault domain, as are machines sharing a single source of power or in a single location. Since it's natural for hardware faults to overlap, fault domains are inherently hierarchical and are represented as URIs in Service Fabric.
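For example, a fault domain URI names each level of the hierarchy from the top down. The segment names below are only illustrative (they follow the datacenter/rack/blade hierarchy pictured later in this article); the "fd:/DC01/Rack01" form also appears in the cluster manifest example at the end of this article:

fd:/DC01                 -- a datacenter
fd:/DC01/Rack01          -- a rack within that datacenter
fd:/DC01/Rack01/Blade03  -- a specific blade within that rack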

It is important that fault domains are set up correctly, since Service Fabric uses this information to safely place services. Service Fabric doesn't want to place services such that the loss of a fault domain (caused by the failure of some component) causes a service to go down. In the Azure environment, Service Fabric uses the fault domain information provided by the environment to correctly configure the nodes in the cluster on your behalf. For standalone Service Fabric clusters, fault domains are defined at the time the cluster is set up.

Warning

It is important that the fault domain information provided to Service Fabric is accurate. For example, say that your Service Fabric cluster's nodes are running inside 10 virtual machines, running on five physical hosts. In this case, even though there are 10 virtual machines, there are only five different (top level) fault domains. Sharing the same physical host causes VMs to share the same root fault domain, since the VMs experience coordinated failure if their physical host fails.

Service Fabric expects the fault domain of a node not to change. Other mechanisms for ensuring high availability of the VMs, such as HA-VMs, may conflict with Service Fabric because they use transparent migration of VMs from one host to another. These mechanisms do not reconfigure or notify the running code inside the VM. As such, they are not supported as environments for running Service Fabric clusters. Service Fabric should be the only high-availability technology employed. Mechanisms like live VM migration and SANs are not necessary. If used in conjunction with Service Fabric, these mechanisms reduce application availability and reliability, because they introduce additional complexity, add centralized sources of failure, and use reliability and availability strategies that conflict with those in Service Fabric.

In the graphic below, we color all the entities that contribute to fault domains and list all the different fault domains that result. In this example, we have datacenters ("DC"), racks ("R"), and blades ("B"). Conceivably, if each blade holds more than one virtual machine, there could be another layer in the fault domain hierarchy.

Nodes organized via fault domains

![Nodes organized via Fault Domains][Image1]

During runtime, the Service Fabric Cluster Resource Manager considers the fault domains in the cluster and plans layouts. The stateful replicas or stateless instances for a given service are distributed so that they are in separate fault domains. Distributing the service across fault domains ensures that its availability is not compromised when a fault domain fails at any level of the hierarchy.

Service Fabric's Cluster Resource Manager doesn't care how many layers there are in the fault domain hierarchy. However, it tries to ensure that the loss of any one portion of the hierarchy doesn't impact the services running in it.

It is best if there are the same number of nodes at each level of depth in the fault domain hierarchy. If the "tree" of fault domains is unbalanced in your cluster, it is harder for the Cluster Resource Manager to figure out the best allocation of services. An imbalanced fault domain layout means that the loss of some domains impacts the availability of services more than the loss of others. As a result, the Cluster Resource Manager is torn between two goals: it wants to use the machines in the "heavy" domain by placing services on them, and it wants to place services in other domains so that the loss of a single domain doesn't cause problems.

What do imbalanced domains look like? The diagram below shows two different cluster layouts. In the first example, the nodes are distributed evenly across the fault domains. In the second example, one fault domain has many more nodes than the others.

Two different cluster layouts

![Two different cluster layouts][Image2]

In Azure, the choice of which fault domain contains a node is managed for you. However, depending on the number of nodes you provision, you can still end up with some fault domains holding more nodes than others. For example, say you have five fault domains in the cluster but provision seven nodes for a given node type. In this case, the first two fault domains end up with more nodes. If you continue to deploy more node types with only a couple of instances each, the problem gets worse. For this reason, it's recommended that the number of nodes in each node type be a multiple of the number of fault domains.

Upgrade domains

Upgrade domains are another feature that helps the Service Fabric Cluster Resource Manager understand the layout of the cluster. Upgrade domains define sets of nodes that are upgraded at the same time. They help the Cluster Resource Manager understand and orchestrate management operations like upgrades.

Upgrade domains are a lot like fault domains, but with a couple of key differences. First, fault domains are defined by areas of coordinated hardware failure. Upgrade domains, on the other hand, are defined by policy: you get to decide how many you want, rather than having the environment dictate the number. You could have as many upgrade domains as you have nodes. Another difference is that upgrade domains are not hierarchical; they are more like a simple tag.

The following diagram shows three upgrade domains striped across three fault domains. It also shows one possible placement for the three replicas of a stateful service, where each ends up in a different fault and upgrade domain. This placement allows the loss of a fault domain while in the middle of a service upgrade, and still leaves one copy of the code and data.

Placement with fault and upgrade domains

![Placement With Fault and Upgrade Domains][Image3]

There are pros and cons to having large numbers of upgrade domains. More upgrade domains mean that each step of the upgrade is more granular and therefore affects a smaller number of nodes or services. As a result, fewer services have to move at a time, introducing less churn into the system. This tends to improve reliability, since less of the service is impacted by any issue introduced during the upgrade. More upgrade domains also mean that you need less available buffer on other nodes to handle the impact of the upgrade.

For example, if you have five upgrade domains, the nodes in each are handling roughly 20% of your traffic. If you need to take down an upgrade domain for an upgrade, that load usually needs to go somewhere. Since you have four remaining upgrade domains, each must have room for about 5% of the total traffic. If you had 10 upgrade domains instead, each UD would only be handling about 10% of the total traffic, and as an upgrade steps through the cluster each domain would only need room for about 1.1% of the total traffic. More upgrade domains therefore generally allow you to run your nodes at higher utilization, since less capacity needs to be reserved. The same is true for fault domains.
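A rough way to see where the 5% and 1.1% figures above come from: when one of N upgrade domains is taken down, its share of the traffic (about 100%/N) has to be absorbed by the remaining N - 1 domains. The sketch below just restates that arithmetic; it is an approximation, not how the Cluster Resource Manager actually computes capacity.

// Approximate extra headroom each remaining upgrade domain needs while one
// domain is down for an upgrade: (100% / N) spread across the other N - 1 domains.
Func<int, double> extraHeadroomPercent = n => (100.0 / n) / (n - 1);

// extraHeadroomPercent(5)  ≈ 5.0  -- each of the 4 remaining UDs absorbs about 5% of total traffic
// extraHeadroomPercent(10) ≈ 1.1  -- each of the 9 remaining UDs absorbs about 1.1% of total traffic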

The downside of having many upgrade domains is that upgrades tend to take longer. Service Fabric waits a short period of time after an upgrade domain is completed and performs checks before starting to upgrade the next one. These delays enable detecting issues introduced by the upgrade before the upgrade proceeds. The tradeoff is acceptable, because it prevents bad changes from affecting too much of the service at a time.

Too few upgrade domains has its own negative side effects: while each individual upgrade domain is down and being upgraded, a large portion of your overall capacity is unavailable. For example, if you only have three upgrade domains, you are taking down about a third of your overall service or cluster capacity at a time. Having so much of your service down at once isn't desirable, since you have to have enough capacity in the rest of your cluster to handle the workload. Maintaining that buffer means that during normal operation those nodes are less loaded than they would otherwise be, which increases the cost of running your service.

There's no real limit to the total number of fault or upgrade domains in an environment, nor any constraint on how they overlap. That said, there are several common patterns:

  • Fault domains and upgrade domains mapped 1:1
  • One upgrade domain per node (physical or virtual OS instance)
  • A "striped" or "matrix" model where the fault domains and upgrade domains form a matrix with machines usually running down the diagonals

Fault and upgrade domain layouts

![Fault and Upgrade Domain Layouts][Image4]

There's no best answer for which layout to choose; each has pros and cons. For example, the 1FD:1UD model is simple to set up. The model of one upgrade domain per node is most like what people are used to from the past: during upgrades, each node is updated independently, similar to how small sets of machines used to be upgraded manually.

The most common model is the FD/UD matrix, where the FDs and UDs form a table and nodes are placed starting along the diagonal. This is the model used by default in Service Fabric clusters in Azure. For clusters with many nodes, everything ends up looking like the dense matrix pattern above.

Fault and upgrade domain constraints and resulting behavior

Default approach

By default, the Cluster Resource Manager keeps services balanced across fault and upgrade domains. This is modeled as a constraint. The fault and upgrade domain constraint states: "For a given service partition, there should never be a difference greater than one in the number of service objects (stateless service instances or stateful service replicas) between any two domains on the same level of the hierarchy." Let's say this constraint provides a "maximum difference" guarantee. The fault and upgrade domain constraint prevents certain moves or arrangements that violate this rule.
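As a minimal sketch of what the rule means (an illustration only, not the Cluster Resource Manager's actual implementation), the check at one level of the hierarchy can be written as follows, where the input counts a partition's replicas in each domain at that level:

using System.Collections.Generic;
using System.Linq;

// The "maximum difference" rule holds at one level of the hierarchy when no two
// domains differ by more than one replica of the partition.
static bool SatisfiesMaxDifference(IEnumerable<int> replicasPerDomain) =>
    replicasPerDomain.Max() - replicasPerDomain.Min() <= 1;

// Layout 1 below: replicas per fault domain are { 1, 1, 1, 1, 1 } -> satisfied
// Layout 2 below: replicas per fault domain are { 2, 0, 1, 1, 1 } -> violated (difference of 2)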

Let's look at an example. Say that we have a cluster with six nodes, configured with five fault domains and five upgrade domains.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
|-----|-----|-----|-----|-----|-----|
| UD0 | N1  |     |     |     |     |
| UD1 | N6  | N2  |     |     |     |
| UD2 |     |     | N3  |     |     |
| UD3 |     |     |     | N4  |     |
| UD4 |     |     |     |     | N5  |

Configuration 1

Now let's say that we create a service with a TargetReplicaSetSize (or, for a stateless service, an InstanceCount) of five. The replicas land on N1-N5. In fact, N6 is never used, no matter how many services like this you create. But why? Let's look at the difference between the current layout and what would happen if N6 were chosen.

Here's the layout we got, along with the total number of replicas per fault and upgrade domain:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
|---------|-----|-----|-----|-----|-----|---------|
| UD0     | R1  |     |     |     |     | 1       |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     |     |     | R3  |     |     | 1       |
| UD3     |     |     |     | R4  |     | 1       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

Layout 1

This layout is balanced in terms of nodes per fault domain and upgrade domain. It is also balanced in terms of the number of replicas per fault and upgrade domain: each domain has the same number of nodes and the same number of replicas.

Now, let's look at what would happen if we'd used N6 instead of N2. How would the replicas be distributed then?

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
|---------|-----|-----|-----|-----|-----|---------|
| UD0     | R1  |     |     |     |     | 1       |
| UD1     | R5  |     |     |     |     | 1       |
| UD2     |     |     | R2  |     |     | 1       |
| UD3     |     |     |     | R3  |     | 1       |
| UD4     |     |     |     |     | R4  | 1       |
| FDTotal | 2   | 0   | 1   | 1   | 1   | -       |

Layout 2

This layout violates our definition of the "maximum difference" guarantee for the fault domain constraint. FD0 has two replicas while FD1 has zero, making the difference between FD0 and FD1 two, which is greater than the maximum difference of one. Since the constraint is violated, the Cluster Resource Manager does not allow this arrangement. Similarly, if we picked N2 and N6 (instead of N1 and N2), we'd get:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
|---------|-----|-----|-----|-----|-----|---------|
| UD0     |     |     |     |     |     | 0       |
| UD1     | R5  | R1  |     |     |     | 2       |
| UD2     |     |     | R2  |     |     | 1       |
| UD3     |     |     |     | R3  |     | 1       |
| UD4     |     |     |     |     | R4  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

Layout 3

This layout is balanced in terms of fault domains. However, it now violates the upgrade domain constraint, because UD0 has zero replicas while UD1 has two. Therefore, this layout is also invalid and won't be picked by the Cluster Resource Manager.

This approach to the distribution of stateful replicas or stateless instances provides the best possible fault tolerance. If one domain goes down, the minimal number of replicas/instances is lost.

On the other hand, this approach can be too strict and prevent the cluster from utilizing all its resources. For certain cluster configurations, certain nodes can't be used. This can lead to Service Fabric not placing your services, resulting in warning messages. In the previous example, some of the cluster nodes can't be used (N6 in the given example). Even if you added nodes to that cluster (N7-N10), replicas/instances would only be placed on N1-N5 due to the fault and upgrade domain constraints.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
|-----|-----|-----|-----|-----|-----|
| UD0 | N1  |     |     |     | N10 |
| UD1 | N6  | N2  |     |     |     |
| UD2 | N7  |     | N3  |     |     |
| UD3 | N8  |     |     | N4  |     |
| UD4 | N9  |     |     |     | N5  |

Configuration 2

Alternative approach

The Cluster Resource Manager supports another version of the fault and upgrade domain constraint, which allows placement while still guaranteeing a minimum level of safety. The alternative constraint can be stated as follows: "For a given service partition, replica distribution across domains should ensure that the partition does not suffer quorum loss." Let's say this constraint provides a "quorum safe" guarantee.

Note

For a stateful service, we define quorum loss as the situation where a majority of the partition's replicas are down at the same time. For example, if TargetReplicaSetSize is five, any set of three replicas represents a quorum. Similarly, if TargetReplicaSetSize is six, four replicas are necessary for a quorum. In both cases, no more than two replicas can be down at the same time if the partition is to continue functioning normally. For a stateless service, there is no such thing as quorum loss: stateless services continue to function normally even if a majority of instances go down at the same time. Hence, we focus on stateful services in the rest of this article.
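The quorum sizes quoted above follow the usual majority rule; as a minimal sketch (integer division):

// Majority quorum for a stateful partition: more than half of the replica set.
static int QuorumSize(int targetReplicaSetSize) => targetReplicaSetSize / 2 + 1;

// QuorumSize(5) == 3    QuorumSize(6) == 4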

Let's go back to the previous example. With the "quorum safe" version of the constraint, all three of the layouts shown would be valid. Even if FD0 failed in the second layout, or UD1 failed in the third layout, the partition would still have a quorum (a majority of its replicas would still be up). With this version of the constraint, N6 can almost always be utilized.

The "quorum safe" approach provides more flexibility than the "maximum difference" approach, since it is easier to find replica distributions that are valid in almost any cluster topology. However, this approach can't guarantee the best fault tolerance characteristics, because some failures are worse than others. In the worst case, a majority of the replicas could be lost with the failure of one domain plus one additional replica. For example, instead of three failures being required to lose quorum with five replicas or instances, you could now lose a majority with just two failures.

Adaptive approach

Because both approaches have strengths and weaknesses, we've introduced an adaptive approach that combines the two strategies.

Note

This will be the default behavior starting with Service Fabric version 6.2.

The adaptive approach uses the "maximum difference" logic by default and switches to the "quorum safe" logic only when necessary. The Cluster Resource Manager automatically figures out which strategy is needed by looking at how the cluster and services are configured. For a given service, if the TargetReplicaSetSize is evenly divisible by both the number of fault domains and the number of upgrade domains, and the number of nodes is less than or equal to the number of fault domains multiplied by the number of upgrade domains, the Cluster Resource Manager uses the "quorum based" logic for that service. Bear in mind that the Cluster Resource Manager uses this approach for both stateless and stateful services, even though quorum loss isn't relevant for stateless services.
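Stated as code, the switch described above can be sketched like this (an illustration of the published condition, not the Cluster Resource Manager's internal implementation):

// "Quorum based" logic applies to a service when both conditions hold;
// otherwise the default "maximum difference" logic is used.
static bool UsesQuorumBasedLogic(int targetReplicaSetSize, int faultDomains, int upgradeDomains, int nodes) =>
    targetReplicaSetSize % faultDomains == 0
    && targetReplicaSetSize % upgradeDomains == 0
    && nodes <= faultDomains * upgradeDomains;

// Configuration 3 below: UsesQuorumBasedLogic(5, 5, 5, 8) == true
// After TargetReplicaSetSize drops to 4: UsesQuorumBasedLogic(4, 5, 5, 8) == false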

Let's go back to the previous example and assume that the cluster now has eight nodes. The cluster is still configured with five fault domains and five upgrade domains, and the TargetReplicaSetSize of the service hosted on that cluster remains five.

|     | FD0 | FD1 | FD2 | FD3 | FD4 |
|-----|-----|-----|-----|-----|-----|
| UD0 | N1  |     |     |     |     |
| UD1 | N6  | N2  |     |     |     |
| UD2 | N7  |     | N3  |     |     |
| UD3 | N8  |     |     | N4  |     |
| UD4 |     |     |     |     | N5  |

Configuration 3

Because all the necessary conditions are satisfied, the Cluster Resource Manager uses the "quorum based" logic when distributing the service. This enables the use of N6-N8. One possible service distribution in this case could look like:

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
|---------|-----|-----|-----|-----|-----|---------|
| UD0     | R1  |     |     |     |     | 1       |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     | R3  |     | R4  |     |     | 2       |
| UD3     |     |     |     |     |     | 0       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 2   | 1   | 1   | 0   | 1   | -       |

Layout 4

If the service's TargetReplicaSetSize is reduced to four (for example), the Cluster Resource Manager notices that change and resumes using the "maximum difference" logic, because the TargetReplicaSetSize is no longer evenly divisible by the number of FDs and UDs. As a result, certain replica movements occur in order to distribute the remaining four replicas on nodes N1-N5, so that the "maximum difference" version of the fault and upgrade domain logic isn't violated.

Looking back at the fourth layout with a TargetReplicaSetSize of five: if N1 is removed from the cluster, the number of upgrade domains becomes four. Again, the Cluster Resource Manager starts using the "maximum difference" logic, because the number of UDs no longer evenly divides the service's TargetReplicaSetSize. As a result, replica R1, when built again, has to land on N4 so that the fault and upgrade domain constraint is not violated.

|         | FD0 | FD1 | FD2 | FD3 | FD4 | UDTotal |
|---------|-----|-----|-----|-----|-----|---------|
| UD0     | N/A | N/A | N/A | N/A | N/A | N/A     |
| UD1     |     | R2  |     |     |     | 1       |
| UD2     | R3  |     | R4  |     |     | 2       |
| UD3     |     |     |     | R1  |     | 1       |
| UD4     |     |     |     |     | R5  | 1       |
| FDTotal | 1   | 1   | 1   | 1   | 1   | -       |

Layout 5

Configuring fault and upgrade domains

In Azure-hosted Service Fabric deployments, fault domains and upgrade domains are defined automatically: Service Fabric picks up and uses the environment information from Azure.

If you're creating your own cluster (or want to run a particular topology in development), you can provide the fault domain and upgrade domain information yourself. In this example, we define a nine-node local development cluster that spans three "datacenters" (each with three racks). The cluster also has three upgrade domains striped across those three datacenters. An example of the configuration is below:

ClusterManifest.xml

  <Infrastructure>
    <!-- IsScaleMin indicates that this cluster runs on one-box /one single server -->
    <WindowsServer IsScaleMin="true">
      <NodeList>
        <Node NodeName="Node01" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType01" FaultDomain="fd:/DC01/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node02" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType02" FaultDomain="fd:/DC01/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node03" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType03" FaultDomain="fd:/DC01/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
        <Node NodeName="Node04" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType04" FaultDomain="fd:/DC02/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node05" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType05" FaultDomain="fd:/DC02/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node06" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType06" FaultDomain="fd:/DC02/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
        <Node NodeName="Node07" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType07" FaultDomain="fd:/DC03/Rack01" UpgradeDomain="UpgradeDomain1" IsSeedNode="true" />
        <Node NodeName="Node08" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType08" FaultDomain="fd:/DC03/Rack02" UpgradeDomain="UpgradeDomain2" IsSeedNode="true" />
        <Node NodeName="Node09" IPAddressOrFQDN="localhost" NodeTypeRef="NodeType09" FaultDomain="fd:/DC03/Rack03" UpgradeDomain="UpgradeDomain3" IsSeedNode="true" />
      </NodeList>
    </WindowsServer>
  </Infrastructure>

Via ClusterConfig.json for standalone deployments:

"nodes": [
  {
    "nodeName": "vm1",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm2",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm3",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc1/r0",
    "upgradeDomain": "UD3"
  },
  {
    "nodeName": "vm4",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm5",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm6",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc2/r0",
    "upgradeDomain": "UD3"
  },
  {
    "nodeName": "vm7",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD1"
  },
  {
    "nodeName": "vm8",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD2"
  },
  {
    "nodeName": "vm9",
    "iPAddress": "localhost",
    "nodeTypeRef": "NodeType0",
    "faultDomain": "fd:/dc3/r0",
    "upgradeDomain": "UD3"
  }
],

Note

When defining clusters via Azure Resource Manager, fault domains and upgrade domains are assigned by Azure. Therefore, the definition of your node types and virtual machine scale sets in your Azure Resource Manager template does not include fault domain or upgrade domain information.

Node properties and placement constraints

Sometimes (in fact, most of the time) you'll want to ensure that certain workloads run only on certain types of nodes in the cluster. For example, some workloads may require GPUs or SSDs while others don't. A great example of targeting hardware to particular workloads is almost every n-tier architecture out there. Certain machines serve as the front end or API-serving side of the application and are exposed to clients or the internet. Different machines, often with different hardware resources, handle the work of the compute or storage layers; these are usually not directly exposed to clients or the internet. Service Fabric expects that there are cases where particular workloads need to run on particular hardware configurations. For example:

  • an existing n-tier application has been "lifted and shifted" into a Service Fabric environment
  • a workload wants to run on specific hardware for performance, scale, or security isolation reasons
  • a workload should be isolated from other workloads for policy or resource consumption reasons

To support these sorts of configurations, Service Fabric has a first-class notion of tags that can be applied to nodes. These tags are called node properties. Placement constraints are statements attached to individual services that select for one or more node properties. Placement constraints define where services should run. The set of constraints is extensible; any key/value pair can work.

Different workloads in a cluster layout

![Cluster Layout Different Workloads][Image5]

Built-in node properties

Service Fabric defines some default node properties that can be used automatically, without the user having to define them. The default properties defined on each node are the NodeType and the NodeName. So, for example, you could write a placement constraint as "(NodeType == NodeType03)". Generally, we have found NodeType to be one of the most commonly used properties. It is useful since it corresponds 1:1 with a type of machine, and each type of machine corresponds to a type of workload in a traditional n-tier application.

Placement constraints and node properties

![Placement Constraints and Node Properties][Image6]

Placement constraint and node property syntax

The value specified in a node property can be a string, bool, or signed long. The statement at the service is called a placement constraint, since it constrains where the service can run in the cluster. The constraint can be any Boolean statement that operates on the different node properties in the cluster. The valid selectors in these Boolean statements are:

  1. Conditional checks for creating particular statements

| Statement | Syntax |
|-----------|--------|
| "equal to" | "==" |
| "not equal to" | "!=" |
| "greater than" | ">" |
| "greater than or equal to" | ">=" |
| "less than" | "<" |
| "less than or equal to" | "<=" |

  2. Boolean statements for grouping and logical operations

| Statement | Syntax |
|-----------|--------|
| "and" | "&&" |
| "or" | "\|\|" |
| "not" | "!" |
| "group as single statement" | "()" |

Here are some examples of basic constraint statements:

  • "Value >= 5"
  • "NodeColor != green"
  • "((OneProperty < 100) || ((AnotherProperty == false) && (OneProperty >= 100)))"

Only nodes where the overall placement constraint statement evaluates to "True" can have the service placed on them. Nodes that don't have a property defined do not match any placement constraint containing that property.

Let's say that the following node properties were defined for a given node type:

ClusterManifest.xml

    <NodeType Name="NodeType01">
      <PlacementProperties>
        <Property Name="HasSSD" Value="true"/>
        <Property Name="NodeColor" Value="green"/>
        <Property Name="SomeProperty" Value="5"/>
      </PlacementProperties>
    </NodeType>

Via ClusterConfig.json for standalone deployments, or Template.json for Azure-hosted clusters:

Note

In your Azure Resource Manager template, the node type name is usually parameterized. It would look like "[parameters('vmNodeType1Name')]" rather than "NodeType01".

"nodeTypes": [
    {
        "name": "NodeType01",
        "placementProperties": {
            "HasSSD": "true",
            "NodeColor": "green",
            "SomeProperty": "5"
        }
    }
],

You can create placement constraints for a service as follows:

C#

FabricClient fabricClient = new FabricClient();
StatefulServiceDescription serviceDescription = new StatefulServiceDescription();
serviceDescription.PlacementConstraints = "(HasSSD == true && SomeProperty >= 4)";
// add other required servicedescription fields
//...
await fabricClient.ServiceManager.CreateServiceAsync(serviceDescription);

PowerShell:

New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceType -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -PlacementConstraint "HasSSD == true && SomeProperty >= 4"

If all nodes of NodeType01 are valid, you can also select that node type with the constraint "(NodeType == NodeType01)".

One of the cool things about a service's placement constraints is that they can be updated dynamically during runtime. So if you need to, you can move a service around in the cluster, add and remove requirements, and so on. Service Fabric takes care of ensuring that the service stays up and available even when these types of changes are made.

C#:

StatefulServiceUpdateDescription updateDescription = new StatefulServiceUpdateDescription();
updateDescription.PlacementConstraints = "NodeType == NodeType01";
await fabricClient.ServiceManager.UpdateServiceAsync(new Uri("fabric:/app/service"), updateDescription);

PowerShell:

Update-ServiceFabricService -Stateful -ServiceName $serviceName -PlacementConstraints "NodeType == NodeType01"

Placement constraints are specified for every different named service instance. Updates always take the place of (overwrite) what was previously specified.

The cluster definition defines the properties on a node. Changing a node's properties requires a cluster configuration upgrade. Upgrading a node's properties requires each affected node to restart to report its new properties. Service Fabric manages these rolling upgrades.

Describing and managing cluster resources

One of the most important jobs of any orchestrator is to help manage resource consumption in the cluster. Managing cluster resources can mean a couple of different things. First, there's ensuring that machines are not overloaded, meaning they aren't running more services than they can handle. Second, there's balancing and optimization, which is critical to running services efficiently. Cost-effective or performance-sensitive service offerings can't allow some nodes to be hot while others are cold. Hot nodes lead to resource contention and poor performance, and cold nodes represent wasted resources and increased cost.

Service Fabric represents resources as metrics. A metric is any logical or physical resource that you want to describe to Service Fabric, for example "WorkQueueDepth" or "MemoryInMb". For information about the physical resources that Service Fabric can govern on nodes, see resource governance. For information about configuring custom metrics and their uses, see this article.

Metrics are different from placement constraints and node properties. Node properties are static descriptors of the nodes themselves. Metrics describe resources that nodes have and that services consume when they run on a node. A node property could be "HasSSD" and could be set to true or false. The amount of space available on that SSD, and how much of it is consumed by services, would be a metric like "DriveSpaceInMb".

It is important to note that, just as with placement constraints and node properties, the Service Fabric Cluster Resource Manager doesn't understand what the names of the metrics mean. Metric names are just strings. It is a good practice to declare units as part of the metric names you create when they could be ambiguous.

Capacity

Even if you turned off all resource balancing, Service Fabric's Cluster Resource Manager would still ensure that no node ends up over its capacity. Managing capacity overruns is possible unless the cluster is too full or the workload is larger than any node. Capacity is another constraint that the Cluster Resource Manager uses to understand how much of a resource a node has. Remaining capacity is also tracked for the cluster as a whole. Both the capacity and the consumption at the service level are expressed in terms of metrics. So, for example, the metric might be "ClientConnections", and a given node may have a capacity of 32768 for "ClientConnections". Other nodes can have other limits. Some service running on that node might say it is currently consuming 32256 units of the metric "ClientConnections".

During runtime, the Cluster Resource Manager tracks the remaining capacity in the cluster and on each node. To track capacity, the Cluster Resource Manager subtracts each service's usage from the capacity of the node where the service runs. With this information, it can figure out where to place or move replicas so that nodes don't go over capacity.

Cluster nodes and capacity

![Cluster nodes and capacity][Image7]

C#:

StatefulServiceDescription serviceDescription = new StatefulServiceDescription();
ServiceLoadMetricDescription metric = new ServiceLoadMetricDescription();
metric.Name = "ClientConnections";
metric.PrimaryDefaultLoad = 1024;
metric.SecondaryDefaultLoad = 0;
metric.Weight = ServiceLoadMetricWeight.High;
serviceDescription.Metrics.Add(metric);
await fabricClient.ServiceManager.CreateServiceAsync(serviceDescription);

PowerShell:

New-ServiceFabricService -ApplicationName $applicationName -ServiceName $serviceName -ServiceTypeName $serviceTypeName -Stateful -MinReplicaSetSize 3 -TargetReplicaSetSize 3 -PartitionSchemeSingleton -Metric @("ClientConnections,High,1024,0")
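In the PowerShell form, each metric is passed as a single comma-separated string. Judging from the C# example above, the fields correspond to the metric name, the weight, the primary default load, and the secondary default load ("ClientConnections,High,1024,0").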

You can see the capacities defined in the cluster manifest:

ClusterManifest.xml

    <NodeType Name="NodeType03">
      <Capacities>
        <Capacity Name="ClientConnections" Value="65536"/>
      </Capacities>
    </NodeType>

Via ClusterConfig.json for standalone deployments, or Template.json for Azure-hosted clusters:

"nodeTypes": [
    {
        "name": "NodeType03",
        "capacities": {
            "ClientConnections": "65536",
        }
    }
],

Commonly, a service's load changes dynamically. Say that a replica's load of "ClientConnections" changed from 1024 to 2048, but the node it was running on only had 512 units of capacity remaining for that metric. Now that replica or instance's placement is invalid, since there's not enough room on that node. The Cluster Resource Manager has to kick in and get the node back below capacity. It reduces the load on the node that is over capacity by moving one or more of the replicas or instances from that node to other nodes. When moving replicas, the Cluster Resource Manager tries to minimize the cost of those movements. Movement cost is discussed in this article, and more about the Cluster Resource Manager's rebalancing strategies and rules is described here.
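Dynamic load like the "ClientConnections" change above is typically reported by the service itself. As a minimal sketch, a stateful service replica can report its current load through its partition object (the metric name and value here are illustrative):

using System.Collections.Generic;
using System.Fabric;

// Inside a stateful service replica (for example, in a Reliable Services
// implementation where this.Partition is the replica's IStatefulServicePartition):
this.Partition.ReportLoad(new List<LoadMetric>
{
    new LoadMetric("ClientConnections", 2048)
});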

Cluster capacity

So how does the Service Fabric Cluster Resource Manager keep the overall cluster from becoming too full? With dynamic load, there's not a lot it can do: services can have their load spike independently of any action taken by the Cluster Resource Manager. As a result, your cluster with plenty of headroom today may be underpowered when you become famous tomorrow. That said, there are some controls baked in to prevent problems. The first thing we can do is prevent the creation of new workloads that would cause the cluster to become full.

Say that you create a stateless service and it has some load associated with it. Let's say that the service cares about the "DiskSpaceInMb" metric, and that it consumes five units of "DiskSpaceInMb" for every instance. You want to create three instances of the service. Great! That means 15 units of "DiskSpaceInMb" need to be present in the cluster for these service instances to even be created. The Cluster Resource Manager continually calculates the capacity and consumption of each metric, so it can determine the remaining capacity in the cluster. If there isn't enough space, the Cluster Resource Manager rejects the create service call.

Since the requirement is only that 15 units be available, this space could be allocated in many different ways. For example, there could be one remaining unit of capacity on 15 different nodes, or three remaining units of capacity on five different nodes. If the Cluster Resource Manager can rearrange things so that five units are available on each of three nodes, it places the service. Rearranging the cluster is usually possible unless the cluster is almost full or the existing services can't be consolidated for some reason.

Buffered capacity

Buffered capacity is another feature of the Cluster Resource Manager. It allows reservation of some portion of the overall node capacity. This capacity buffer is only used to place services during upgrades and node failures. Buffered capacity is specified globally per metric for all nodes. The value you pick for the reserved capacity is a function of the number of fault and upgrade domains you have in the cluster: more fault and upgrade domains mean you can pick a lower buffered capacity, since with more domains you can expect smaller portions of your cluster to be unavailable during upgrades and failures. Specifying buffered capacity only makes sense if you have also specified the node capacity for a metric.

Here's an example of how to specify buffered capacity:

ClusterManifest.xml

        <Section Name="NodeBufferPercentage">
            <Parameter Name="SomeMetric" Value="0.15" />
            <Parameter Name="SomeOtherMetric" Value="0.20" />
        </Section>

Via ClusterConfig.json for standalone deployments, or Template.json for Azure-hosted clusters:

"fabricSettings": [
  {
    "name": "NodeBufferPercentage",
    "parameters": [
      {
          "name": "SomeMetric",
          "value": "0.15"
      },
      {
          "name": "SomeOtherMetric",
          "value": "0.20"
      }
    ]
  }
]

The creation of new services fails when the cluster is out of buffered capacity for a metric. Preventing the creation of new services in order to preserve the buffer ensures that upgrades and failures don't cause nodes to go over capacity. Buffered capacity is optional, but it is recommended in any cluster that defines a capacity for a metric.
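As a rough worked example (assuming the buffer is simply carved out of the total capacity), the sample output shown below reports a ClusterCapacity of 189 with a NodeBufferPercentage of 10, which leaves a ClusterBufferedCapacity of about 189 × 0.9 ≈ 170 available for normal placement; with a ClusterLoad of 45, the ClusterRemainingBufferedCapacity is 170 − 45 = 125.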

The Cluster Resource Manager exposes this load information. For each metric, the information includes:

  • the buffered capacity settings
  • the total capacity
  • the current consumption
  • whether each metric is considered balanced or not
  • statistics about the standard deviation
  • the nodes that have the most and least load

Below is an example of that output:

PS C:\Users\user> Get-ServiceFabricClusterLoadInformation
LastBalancingStartTimeUtc : 9/1/2016 12:54:59 AM
LastBalancingEndTimeUtc   : 9/1/2016 12:54:59 AM
LoadMetricInformation     :
                            LoadMetricName        : Metric1
                            IsBalancedBefore      : False
                            IsBalancedAfter       : False
                            DeviationBefore       : 0.192450089729875
                            DeviationAfter        : 0.192450089729875
                            BalancingThreshold    : 1
                            Action                : NoActionNeeded
                            ActivityThreshold     : 0
                            ClusterCapacity       : 189
                            ClusterLoad           : 45
                            ClusterRemainingCapacity : 144
                            NodeBufferPercentage  : 10
                            ClusterBufferedCapacity : 170
                            ClusterRemainingBufferedCapacity : 125
                            ClusterCapacityViolation : False
                            MinNodeLoadValue      : 0
                            MinNodeLoadNodeId     : 3ea71e8e01f4b0999b121abcbf27d74d
                            MaxNodeLoadValue      : 15
                            MaxNodeLoadNodeId     : 2cc648b6770be1bc9824fa995d5b68b1
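Per-node load can be inspected in a similar way; for example, the Get-ServiceFabricNodeLoadInformation cmdlet (from the same Service Fabric PowerShell module) reports the load for each metric on a single node:

PS C:\Users\user> Get-ServiceFabricNodeLoadInformation -NodeName "Node01"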

Next steps

  • For information on the architecture and information flow within the Cluster Resource Manager, check out this article
  • Defining defragmentation metrics is one way to consolidate load on nodes instead of spreading it out. To learn how to configure defragmentation, refer to this article
  • Start from the beginning and get an Introduction to the Service Fabric Cluster Resource Manager
  • To find out how the Cluster Resource Manager manages and balances load in the cluster, check out the article on balancing load