Capacity planning and scaling for Azure Service Fabric

Before you create any Azure Service Fabric cluster or scale the compute resources that host your cluster, it's important to plan for capacity. For more information about planning for capacity, see Planning the Service Fabric cluster capacity.

In addition to considering node type and cluster characteristics, you should expect scaling operations in a production environment to take longer than an hour to complete. This is true regardless of the number of VMs you're adding.

Autoscaling

You should perform scaling operations via Azure Resource Manager templates, because it's a best practice to treat resource configurations as code.

Using automatic scaling through virtual machine scale sets makes your versioned Resource Manager template define the instance counts for your virtual machine scale sets inaccurately. An inaccurate definition increases the risk that future deployments will cause unintended scaling operations. In general, you should use autoscaling if:

  • Deploying your Resource Manager templates with appropriate capacity declared doesn't support your use case.

    In addition to manual scaling, you can configure a continuous integration and delivery pipeline in Azure DevOps Services by using Azure resource group deployment projects. This pipeline is commonly triggered by a logic app that uses virtual machine performance metrics queried from the Azure Monitor REST API. The pipeline effectively autoscales based on whatever metrics you want, while optimizing for Resource Manager templates.

  • You need to horizontally scale only one virtual machine scale set node at a time.

    To scale out by three or more nodes at a time, you should scale out a Service Fabric cluster by adding a virtual machine scale set. It's safest to scale virtual machine scale sets in and out horizontally, one node at a time.

  • You have Silver reliability or higher for your Service Fabric cluster, and Silver durability or higher on any scale set where you configure autoscaling rules.

    The minimum capacity for autoscaling rules must be equal to or greater than five virtual machine instances. It must also be equal to or greater than the reliability tier minimum for your primary node type.
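
The two minimum-capacity conditions in the last bullet can be combined into a single check. The following is a minimal Python sketch, not part of any Azure SDK. The function name and the `reliability_tier_minimum` parameter are hypothetical, and the caller is assumed to supply the minimum instance count that the reliability tier requires for the primary node type.

```python
# Autoscaling rules need a minimum capacity of at least five VM instances.
AUTOSCALE_FLOOR = 5


def is_valid_autoscale_minimum(min_capacity: int, reliability_tier_minimum: int) -> bool:
    """Return True when an autoscale rule's minimum capacity meets both conditions.

    reliability_tier_minimum is a hypothetical input: the minimum instance count
    required by the reliability tier of the primary node type.
    """
    required = max(AUTOSCALE_FLOOR, reliability_tier_minimum)
    return min_capacity >= required
```

For example, a minimum capacity of 5 would pass when the tier minimum is 5, but fail when the tier minimum is 7.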

Note

The Service Fabric stateful service fabric:/System/InfrastructureService/<NODE_TYPE_NAME> runs on every node type that has Silver or higher durability. It's the only system service that is supported to run in Azure on any of your cluster's node types.

Important

Service Fabric autoscaling supports the Default and NewestVM virtual machine scale set scale-in configurations.

Vertical scaling considerations

Vertically scaling a node type in Azure Service Fabric requires a number of steps and considerations. For example:

  • The cluster must be healthy before scaling. Otherwise, you'll destabilize the cluster further.
  • Silver durability level or greater is required for all Service Fabric cluster node types that host stateful services.

Note

Your primary node type that hosts stateful Service Fabric system services must have Silver durability level or greater. After you enable Silver durability, cluster operations such as upgrades and adding or removing nodes are slower, because the system optimizes for data safety over speed of operations.

Vertically scaling a virtual machine scale set is a destructive operation: changing the SKU of a virtual machine scale set resource reimages your hosts, which removes all locally persisted state. Instead, horizontally scale your cluster by adding a new scale set with the desired SKU. Then, migrate your services to the desired SKU to complete a safe vertical scaling operation.

Your cluster uses Service Fabric node properties and placement constraints to decide where to host your application's services. When you're vertically scaling your primary node type, declare identical property values for "nodeTypeRef". You can find these values in the Service Fabric extension for virtual machine scale sets.

The following snippet of a Resource Manager template shows the properties you'll declare. The newly provisioned scale set that you're scaling to uses the same value, and sharing the value across scale sets is supported only temporarily, for the duration of the vertical scaling operation.

"settings": {
   "nodeTypeRef": ["[parameters('primaryNodetypeName')]"]
}

Note

Don't leave your cluster running with multiple scale sets that use the same nodeTypeRef property value for longer than is required to complete a successful vertical scaling operation.

Always validate operations in test environments before you attempt changes to the production environment. By default, Service Fabric cluster system services have a placement constraint that targets only the primary node type.

With the node properties and placement constraints declared, do the following steps one VM instance at a time. This allows the system services (and your stateful services) to be shut down gracefully on the VM instance you're removing as new replicas are created elsewhere.

  1. From PowerShell, run Disable-ServiceFabricNode with intent RemoveNode to disable the node you're going to remove. Remove the node that has the highest instance number. For example, if you have a six-node cluster, remove the "MyNodeType_5" virtual machine instance.

  2. Run Get-ServiceFabricNode to make sure that the node has transitioned to disabled. If not, wait until the node is disabled. This might take a couple of hours for each node. Don't proceed until the node has transitioned to disabled.

  3. Decrease the number of VMs by one in that node type. The highest-numbered VM instance will now be removed.

  4. Repeat steps 1 through 3 as needed, but never scale in the number of instances in the primary node type to fewer than the reliability tier warrants. See Planning the Service Fabric cluster capacity for a list of recommended instances.

  5. Once all VMs are gone (represented as "Down"), the fabric:/System/InfrastructureService/[node type] service will show an Error state. Then, you can update the cluster resource to remove the node type, for example through an ARM template deployment. This starts a cluster upgrade, which removes the fabric:/System/InfrastructureService/[node type] service that is in the Error state.

  6. After that, you can optionally delete the virtual machine scale set. You'll still see the nodes as "Down" in the Service Fabric Explorer view. The last step is to clean them up with the Remove-ServiceFabricNodeState command.
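
The order of operations in these steps can be sketched as a small planning helper. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, node names are assumed to follow the "MyNodeType_5" pattern used in the example above, and the generated PowerShell lines assume a session that is already connected to the cluster. It only builds command strings; it doesn't run them, and it doesn't cover the VM count change in step 3 or the cluster resource update in step 5.

```python
from typing import List


def retirement_order(node_type: str, instance_count: int) -> List[str]:
    """List node names from the highest instance number down to zero."""
    return [f"{node_type}_{i}" for i in range(instance_count - 1, -1, -1)]


def node_removal_commands(node_name: str) -> List[str]:
    """Return the cmdlet calls for one node, in the order the steps use them."""
    return [
        # Step 1: disable the node with intent RemoveNode.
        f'Disable-ServiceFabricNode -NodeName "{node_name}" -Intent RemoveNode',
        # Step 2: check the node status until it shows as disabled.
        f'Get-ServiceFabricNode -NodeName "{node_name}"',
        # Step 6: clean up the node state once the VM is gone.
        f'Remove-ServiceFabricNodeState -NodeName "{node_name}"',
    ]
```

Calling `retirement_order("MyNodeType", 6)` gives the nodes in the order to retire them, starting with "MyNodeType_5".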

Horizontal scaling

You can do horizontal scaling either manually or programmatically.

Note

If you're scaling a node type that has Silver or Gold durability, scaling will be slow.

Scaling out

Scale out a Service Fabric cluster by increasing the instance count for a particular virtual machine scale set. You can scale out programmatically by using AzureClient and the ID of the desired scale set to increase the capacity.

var scaleSet = AzureClient.VirtualMachineScaleSets.GetById(ScaleSetId);
var newCapacity = (int)Math.Min(MaximumNodeCount, scaleSet.Capacity + 1);
scaleSet.Update().WithCapacity(newCapacity).Apply(); 
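
The capacity arithmetic in the snippet above (add one instance, capped at a maximum) can be extended to plan a larger scale-out as a series of single-node updates, in line with the guidance to scale one node at a time. The following is a minimal Python sketch of that arithmetic only. The function name and parameters are hypothetical, and applying each capacity to the scale set is left to the caller.

```python
from typing import List


def scale_out_steps(current: int, target: int, maximum_node_count: int) -> List[int]:
    """Return the capacities to apply in sequence, growing one instance per update.

    The sequence never exceeds maximum_node_count, mirroring the Math.Min
    check in the snippet above.
    """
    ceiling = min(target, maximum_node_count)
    return list(range(current + 1, ceiling + 1))
```

For example, going from 5 toward 8 instances with a maximum of 7 yields the updates 6 and then 7.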

To scale out manually, update the capacity in the SKU property of the desired virtual machine scale set resource.

"sku": {
    "name": "[parameters('vmNodeType0Size')]",
    "capacity": "[parameters('nt0InstanceCount')]",
    "tier": "Standard"
}

Scaling in

Scaling in requires more consideration than scaling out. For example:

  • Service Fabric system services run in the primary node type in your cluster. Never shut down instances of that node type or scale them in so that you have fewer instances than the reliability tier warrants.
  • For a stateful service, you need a certain number of nodes that are always up to maintain availability and preserve the state of your service. At a minimum, you need a number of nodes equal to the target replica set count of the partition or service.

To scale in manually, follow these steps:

  1. From PowerShell, run Disable-ServiceFabricNode with intent RemoveNode to disable the node you're going to remove. Remove the node that has the highest instance number. For example, if you have a six-node cluster, remove the "MyNodeType_5" virtual machine instance.
  2. Run Get-ServiceFabricNode to make sure that the node has transitioned to disabled. If not, wait until the node is disabled. This might take a couple of hours for each node. Don't proceed until the node has transitioned to disabled.
  3. Decrease the number of VMs by one in that node type. The highest-numbered VM instance will now be removed.
  4. Repeat steps 1 through 3 as needed until you reach the capacity you want. Don't scale in the number of instances in the primary node type to fewer than the reliability tier warrants. See Planning the Service Fabric cluster capacity for a list of recommended instances.
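
The choice of which node to remove, together with the lower bound on instance count, can be sketched as follows. This is a minimal Python sketch, not a Service Fabric API. The function name is hypothetical, node names are assumed to end in "_<instance number>" as in the "MyNodeType_5" example, and `minimum_count` stands for whatever floor applies to the node type (for example, the reliability tier minimum or a service's target replica set count).

```python
from typing import List, Optional


def pick_node_to_remove(node_names: List[str], node_type: str, minimum_count: int) -> Optional[str]:
    """Return the highest-numbered node of node_type, or None if removal is not allowed.

    Removal is refused when it would leave fewer than minimum_count nodes.
    """
    prefix = f"{node_type}_"
    # Keep only names of the form "<node_type>_<digits>".
    candidates = [n for n in node_names if n.startswith(prefix) and n[len(prefix):].isdigit()]
    if len(candidates) <= minimum_count:
        return None
    # Compare instance numbers numerically so that 10 sorts above 9.
    return max(candidates, key=lambda n: int(n[len(prefix):]))
```

With six nodes named "MyNodeType_0" through "MyNodeType_5" and a minimum of five, the sketch picks "MyNodeType_5". With a minimum of six, it returns None.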

To decrease the number of VMs manually (step 3), update the capacity in the SKU property of the desired virtual machine scale set resource.

"sku": {
    "name": "[parameters('vmNodeType0Size')]",
    "capacity": "[parameters('nt0InstanceCount')]",
    "tier": "Standard"
}

To scale in programmatically, you must first prepare the node for shutdown. Find the node to be removed (the node with the highest instance number). For example:

using (var client = new FabricClient())
{
    var mostRecentLiveNode = (await client.QueryManager.GetNodeListAsync())
        .Where(n => n.NodeType.Equals(NodeTypeToScale, StringComparison.OrdinalIgnoreCase))
        .Where(n => n.NodeStatus == System.Fabric.Query.NodeStatus.Up)
        .OrderByDescending(n =>
        {
            var instanceIdIndex = n.NodeName.LastIndexOf("_");
            var instanceIdString = n.NodeName.Substring(instanceIdIndex + 1);
            return int.Parse(instanceIdString);
        })
        .FirstOrDefault();

Deactivate and remove the node by using the same FabricClient instance (client in this case) and the node that you found (mostRecentLiveNode in this case) in the previous code:

var scaleSet = AzureClient.VirtualMachineScaleSets.GetById(ScaleSetId);

// Remove the node from the Service Fabric cluster
ServiceEventSource.Current.ServiceMessage(Context, $"Disabling node {mostRecentLiveNode.NodeName}");
await client.ClusterManager.DeactivateNodeAsync(mostRecentLiveNode.NodeName, NodeDeactivationIntent.RemoveNode);

// Wait (up to a timeout) for the node to gracefully shut down
var timeout = TimeSpan.FromMinutes(5);
var waitStart = DateTime.Now;
// Stop polling if the node no longer appears in the node list (FirstOrDefault returned null)
while (mostRecentLiveNode != null &&
        (mostRecentLiveNode.NodeStatus == System.Fabric.Query.NodeStatus.Up || mostRecentLiveNode.NodeStatus == System.Fabric.Query.NodeStatus.Disabling) &&
        DateTime.Now - waitStart < timeout)
{
    mostRecentLiveNode = (await client.QueryManager.GetNodeListAsync()).FirstOrDefault(n => n.NodeName == mostRecentLiveNode.NodeName);
    await Task.Delay(10 * 1000);
}

// Decrement virtual machine scale set capacity
var newCapacity = (int)Math.Max(MinimumNodeCount, scaleSet.Capacity - 1); // Check min count 

scaleSet.Update().WithCapacity(newCapacity).Apply();
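
The wait loop in the code above is a poll-with-timeout pattern. The following minimal Python sketch isolates that pattern so that it can be exercised without a cluster. Everything in it is an assumption for illustration: `get_status` is a hypothetical callback that returns the node status as a string, and the "Up" and "Disabling" strings stand in for the NodeStatus values. Unlike the code above, the sketch reports whether the node left those states, so a caller could choose to skip the capacity change on timeout.

```python
import time
from typing import Callable


def wait_for_node_disabled(
    get_status: Callable[[], str],
    timeout_seconds: float,
    poll_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node leaves the Up/Disabling states.

    Returns True if that happened before the timeout, False otherwise.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_status() not in ("Up", "Disabling"):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

Because disabling a node can take a long time, the timeout is a parameter rather than a fixed value.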

Note

When you scale in a cluster, you'll see the removed node/VM instance displayed in an unhealthy state in Service Fabric Explorer. For an explanation of this behavior, see Behaviors you may observe in Service Fabric Explorer.

Reliability levels

The reliability level is a property of your Service Fabric cluster resource. It can't be configured differently for individual node types. It controls the replication factor of the system services for the cluster, and is a setting at the cluster resource level.

The reliability level determines the minimum number of nodes that your primary node type must have. The reliability tier can take the following values:

  • Platinum: Runs the system services with a target replica set count of seven, and with nine seed nodes.
  • Gold: Runs the system services with a target replica set count of seven, and with seven seed nodes.
  • Silver: Runs the system services with a target replica set count of five, and with five seed nodes.
  • Bronze: Runs the system services with a target replica set count of three, and with three seed nodes.

The minimum recommended reliability level is Silver.
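
The tier values listed above can be kept in a small lookup table for use in scaling checks. This is a minimal Python sketch with hypothetical names. The replica and seed node counts are copied from the list above. Treating the larger of the two counts as the minimum size of the primary node type is an assumption made for illustration, so confirm the actual minimums in the capacity planning guidance.

```python
# tier name -> (target replica set count, seed nodes), as listed above
RELIABILITY_TIERS = {
    "Platinum": (7, 9),
    "Gold": (7, 7),
    "Silver": (5, 5),
    "Bronze": (3, 3),
}


def minimum_primary_nodes(tier: str) -> int:
    """Assumed minimum primary node count: the larger of the two counts for the tier."""
    replicas, seeds = RELIABILITY_TIERS[tier]
    return max(replicas, seeds)
```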

The reliability level is set in the properties section of the Microsoft.ServiceFabric/clusters resource, like this:

"properties":{
    "reliabilityLevel": "Silver"
}

Durability levels

Warning

Node types running with Bronze durability obtain no privileges. Infrastructure jobs that affect your stateless workloads will not be stopped or delayed, which might affect your workloads.

Use Bronze durability only for node types that run stateless workloads. For production workloads, run Silver or higher to ensure state consistency. Choose the right reliability level based on the guidance in the capacity planning documentation.

The durability level must be set in two resources. One is the extension profile of the virtual machine scale set resource:

"extensionProfile": {
    "extensions": [
        {
            "name": "[concat('ServiceFabricNodeVmExt','_vmNodeType0Name')]",
            "properties": {
                "settings": {
                    "durabilityLevel": "Bronze"
                }
            }
        }
    ]
}

The other is under nodeTypes in the Microsoft.ServiceFabric/clusters resource:

"nodeTypes": [
    {
        "name": "[variables('vmNodeType0Name')]",
        "durabilityLevel": "Bronze"
    }
]
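
Because the durability level is declared in two places, the two values can drift apart. The following minimal Python sketch compares them. It is an illustration under stated assumptions: the function name is hypothetical, and both inputs are simplified mappings from node type name to durability level that the caller has already extracted from the template, not the template structure itself.

```python
from typing import Dict, List


def durability_mismatches(scale_set_levels: Dict[str, str], cluster_levels: Dict[str, str]) -> List[str]:
    """Return node type names whose two durability settings differ.

    A node type that is present on only one side also counts as a mismatch.
    """
    names = sorted(set(scale_set_levels) | set(cluster_levels))
    return [n for n in names if scale_set_levels.get(n) != cluster_levels.get(n)]
```

An empty result means that every node type declares the same durability level in both resources.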

Next steps