Scale up a Service Fabric cluster primary node type by adding a Node Type

This article describes how to scale up a Service Fabric cluster primary node type by adding an additional node type to the cluster. A Service Fabric cluster is a network-connected set of virtual or physical machines into which your microservices are deployed and managed. A machine or VM that's part of a cluster is called a node. Virtual machine scale sets are an Azure compute resource that you use to deploy and manage a collection of virtual machines as a set. Every node type that is defined in an Azure cluster is set up as a separate scale set, so each node type can be managed separately.

The sample templates used in this tutorial can be found here: Service Fabric primary node type scaling samples

Warning

Do not attempt a primary node type scale-up procedure if the cluster status is unhealthy, as this will only destabilize the cluster further.
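
One way to confirm that the cluster is healthy before starting is to query its aggregated health state from PowerShell. The following is a minimal sketch, not part of the sample templates. It assumes the Service Fabric PowerShell module is installed and that a connection has already been opened with Connect-ServiceFabricCluster, as in the node-disabling step later in this article.

    # assumption: an active connection created with Connect-ServiceFabricCluster
    $health = Get-ServiceFabricClusterHealth

    if ($health.AggregatedHealthState -ne 'Ok')
    {
        Write-Warning "Cluster health is $($health.AggregatedHealthState). Resolve health issues before scaling."
    }
    else
    {
        Write-Host "Cluster is healthy."
    }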

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Process to upgrade the size and operating system of the primary node type

The following is the process for updating the VM size and operating system of the primary node type VMs. After the upgrade, the primary node type VMs are size Standard_D4_V2 and run Windows Server 2019 Datacenter with Containers.

Warning

Before attempting this procedure on a production cluster, we recommend that you study the sample templates and verify the process against a test cluster. The cluster may also be unavailable for a short period of time.

Deploy the initial Service Fabric cluster

If you want to follow along with the sample, deploy the initial cluster with a single primary node type and a single scale set: Service Fabric - Initial Cluster. You may skip this step if you already have an existing Service Fabric cluster deployed.

[!NOTE] Templates that you download or reference from the GitHub repo "Azure-Sample" must be modified to fit the Azure China Cloud environment. For example, replace some endpoints ("blob.core.windows.net" with "blob.core.chinacloudapi.cn", and "cloudapp.azure.com" with "cloudapp.chinacloudapi.cn"), and change any unsupported locations, VM images, VM sizes, SKUs, and resource provider API versions when necessary.

  1. Sign in to your Azure account.

    # sign in to your Azure account and select your subscription
    Connect-AzAccount -Environment AzureChinaCloud -SubscriptionId "<your subscription ID>"
    
  2. Create a new resource group.

    # create a resource group for your cluster deployment
    $resourceGroupName = "myResourceGroup"
    $location = "ChinaNorth"
    
    New-AzResourceGroup `
        -Name $resourceGroupName `
        -Location $location
    
  3. Fill in the parameter values in the template files.

  4. Deploy the cluster to the resource group created in step 2.

    # deploy the template files to the resource group created above
    $templateFilePath = "C:\AzureDeploy-1.json"
    $parameterFilePath = "C:\AzureDeploy.Parameters.json"
    
    New-AzResourceGroupDeployment `
        -ResourceGroupName $resourceGroupName `
        -TemplateFile $templateFilePath `
        -TemplateParameterFile $parameterFilePath
    

Add a new primary node type to the cluster

Note

The resources created in the following steps will become the new primary node type in your cluster once the scaling operation is complete. Ensure that you use names that are distinct from those of the initial Subnet, Public IP, Load Balancer, Virtual Machine Scale Set, and Node Type.

You can find a template with all of the following steps completed here: Service Fabric - New Node Type Cluster. The following steps contain partial resource snippets that highlight the changes in the new resources.

  1. Create a new Subnet in your existing Virtual Network.

    {
        "name": "[variables('subnet1Name')]",
        "properties": {
            "addressPrefix": "[variables('subnet1Prefix')]"
        }
    }
    
  2. Create a new Public IP resource with a unique domainNameLabel.

    {
        "apiVersion": "[variables('publicIPApiVersion')]",
        "type": "Microsoft.Network/publicIPAddresses",
        "name": "[concat(variables('lbIPName'),'-',variables('vmNodeType1Name'))]",
        "location": "[variables('computeLocation')]",
        "properties": {
            "dnsSettings": {
                "domainNameLabel": "[concat(variables('dnsName'),'-','nt2')]"
            },
            "publicIPAllocationMethod": "Dynamic"
        },
        "tags": {
            "resourceType": "Service Fabric",
            "clusterName": "[parameters('clusterName')]"
        }
    }
    
  3. Create a new Load Balancer resource that depends on the Public IP created above.

    "dependsOn": [
        "[concat('Microsoft.Network/publicIPAddresses/',concat(variables('lbIPName'),'-',variables('vmNodeType1Name')))]"
    ]
    
  4. Create a new Virtual Machine Scale Set that uses the new VM SKU and the OS SKU that you would like to scale up to.

    Node Type Ref

    "nodeTypeRef": "[variables('vmNodeType1Name')]"
    

    VM SKU

    "sku": {
        "name": "[parameters('vmNodeType1Size')]",
        "capacity": "[parameters('nt1InstanceCount')]",
        "tier": "Standard"
    }
    

    OS SKU

    "imageReference": {
        "publisher": "[parameters('vmImagePublisher1')]",
        "offer": "[parameters('vmImageOffer1')]",
        "sku": "[parameters('vmImageSku1')]",
        "version": "[parameters('vmImageVersion1')]"
    }
    
  5. Add a new node type to the cluster that references the Virtual Machine Scale Set created above. The isPrimary property on this node type should be set to true.

    "name": "[variables('vmNodeType1Name')]",
    "applicationPorts": {
        "endPort": "[variables('nt0applicationEndPort')]",
        "startPort": "[variables('nt0applicationStartPort')]"
    },
    "clientConnectionEndpointPort": "[variables('nt0fabricTcpGatewayPort')]",
    "durabilityLevel": "Bronze",
    "ephemeralPorts": {
        "endPort": "[variables('nt0ephemeralEndPort')]",
        "startPort": "[variables('nt0ephemeralStartPort')]"
    },
    "httpGatewayEndpointPort": "[variables('nt0fabricHttpGatewayPort')]",
    "isPrimary": true,
    "reverseProxyEndpointPort": "[variables('nt0reverseProxyEndpointPort')]",
    "vmInstanceCount": "[parameters('nt1InstanceCount')]"
    
  6. Deploy the updated ARM template.

    # deploy the updated template files to the existing resource group
    $templateFilePath = "C:\AzureDeploy-2.json"
    $parameterFilePath = "C:\AzureDeploy.Parameters.json"
    
    New-AzResourceGroupDeployment `
        -ResourceGroupName $resourceGroupName `
        -TemplateFile $templateFilePath `
        -TemplateParameterFile $parameterFilePath
    

The Service Fabric cluster will have two node types when the deployment is completed.
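
To confirm that both node types are present, the nodes can be grouped by node type. This is a minimal sketch rather than part of the sample templates. It assumes an active connection opened with Connect-ServiceFabricCluster.

    # assumption: an active connection created with Connect-ServiceFabricCluster
    # list each node type together with the number of nodes it contains
    Get-ServiceFabricNode |
        Group-Object -Property NodeType |
        Select-Object Name, Count

With the sample templates, two rows would be expected: one for the original node type and one for the new one.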

Remove the existing node type

Once the resources have finished deploying, you can begin to disable the nodes in the original primary node type. As the nodes are disabled, the system services migrate to the new primary node type that was deployed in the steps above.

  1. Set the primary node type property in the Service Fabric cluster resource to false.

    {
        "name": "[variables('vmNodeType0Name')]",
        "applicationPorts": {
            "endPort": "[variables('nt0applicationEndPort')]",
            "startPort": "[variables('nt0applicationStartPort')]"
        },
        "clientConnectionEndpointPort": "[variables('nt0fabricTcpGatewayPort')]",
        "durabilityLevel": "Bronze",
        "ephemeralPorts": {
            "endPort": "[variables('nt0ephemeralEndPort')]",
            "startPort": "[variables('nt0ephemeralStartPort')]"
        },
        "httpGatewayEndpointPort": "[variables('nt0fabricHttpGatewayPort')]",
        "isPrimary": false,
        "reverseProxyEndpointPort": "[variables('nt0reverseProxyEndpointPort')]",
        "vmInstanceCount": "[parameters('nt0InstanceCount')]"
    }
    
  2. Deploy the template with the updated isPrimary property on the original node type. You can find a template with the primary flag set to false on the original node type here: Service Fabric - Primary Node Type False.

    # deploy the updated template files to the existing resource group
    $templateFilePath = "C:\AzureDeploy-3.json"
    $parameterFilePath = "C:\AzureDeploy.Parameters.json"
    
    New-AzResourceGroupDeployment `
        -ResourceGroupName $resourceGroupName `
        -TemplateFile $templateFilePath `
        -TemplateParameterFile $parameterFilePath
    
  3. Disable the nodes in node type 0.

    Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterConnectionEndpoint `
        -KeepAliveIntervalInSec 10 `
        -X509Credential `
        -ServerCertThumbprint $thumb  `
        -FindType FindByThumbprint `
        -FindValue $thumb `
        -StoreLocation CurrentUser `
        -StoreName My 
    
    Write-Host "Connected to cluster"
    
    $nodeType = "nt1vm" # specify the name of node type
    $nodes = Get-ServiceFabricNode 
    
    Write-Host "Disabling nodes..."
    foreach($node in $nodes)
    {
      if ($node.NodeType -eq $nodeType)
      {
        $node.NodeName
    
        Disable-ServiceFabricNode -Intent RemoveNode -NodeName $node.NodeName -Force
      }
    }
    
    • For bronze durability, wait for all nodes to reach the disabled state.
    • For silver and gold durability, some nodes will go into the disabled state and the rest will remain in the disabling state. Check the details tab of the nodes in the disabling state. If they are all stuck on ensuring quorum for Infrastructure service partitions, it is safe to continue.

    Note

    This step may take a while to complete.

  4. Stop the nodes in node type 0.

    foreach($node in $nodes)
    {
      if ($node.NodeType -eq $nodeType)
      {
        $node.NodeName
    
        Start-ServiceFabricNodeTransition -Stop -OperationId (New-Guid) -NodeInstanceId $node.NodeInstanceId -NodeName $node.NodeName -StopDurationInSeconds 10000
      }
    }
    
  5. Deallocate the nodes by removing the original Virtual Machine Scale Set.

    $scaleSetName="nt1vm"
    $scaleSetResourceType="Microsoft.Compute/virtualMachineScaleSets"
    
    Remove-AzResource -ResourceName $scaleSetName -ResourceType $scaleSetResourceType -ResourceGroupName $resourceGroupName -Force
    

    Note

    Steps 6 and 7 are optional if you are already using a Standard SKU Public IP and a Standard SKU Load Balancer. In that case, you can have multiple virtual machine scale sets/node types under the same load balancer.

  6. You can now delete the original IP and Load Balancer resources. In this step you will also update the DNS name.

    $lbname="LB-cluster-name-nt1vm"
    $lbResourceType="Microsoft.Network/loadBalancers"
    $ipResourceType="Microsoft.Network/publicIPAddresses"
    $oldPublicIpName="PublicIP-LB-FE-nt1vm"
    $newPublicIpName="PublicIP-LB-FE-nt2vm"
    
    $oldprimaryPublicIP = Get-AzPublicIpAddress -Name $oldPublicIpName  -ResourceGroupName $resourceGroupName
    $primaryDNSName = $oldprimaryPublicIP.DnsSettings.DomainNameLabel
    $primaryDNSFqdn = $oldprimaryPublicIP.DnsSettings.Fqdn
    
    Remove-AzResource -ResourceName $lbname -ResourceType $lbResourceType -ResourceGroupName $resourceGroupName -Force
    Remove-AzResource -ResourceName $oldPublicIpName -ResourceType $ipResourceType -ResourceGroupName $resourceGroupName -Force
    
    $PublicIP = Get-AzPublicIpAddress -Name $newPublicIpName  -ResourceGroupName $resourceGroupName
    $PublicIP.DnsSettings.DomainNameLabel = $primaryDNSName
    $PublicIP.DnsSettings.Fqdn = $primaryDNSFqdn
    Set-AzPublicIpAddress -PublicIpAddress $PublicIP
    
  7. Update the management endpoint on the cluster to reference the new IP.

    "managementEndpoint": "[concat('https://',reference(concat(variables('lbIPName'),'-',variables('vmNodeType1Name'))).dnsSettings.fqdn,':',variables('nt0fabricHttpGatewayPort'))]",
    
  8. Remove node state from node type 0.

    foreach($node in $nodes)
    {
      if ($node.NodeType -eq $nodeType)
      {
        $node.NodeName
    
        Remove-ServiceFabricNodeState -NodeName $node.NodeName -Force
      }
    }
    
  9. Remove the original node type reference from the Service Fabric resource in the ARM template.

    "name": "[variables('vmNodeType0Name')]",
    "applicationPorts": {
        "endPort": "[variables('nt0applicationEndPort')]",
        "startPort": "[variables('nt0applicationStartPort')]"
    },
    "clientConnectionEndpointPort": "[variables('nt0fabricTcpGatewayPort')]",
    "durabilityLevel": "Bronze",
    "ephemeralPorts": {
        "endPort": "[variables('nt0ephemeralEndPort')]",
        "startPort": "[variables('nt0ephemeralStartPort')]"
    },
    "httpGatewayEndpointPort": "[variables('nt0fabricHttpGatewayPort')]",
    "isPrimary": true,
    "reverseProxyEndpointPort": "[variables('nt0reverseProxyEndpointPort')]",
    "vmInstanceCount": "[parameters('nt0InstanceCount')]"
    

    For Silver and higher durability clusters only: update the cluster resource in the template and configure health policies to ignore fabric:/System application health by adding applicationDeltaHealthPolicies under the cluster resource properties, as shown below. This policy should ignore existing errors but not allow new health errors.

    "upgradeDescription":  
    { 
     "forceRestart": false, 
     "upgradeReplicaSetCheckTimeout": "10675199.02:48:05.4775807", 
     "healthCheckWaitDuration": "00:05:00", 
     "healthCheckStableDuration": "00:05:00", 
     "healthCheckRetryTimeout": "00:45:00", 
     "upgradeTimeout": "12:00:00", 
     "upgradeDomainTimeout": "02:00:00", 
     "healthPolicy": { 
       "maxPercentUnhealthyNodes": 100, 
       "maxPercentUnhealthyApplications": 100 
     }, 
     "deltaHealthPolicy":  
     { 
       "maxPercentDeltaUnhealthyNodes": 0, 
       "maxPercentUpgradeDomainDeltaUnhealthyNodes": 0, 
       "maxPercentDeltaUnhealthyApplications": 0, 
       "applicationDeltaHealthPolicies":  
       { 
           "fabric:/System":  
           { 
               "defaultServiceTypeDeltaHealthPolicy":  
               { 
                       "maxPercentDeltaUnhealthyServices": 0 
               } 
           } 
       } 
     } 
    }
    
  10. Remove all other resources related to the original node type from the ARM template. See Service Fabric - New Node Type Cluster for a template with all of these original resources removed.

  11. Deploy the modified Azure Resource Manager template. This step will take a while, usually up to two hours. This upgrade changes settings on the InfrastructureService, so a node restart is needed. In this case, forceRestart is ignored. The parameter upgradeReplicaSetCheckTimeout specifies the maximum time that Service Fabric waits for a partition to reach a safe state, if it is not already in one. Once safety checks pass for all partitions on a node, Service Fabric proceeds with the upgrade on that node. The value of the parameter upgradeTimeout can be reduced to 6 hours, but for maximal safety 12 hours should be used. Then validate that the Service Fabric resource in the portal shows as ready.

    # deploy the updated template files to the existing resource group
    $templateFilePath = "C:\AzureDeploy-4.json"
    $parameterFilePath = "C:\AzureDeploy.Parameters.json"
    
    New-AzResourceGroupDeployment `
        -ResourceGroupName $resourceGroupName `
        -TemplateFile $templateFilePath `
        -TemplateParameterFile $parameterFilePath
    

    The cluster's primary node type has now been upgraded. Verify that any deployed applications function properly and that cluster health is OK.
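
The final verification can also be scripted. The following is a minimal sketch, not part of the sample templates. It assumes an active connection opened with Connect-ServiceFabricCluster, and it only reports health states; whether an application functions properly still needs to be checked separately.

    # assumption: an active connection created with Connect-ServiceFabricCluster
    # list the remaining nodes with their node type, status, and health
    Get-ServiceFabricNode |
        Select-Object NodeName, NodeType, NodeStatus, HealthState

    # report the aggregated health state of each deployed application
    Get-ServiceFabricApplication | ForEach-Object {
        $appHealth = Get-ServiceFabricApplicationHealth -ApplicationName $_.ApplicationName
        "{0}: {1}" -f $_.ApplicationName, $appHealth.AggregatedHealthState
    }

After a successful upgrade, only nodes of the new node type should be listed.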

Next steps