Scale up a Service Fabric cluster primary node type

This article describes how to scale up a Service Fabric cluster primary node type by increasing the virtual machine resources. A Service Fabric cluster is a network-connected set of virtual or physical machines into which your microservices are deployed and managed. A machine or VM that's part of a cluster is called a node. Virtual machine scale sets are an Azure compute resource that you use to deploy and manage a collection of virtual machines as a set. Every node type that is defined in an Azure cluster is set up as a separate scale set, so each node type can be managed separately. After creating a Service Fabric cluster, you can scale a cluster node type vertically (change the resources of the nodes) or upgrade the operating system of the node type VMs. You can scale the cluster at any time, even when workloads are running on the cluster. As the cluster scales, your applications automatically scale as well.

Warning

Do not start changing the primary node type VM SKU if the cluster is unhealthy. Changing the VM SKU of an unhealthy cluster only destabilizes it further.

We recommend that you do not change the VM SKU of a scale set/node type unless it is running at Silver durability or greater. Changing the VM SKU size is a data-destructive, in-place infrastructure operation. Without some ability to delay or monitor this change, the operation can cause data loss for stateful services or other unforeseen operational issues, even for stateless workloads. This applies to your primary node type, which runs stateful Service Fabric system services, and to any node type that runs your stateful application workloads.
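As a rough pre-flight check for the two warnings above, the sketch below reads the aggregated cluster health and lists the durability level of each node type. It assumes you are already connected to the cluster with Connect-ServiceFabricCluster and signed in with Connect-AzAccount. The cluster resource name `sfupgradetest` is a hypothetical value; substitute your own.

```powershell
# Assumption: Connect-ServiceFabricCluster and Connect-AzAccount have already succeeded.
$groupname = "sfupgradetestgroup"
$clusterResourceName = "sfupgradetest"   # hypothetical cluster resource name

# Stop early if the cluster does not report an aggregated health state of Ok.
$health = Get-ServiceFabricClusterHealth
if ($health.AggregatedHealthState -ne [System.Fabric.Health.HealthState]::Ok) {
    Write-Error "Cluster health is $($health.AggregatedHealthState); do not change the VM SKU yet."
    return
}

# List each node type with its durability level so you can confirm Silver or greater.
$cluster = Get-AzServiceFabricCluster -ResourceGroupName $groupname -Name $clusterResourceName
$cluster.NodeTypes | Select-Object Name, IsPrimary, DurabilityLevel, VmInstanceCount
```

If the primary node type shows Bronze durability, treat the procedure as higher risk and rehearse it on a test cluster first.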

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Upgrade the size and operating system of the primary node type VMs

Here is the process for updating the VM size and operating system of the primary node type VMs. After the upgrade, the primary node type VMs are size Standard D4_V2 and run Windows Server 2016 Datacenter with Containers.

Warning

Before attempting this procedure on a production cluster, we recommend that you study the sample templates and verify the process against a test cluster. The cluster is also unavailable for a time during the procedure. You cannot make changes in parallel to multiple virtual machine scale sets (VMSS) declared as the same NodeType; you need to perform separate deployment operations to apply changes to each NodeType VMSS individually.

  1. Deploy the initial cluster with two node types and two scale sets (one scale set per node type) using these sample template and parameters files. Both scale sets are size Standard D2_V2 and run Windows Server 2012 R2 Datacenter. Wait for the cluster to complete the baseline upgrade.
  2. Optional: deploy a stateful sample to the cluster.
  3. After deciding to upgrade the primary node type VMs, add a new scale set to the primary node type using these sample template and parameters files, so that the primary node type now has two scale sets. System services and user applications can migrate between VMs in the two different scale sets. The new scale set VMs are size Standard D4_V2 and run Windows Server 2016 Datacenter with Containers. A new load balancer and public IP address are also added with the new scale set.
    To find the new scale set in the template, search for the "Microsoft.Compute/virtualMachineScaleSets" resource named by the vmNodeType2Name parameter. The new scale set is added to the primary node type using the properties->virtualMachineProfile->extensionProfile->extensions->properties->settings->nodeTypeRef setting.
  4. Check the cluster health and verify that all the nodes are healthy.
  5. Disable the nodes in the old scale set of the primary node type with the intent to remove the nodes. You can disable all of them at once; the operations are queued. Wait until all nodes are disabled, which may take some time. As the older nodes in the node type are disabled, the system services and seed nodes migrate to the VMs of the new scale set in the primary node type.
  6. Remove the older scale set from the primary node type.
  7. Remove the load balancer associated with the old scale set. The cluster is unavailable while the new public IP address and load balancer are configured for the new scale set.
  8. Store the DNS settings of the public IP address associated with the old primary node type scale set in a variable, and remove that public IP address.
  9. Replace the DNS settings of the public IP address associated with the new primary node type scale set with the DNS settings of the deleted public IP address. The cluster is now reachable again.
  10. Remove the node state of the nodes from the cluster. If the durability level of the old scale set was Silver or Gold, the system does this step automatically.
  11. If you deployed the stateful application in a previous step, verify that the application is functional.
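To see which node type each scale set in a template joins (the nodeTypeRef setting described in step 3), a minimal sketch like the following can list the values before you deploy. It assumes the template is plain JSON without comments and is saved at the path used elsewhere in this article. The values printed are typically template expressions rather than resolved names.

```powershell
# Assumption: the template file is plain JSON and lives at this path.
$templatePath = "C:\temp\cluster\Deploy-2NodeTypes-3ScaleSets.json"
$template = Get-Content -Path $templatePath -Raw | ConvertFrom-Json

# For every scale set resource, report the nodeTypeRef of any extension that defines one.
$template.resources |
    Where-Object { $_.type -eq "Microsoft.Compute/virtualMachineScaleSets" } |
    ForEach-Object {
        $scaleSet = $_
        $scaleSet.properties.virtualMachineProfile.extensionProfile.extensions |
            Where-Object { $_.properties.settings.nodeTypeRef } |
            ForEach-Object {
                [pscustomobject]@{
                    ScaleSetName = $scaleSet.name
                    NodeTypeRef  = $_.properties.settings.nodeTypeRef
                }
            }
    }
```

Two scale sets that report the same NodeTypeRef belong to the same node type, which is what lets system services migrate between them.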
# Variables.
$groupname = "sfupgradetestgroup"
$clusterloc="chinaeast"  
$subscriptionID="<your subscription ID>"

# sign in to your Azure account and select your subscription
Connect-AzAccount -Environment AzureChinaCloud -SubscriptionId $subscriptionID 

# Create a new resource group for your deployment and give it a name and a location.
New-AzResourceGroup -Name $groupname -Location $clusterloc

# Deploy the two node type cluster.
New-AzResourceGroupDeployment -ResourceGroupName $groupname -TemplateParameterFile "C:\temp\cluster\Deploy-2NodeTypes-2ScaleSets.parameters.json" `
    -TemplateFile "C:\temp\cluster\Deploy-2NodeTypes-2ScaleSets.json" -Verbose

# Connect to the cluster and check the cluster health.
$ClusterName= "sfupgradetest.chinaeast.cloudapp.chinacloudapi.cn:19000"
$thumb="F361720F4BD5449F6F083DDE99DC51A86985B25B"

Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
    -X509Credential `
    -ServerCertThumbprint $thumb  `
    -FindType FindByThumbprint `
    -FindValue $thumb `
    -StoreLocation CurrentUser `
    -StoreName My 

Get-ServiceFabricClusterHealth

# Deploy a new scale set into the primary node type.  Create a new load balancer and public IP address for the new scale set.
New-AzResourceGroupDeployment -ResourceGroupName $groupname -TemplateParameterFile "C:\temp\cluster\Deploy-2NodeTypes-3ScaleSets.parameters.json" `
    -TemplateFile "C:\temp\cluster\Deploy-2NodeTypes-3ScaleSets.json" -Verbose

# Check the cluster health again. All 15 nodes should be healthy.
Get-ServiceFabricClusterHealth

# Disable the nodes in the original scale set.
$nodeNames = @("_NTvm1_0","_NTvm1_1","_NTvm1_2","_NTvm1_3","_NTvm1_4")

Write-Host "Disabling nodes..."
foreach($name in $nodeNames){
    Disable-ServiceFabricNode -NodeName $name -Intent RemoveNode -Force
}

Write-Host "Checking node status..."
foreach($name in $nodeNames){

    $state = Get-ServiceFabricNode -NodeName $name 

    $loopTimeout = 50

    do{
        Start-Sleep 5
        $loopTimeout -= 1
        $state = Get-ServiceFabricNode -NodeName $name
        Write-Host "$name state: " $state.NodeDeactivationInfo.Status
    }

    while (($state.NodeDeactivationInfo.Status -ne "Completed") -and ($loopTimeout -ne 0))

    if ($state.NodeStatus -ne [System.Fabric.Query.NodeStatus]::Disabled)
    {
        Write-Error "$name node deactivation failed with state $($state.NodeStatus)"
        exit
    }
}

# Remove the scale set
$scaleSetName="NTvm1"
Remove-AzVmss -ResourceGroupName $groupname -VMScaleSetName $scaleSetName -Force
Write-Host "Removed scale set $scaleSetName"

$lbname="LB-sfupgradetest-NTvm1"
$oldPublicIpName="PublicIP-LB-FE-0"
$newPublicIpName="PublicIP-LB-FE-2"

# Store DNS settings of public IP address related to old Primary NodeType into variable 
$oldprimaryPublicIP = Get-AzPublicIpAddress -Name $oldPublicIpName  -ResourceGroupName $groupname

$primaryDNSName = $oldprimaryPublicIP.DnsSettings.DomainNameLabel

$primaryDNSFqdn = $oldprimaryPublicIP.DnsSettings.Fqdn

# Remove Load Balancer related to old Primary NodeType. This will cause a brief period of downtime for the cluster
Remove-AzLoadBalancer -Name $lbname -ResourceGroupName $groupname -Force

# Remove the old public IP
Remove-AzPublicIpAddress -Name $oldPublicIpName -ResourceGroupName $groupname -Force

# Replace DNS settings of Public IP address related to new Primary Node Type with DNS settings of Public IP address related to old Primary Node Type
$PublicIP = Get-AzPublicIpAddress -Name $newPublicIpName  -ResourceGroupName $groupname
$PublicIP.DnsSettings.DomainNameLabel = $primaryDNSName
$PublicIP.DnsSettings.Fqdn = $primaryDNSFqdn
Set-AzPublicIpAddress -PublicIpAddress $PublicIP

# Check the cluster health
Get-ServiceFabricClusterHealth

# Remove node state for the deleted nodes.
foreach($name in $nodeNames){
    # Remove the node from the cluster
    Remove-ServiceFabricNodeState -NodeName $name -TimeoutSec 300 -Force
    Write-Host "Removed node state for node $name"
}
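
As a final check after the script above, the following sketch lists the nodes the cluster still reports and warns if any node from the old scale set is still present. It assumes the connection made with Connect-ServiceFabricCluster is still open and that the old nodes use the `_NTvm1_` name prefix shown in the script.

```powershell
# Assumption: still connected to the cluster via Connect-ServiceFabricCluster.
# Show the nodes that remain, including which ones now act as seed nodes.
Get-ServiceFabricNode |
    Select-Object NodeName, NodeType, NodeStatus, HealthState, IsSeedNode |
    Format-Table -AutoSize

# Warn if any node from the old scale set is still listed.
$leftover = Get-ServiceFabricNode | Where-Object { $_.NodeName -like "_NTvm1_*" }
if ($leftover) {
    Write-Warning "Old scale set nodes still listed: $($leftover.NodeName -join ', ')"
}
```

The same `-like` filter could also be used to build the `$nodeNames` list instead of hard-coding the node names.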

Next steps