Patch the Windows operating system in your Service Fabric cluster

Important

Application version 1.2.* is going out of support on 30 April 2019. Please upgrade to the latest version.

Azure virtual machine scale set automatic OS image upgrades are the best practice for keeping your operating systems patched in Azure. The Patch Orchestration Application (POA) is a wrapper around the Service Fabric Repair Manager system service that enables configuration-based OS patch scheduling for non-Azure hosted clusters. POA is not required for non-Azure hosted clusters, but scheduling patch installation by upgrade domain is required to patch Service Fabric cluster hosts without downtime.

POA is an Azure Service Fabric application that automates operating system patching on a Service Fabric cluster without downtime.

The patch orchestration app provides the following features:

  • Automatic operating system update installation. Operating system updates are automatically downloaded and installed. Cluster nodes are rebooted as needed without cluster downtime.

  • Cluster-aware patching and health integration. While applying updates, the patch orchestration app monitors the health of the cluster nodes. Cluster nodes are upgraded one node at a time or one upgrade domain at a time. If the health of the cluster degrades because of the patching process, patching is stopped to avoid aggravating the problem.

Internal details of the app

The patch orchestration app is composed of the following subcomponents:

  • Coordinator Service: This stateful service is responsible for:
    • Coordinating the Windows Update job on the entire cluster.
    • Storing the result of completed Windows Update operations.
  • Node Agent Service: This stateless service runs on all Service Fabric cluster nodes. The service is responsible for:
    • Bootstrapping the Node Agent NTService.
    • Monitoring the Node Agent NTService.
  • Node Agent NTService: This Windows NT service runs at a higher-level privilege (SYSTEM). In contrast, the Node Agent Service and the Coordinator Service run at a lower-level privilege (NETWORK SERVICE). The service is responsible for performing the following Windows Update jobs on all the cluster nodes:
    • Disabling automatic Windows Update on the node.
    • Downloading and installing Windows updates according to the policy the user has provided.
    • Restarting the machine after Windows Update installation.
    • Uploading the results of Windows updates to the Coordinator Service.
    • Reporting health if an operation has failed after exhausting all retries.

Note

The patch orchestration app uses the Service Fabric repair manager system service to disable or enable the node and perform health checks. The repair task created by the patch orchestration app tracks the Windows Update progress for each node.

Prerequisites

Note

The minimum required .NET Framework version is 4.6.

Enable the repair manager service (if it's not running already)

The patch orchestration app requires the repair manager system service to be enabled on the cluster.

Azure clusters

Azure clusters in the silver durability tier have the repair manager service enabled by default. Azure clusters in the gold durability tier might or might not have the repair manager service enabled, depending on when those clusters were created. Azure clusters in the bronze durability tier, by default, do not have the repair manager service enabled. If the service is already enabled, you can see it running in the system services section in Service Fabric Explorer.

Azure portal

You can enable repair manager from the Azure portal when you set up the cluster. Select the Include Repair Manager option under Add-on features at the time of cluster configuration.

Image of enabling Repair Manager from the Azure portal

Azure Resource Manager deployment model

Alternatively, you can use the Azure Resource Manager deployment model to enable the repair manager service on new and existing Service Fabric clusters. Get the template for the cluster that you want to deploy. You can either use the sample templates or create a custom Azure Resource Manager deployment model template.

To enable the repair manager service by using an Azure Resource Manager deployment model template:

  1. First check that the apiVersion is set to 2017-07-01-preview for the Microsoft.ServiceFabric/clusters resource. If it is different, update the apiVersion to 2017-07-01-preview or higher:

    {
        "apiVersion": "2017-07-01-preview",
        "type": "Microsoft.ServiceFabric/clusters",
        "name": "[parameters('clusterName')]",
        "location": "[parameters('clusterLocation')]",
        ...
    }
    
  2. Now enable the repair manager service by adding the following addonFeatures section after the fabricSettings section:

    "fabricSettings": [
        ...      
    ],
    "addonFeatures": [
        "RepairManager"
    ],
    
  3. After you have updated your cluster template with these changes, apply them and let the upgrade finish. You can now see the repair manager system service running in your cluster. It is called fabric:/System/RepairManagerService in the system services section in Service Fabric Explorer.
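Before applying the template, the two changes above can be checked mechanically. The following is a minimal sketch in Python, not part of POA or any Azure tooling: the function names are hypothetical, and it assumes the template is loaded as a dictionary whose cluster resource keeps fabricSettings and addonFeatures under a "properties" key.

```python
def api_version_ok(api_version: str, minimum: str = "2017-07-01") -> bool:
    """Compare the YYYY-MM-DD prefix; ISO dates order correctly as strings."""
    return api_version[:10] >= minimum


def repair_manager_enabled(template: dict) -> bool:
    """Return True if a Microsoft.ServiceFabric/clusters resource lists RepairManager.

    Assumption: addonFeatures sits under the resource's "properties" key.
    Key and value casing is compared case-insensitively.
    """
    for resource in template.get("resources", []):
        if resource.get("type", "").lower() != "microsoft.servicefabric/clusters":
            continue
        props = {key.lower(): value for key, value in resource.get("properties", {}).items()}
        features = props.get("addonfeatures", [])
        if any(feature.lower() == "repairmanager" for feature in features):
            return True
    return False


# Hypothetical, heavily trimmed template used only to exercise the checks.
sample_template = {
    "resources": [
        {
            "apiVersion": "2017-07-01-preview",
            "type": "Microsoft.ServiceFabric/clusters",
            "properties": {"fabricSettings": [], "addonFeatures": ["RepairManager"]},
        }
    ]
}
print(repair_manager_enabled(sample_template))
```

A real template has many more fields; the sketch only looks at the two values that the steps above change.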

Standalone on-premises clusters

You can use the configuration settings for standalone Windows clusters to enable the repair manager service on new and existing Service Fabric clusters.

To enable the repair manager service:

  1. First check that the apiVersion in the general cluster configuration is set to 04-2017 or higher:

    {
        "name": "SampleCluster",
        "clusterConfigurationVersion": "1.0.0",
        "apiVersion": "04-2017",
        ...
    }
    
  2. Now enable the repair manager service by adding the following addonFeatures section after the fabricSettings section:

    "fabricSettings": [
        ...      
    ],
    "addonFeatures": [
        "RepairManager"
    ],
    
  3. Update your cluster manifest with these changes, and then use the updated manifest to create a new cluster or upgrade the cluster configuration. Once the cluster is running with the updated cluster manifest, you can see the repair manager system service running in your cluster. It is called fabric:/System/RepairManagerService in the system services section in Service Fabric Explorer.
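The standalone apiVersion uses a month-year form ("04-2017"), so "or higher" cannot be checked with a plain string comparison. The following minimal Python sketch, with hypothetical helper names and assuming the value always has the form MM-YYYY, compares year first and month second.

```python
from typing import Tuple


def parse_api_version(value: str) -> Tuple[int, int]:
    """Split an MM-YYYY version into a (year, month) tuple that sorts chronologically."""
    month, year = value.split("-")
    return int(year), int(month)


def meets_minimum(api_version: str, minimum: str = "04-2017") -> bool:
    """Return True if api_version is the same as or later than the minimum."""
    return parse_api_version(api_version) >= parse_api_version(minimum)


# The version strings below are illustrative inputs, not a list of released versions.
for candidate in ("04-2017", "10-2016", "01-2018"):
    print(candidate, meets_minimum(candidate))
```

Comparing the raw strings would wrongly rank "10-2016" above "04-2017", which is why the sketch reorders the parts before comparing.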

Configure Windows Updates for all nodes

Automatic Windows Updates might lead to availability loss because multiple cluster nodes can restart at the same time. The patch orchestration app, by default, tries to disable automatic Windows Update on each cluster node. However, if the settings are managed by an administrator or Group Policy, we recommend setting the Windows Update policy to "Notify before Download" explicitly.

Download the app package

To download the application package, visit the GitHub release page of the Patch Orchestration Application.

Configure the app

The behavior of the patch orchestration app can be configured to meet your needs. Override the default values by passing in the application parameter during application creation or update. Application parameters can be provided by specifying ApplicationParameter to the Start-ServiceFabricApplicationUpgrade or New-ServiceFabricApplication cmdlets.

Parameter | Type | Details
--- | --- | ---
MaxResultsToCache | Long | Maximum number of Windows Update results to cache. The default value is 3000, assuming that: the number of nodes is 20; the number of updates on a node per month is five; the number of results per operation can be 10; and results for the past three months should be stored.
TaskApprovalPolicy | Enum { NodeWise, UpgradeDomainWise } | The policy that the Coordinator Service uses to install Windows updates across the Service Fabric cluster nodes. Allowed values: NodeWise (Windows Update is installed one node at a time) and UpgradeDomainWise (Windows Update is installed one upgrade domain at a time; at most, all the nodes belonging to an upgrade domain can go for Windows Update at once). Refer to the FAQ section to decide which policy is best suited for your cluster.
LogsDiskQuotaInMB | Long (Default: 1024) | Maximum size, in MB, of patch orchestration app logs that can be persisted locally on nodes.
WUQuery | string (Default: "IsInstalled=0") | Query to get Windows updates. For more information, see WuQuery.
InstallWindowsOSOnlyUpdates | Boolean (Default: false) | Controls which updates are downloaded and installed. true: installs only Windows operating system updates. false: installs all the available updates on the machine.
WUOperationTimeOutInMinutes | Int (Default: 90) | Timeout for any Windows Update operation (search, download, or install). If the operation does not complete within the specified timeout, it is aborted.
WURescheduleCount | Int (Default: 5) | Maximum number of times the service reschedules the Windows update if an operation fails persistently.
WURescheduleTimeInMinutes | Int (Default: 30) | Interval at which the service reschedules the Windows update if the failure persists.
WUFrequency | Comma-separated string (Default: "Weekly, Wednesday, 7:00:00") | Frequency for installing Windows Update. Format and possible values: "Monthly, DD, HH:MM:SS" (for example, "Monthly, 5, 12:22:32"; permitted values for DD (day) are numbers in the range 1-28 and "last"); "Weekly, DAY, HH:MM:SS" (for example, "Weekly, Tuesday, 12:22:32"); "Daily, HH:MM:SS" (for example, "Daily, 12:22:32"); "None" (Windows Update shouldn't be done). Times are in UTC.
AcceptWindowsUpdateEula | Boolean (Default: true) | By setting this flag, the application accepts the End-User License Agreement for Windows Update on behalf of the owner of the machine.
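The default of 3000 for MaxResultsToCache is the product of the four assumptions listed for that parameter. The short Python sketch below reproduces the arithmetic so that you can size the cache for a different cluster; the function name is hypothetical and the formula is inferred from the stated assumptions rather than taken from the app's source.

```python
def max_results_to_cache(nodes: int,
                         updates_per_node_per_month: int,
                         results_per_operation: int,
                         months_retained: int) -> int:
    """Multiply the sizing assumptions to get a cache size (number of results)."""
    return nodes * updates_per_node_per_month * results_per_operation * months_retained


# Documented defaults: 20 nodes, 5 updates per month, 10 results each, 3 months.
print(max_results_to_cache(20, 5, 10, 3))  # 3000
```

For example, a cluster with 40 nodes and otherwise identical assumptions would need twice the default value.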

Tip

If you want Windows Update to happen immediately, set WUFrequency relative to the application deployment time. For example, suppose that you have a five-node test cluster and plan to deploy the app at around 5:00 PM UTC. If you assume that the application upgrade or deployment takes at most 30 minutes, set WUFrequency to "Daily, 17:30:00".
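Because WUFrequency is a free-form string, it can help to validate a value before passing it as an application parameter. The following Python sketch checks the four documented shapes (Monthly, Weekly, Daily, None). It is an illustration only: the names are hypothetical, and its leniency about letter case and spacing is an assumption, not a statement about how the app itself parses the value.

```python
import re
from typing import NamedTuple, Optional

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
TIME_PATTERN = re.compile(r"^(\d{1,2}):(\d{2}):(\d{2})$")


class Frequency(NamedTuple):
    kind: str                 # "Monthly", "Weekly", "Daily", or "None"
    day: Optional[str]        # day of month ("1".."28" or "last") or weekday name
    time_utc: Optional[str]   # normalized HH:MM:SS, in UTC


def _normalize_time(text: str) -> str:
    """Validate H:MM:SS or HH:MM:SS and return it zero-padded."""
    match = TIME_PATTERN.match(text)
    if not match:
        raise ValueError(f"invalid time: {text!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    if hours > 23 or minutes > 59 or seconds > 59:
        raise ValueError(f"time out of range: {text!r}")
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def parse_wu_frequency(value: str) -> Frequency:
    """Parse a WUFrequency string into its kind, day, and UTC time."""
    parts = [part.strip() for part in value.split(",")]
    kind = parts[0].lower()
    if kind == "none" and len(parts) == 1:
        return Frequency("None", None, None)
    if kind == "daily" and len(parts) == 2:
        return Frequency("Daily", None, _normalize_time(parts[1]))
    if kind == "weekly" and len(parts) == 3:
        if parts[1].lower() not in WEEKDAYS:
            raise ValueError(f"invalid weekday: {parts[1]!r}")
        return Frequency("Weekly", parts[1].capitalize(), _normalize_time(parts[2]))
    if kind == "monthly" and len(parts) == 3:
        day = parts[1].lower()
        if day != "last" and not (day.isdigit() and 1 <= int(day) <= 28):
            raise ValueError(f"invalid day of month: {parts[1]!r}")
        return Frequency("Monthly", day, _normalize_time(parts[2]))
    raise ValueError(f"unrecognized WUFrequency: {value!r}")


print(parse_wu_frequency("Daily, 17:30:00"))
```

Running the check on "Monthly, 29, 10:00:00" raises ValueError, because only days 1-28 and "last" are permitted.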

Deploy the app

  1. Finish all the prerequisite steps to prepare the cluster.

  2. Deploy the patch orchestration app like any other Service Fabric app. You can deploy the app by using PowerShell. Follow the steps in Deploy and remove applications using PowerShell.

  3. To configure the application at the time of deployment, pass the ApplicationParameter to the New-ServiceFabricApplication cmdlet. For your convenience, the script Deploy.ps1 is provided along with the application. To use the script:

    • Connect to a Service Fabric cluster by using Connect-ServiceFabricCluster.
    • Execute the PowerShell script Deploy.ps1 with the appropriate ApplicationParameter value.

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.

Upgrade the app

To upgrade an existing patch orchestration app by using PowerShell, follow the steps in Service Fabric application upgrade using PowerShell.

Remove the app

To remove the application, follow the steps in Deploy and remove applications using PowerShell.

For your convenience, the script Undeploy.ps1 is provided along with the application. To use the script:

  • Connect to a Service Fabric cluster by using Connect-ServiceFabricCluster.

  • Execute the PowerShell script Undeploy.ps1.

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.

View the Windows Update results

The patch orchestration app exposes REST APIs to display the historical results to the user. An example of the result JSON:

[
  {
    "NodeName": "_stg1vm_1",
    "WindowsUpdateOperationResults": [
      {
        "OperationResult": 0,
        "NodeName": "_stg1vm_1",
        "OperationTime": "2019-05-13T08:44:56.4836889Z",
        "OperationStartTime": "2019-05-13T08:44:33.5285601Z",
        "UpdateDetails": [
          {
            "UpdateId": "7392acaf-6a85-427c-8a8d-058c25beb0d6",
            "Title": "Cumulative Security Update for Internet Explorer 11 for Windows Server 2012 R2 (KB3185319)",
            "Description": "A security issue has been identified in a Microsoft software product that could affect your system. You can help protect your system by installing this update from Microsoft. For a complete listing of the issues that are included in this update, see the associated Microsoft Knowledge Base article. After you install this update, you may have to restart your system.",
            "ResultCode": 0,
            "HResult": 0
          }
        ],
        "OperationType": 1,
        "WindowsUpdateQuery": "IsInstalled=0",
        "WindowsUpdateFrequency": "Daily,10:00:00",
        "RebootRequired": false
      }
    ]
  },
  ...
]

The fields of the JSON are described below.

Field | Values | Details
--- | --- | ---
OperationResult | 0 - Succeeded; 1 - Succeeded With Errors; 2 - Failed; 3 - Aborted; 4 - Aborted With Timeout | Indicates the result of the overall operation (typically involving installation of one or more updates).
ResultCode | Same as OperationResult | Indicates the result of the installation operation for an individual update.
OperationType | 1 - Installation; 0 - Search and Download | Installation is the only OperationType shown in the results by default.
WindowsUpdateQuery | Default is "IsInstalled=0" | Windows Update query that was used to search for updates. For more information, see WuQuery.
RebootRequired | true - reboot was required; false - reboot was not required | Indicates whether a reboot was required to complete installation of updates.
OperationStartTime | DateTime | Indicates the time at which the operation (download/installation) started.
OperationTime | DateTime | Indicates the time at which the operation (download/installation) completed.
HResult | 0 - Successful; other - failure | Indicates the reason for failure of the Windows update with updateID "7392acaf-6a85-427c-8a8d-058c25beb0d6".

If no update is scheduled yet, the result JSON is empty.
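The numeric codes in the result JSON are easier to read once mapped back to their names. The following Python sketch turns a result document into one summary row per operation. The function name is hypothetical, the sample input is a trimmed copy of the example JSON, and treating an empty body the same as an empty list is an assumption.

```python
import json

# Code-to-name mapping for the OperationResult field.
OPERATION_RESULT = {
    0: "Succeeded",
    1: "Succeeded With Errors",
    2: "Failed",
    3: "Aborted",
    4: "Aborted With Timeout",
}


def summarize(results_json: str) -> list:
    """Return (node, result name, update count, reboot required) per operation."""
    rows = []
    for node in json.loads(results_json or "[]"):
        for operation in node.get("WindowsUpdateOperationResults", []):
            rows.append((
                node["NodeName"],
                OPERATION_RESULT.get(operation["OperationResult"], "Unknown"),
                len(operation.get("UpdateDetails", [])),
                operation.get("RebootRequired", False),
            ))
    return rows


sample = json.dumps([{
    "NodeName": "_stg1vm_1",
    "WindowsUpdateOperationResults": [{
        "OperationResult": 0,
        "UpdateDetails": [{"UpdateId": "7392acaf-6a85-427c-8a8d-058c25beb0d6",
                           "ResultCode": 0, "HResult": 0}],
        "OperationType": 1,
        "RebootRequired": False,
    }],
}])
print(summarize(sample))
```

An empty result, as returned before any update is scheduled, yields an empty list of rows.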

Sign in to the cluster to query Windows Update results. Then find the replica address for the primary of the Coordinator Service, and open this URL in a browser: http://<REPLICA-IP>:<ApplicationPort>/PatchOrchestrationApplication/v1/GetWindowsUpdateResults.

The REST endpoint for the Coordinator Service has a dynamic port. To check the exact URL, refer to Service Fabric Explorer. For example, the results are available at http://10.0.0.7:20000/PatchOrchestrationApplication/v1/GetWindowsUpdateResults.

Image of the REST endpoint

If the reverse proxy is enabled on the cluster, you can access the URL from outside the cluster as well. The endpoint to call is http://<SERVERURL>:<REVERSEPROXYPORT>/PatchOrchestrationApplication/CoordinatorService/v1/GetWindowsUpdateResults.
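The two URL shapes above differ only in host, port, and the extra CoordinatorService path segment used through the reverse proxy. The following Python sketch builds both and fetches the results with the standard library. The function names are hypothetical, and the assumptions that the endpoint needs no authentication and returns UTF-8 JSON are not confirmed by this article.

```python
import json
import urllib.request

RESULTS_PATH = "v1/GetWindowsUpdateResults"


def direct_url(replica_ip: str, application_port: int) -> str:
    """URL of the Coordinator Service primary replica, reachable inside the cluster."""
    return f"http://{replica_ip}:{application_port}/PatchOrchestrationApplication/{RESULTS_PATH}"


def reverse_proxy_url(server: str, reverse_proxy_port: int) -> str:
    """URL through the Service Fabric reverse proxy, reachable from outside the cluster."""
    return (f"http://{server}:{reverse_proxy_port}"
            f"/PatchOrchestrationApplication/CoordinatorService/{RESULTS_PATH}")


def fetch_results(url: str, timeout_seconds: float = 10.0) -> list:
    """GET the results; an empty body is treated as 'no results yet' (assumption)."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        body = response.read().decode("utf-8").strip()
    return json.loads(body) if body else []


print(direct_url("10.0.0.7", 20000))
```

Because the Coordinator Service port is dynamic, look up the current address in Service Fabric Explorer before calling direct_url.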

To enable the reverse proxy on the cluster, follow the steps in Reverse proxy in Azure Service Fabric.

Warning

After the reverse proxy is configured, all microservices in the cluster that expose an HTTP endpoint are addressable from outside the cluster.

Diagnostics/health events

The following section describes how to debug and diagnose issues with patch updates through the Patch Orchestration Application on Service Fabric clusters.

Note

You need POA v1.4.0 or later installed to get many of the self-diagnostic improvements described below.

The NodeAgentNTService creates repair tasks to install updates on the nodes. Each task is then prepared by the CoordinatorService according to the task approval policy. The prepared tasks are finally approved by Repair Manager, which does not approve any task if the cluster is in an unhealthy state. Let's go step by step to understand how updates proceed on a node.

  1. NodeAgentNTService, running on every node, looks for available Windows updates at the scheduled time. If updates are available, it downloads them on the node.

  2. Once the updates are downloaded, NodeAgentNTService creates a corresponding repair task for the node with the name POS___<unique_id>. You can view these repair tasks by using the cmdlet Get-ServiceFabricRepairTask or in SFX in the node details section. Once created, the repair task quickly moves to the Claimed state.

  3. The Coordinator Service periodically looks for repair tasks in the Claimed state and updates them to the Preparing state based on the TaskApprovalPolicy. If the TaskApprovalPolicy is NodeWise, a repair task corresponding to a node is prepared only if no other repair task is currently in the Preparing/Approved/Executing/Restoring state. Similarly, if the TaskApprovalPolicy is UpgradeDomainWise, it is ensured that at any point there are tasks in the above states only for nodes that belong to the same upgrade domain. Once a repair task is moved to the Preparing state, the corresponding Service Fabric node is disabled with intent "Restart".

    POA (v1.4.0 and later) posts events with property "ClusterPatchingStatus" on the CoordinatorService to display the nodes that are being patched. The image below shows that updates are being installed on _poanode_0:

    Image of cluster patching status

  4. Once the node is disabled, the repair task is moved to the Executing state. Note that a repair task stuck in the Preparing state, because a node is stuck in the Disabling state, can block new repair tasks and hence halt patching of the cluster.

  5. Once the repair task is in the Executing state, the patch installation on that node begins. After the patch is installed, the node may or may not be restarted, depending on the patch. The repair task is then moved to the Restoring state, which re-enables the node, and is then marked as Completed.

    In v1.4.0 and later versions of the application, the status of the update can be found by looking at the health events on NodeAgentService with property "WUOperationStatus-[NodeName]". The highlighted sections in the images below show the status of Windows Update on nodes 'poanode_0' and 'poanode_2':

    Image of Windows update operation status

    Image of Windows update operation status

    You can also get the details by using PowerShell: connect to the cluster and fetch the state of the repair task with Get-ServiceFabricRepairTask. The example below shows that the task "POS__poanode_2_125f2969-933c-4774-85d1-ebdf85e79f15" is in the DownloadCompleted state. This means that updates have been downloaded on the node "poanode_2" and installation will be attempted once the task moves to the Executing state.

    D:\service-fabric-poa-bin\service-fabric-poa-bin\Release> $k = Get-ServiceFabricRepairTask -TaskId "POS__poanode_2_125f2969-933c-4774-85d1-ebdf85e79f15"
    
    D:\service-fabric-poa-bin\service-fabric-poa-bin\Release> $k.ExecutorData
    {"ExecutorSubState":2,"ExecutorTimeoutInMinutes":90,"RestartRequestedTime":"0001-01-01T00:00:00"}
    

    If more information is needed, sign in to the specific VM or VMs and use the Windows event logs to learn more about the issue. The repair task mentioned above can have only these executor sub-states:

    ExecutorSubState | Detail
    --- | ---
    None=1 | Implies that there wasn't an ongoing operation on the node. The state might be in transition.
    DownloadCompleted=2 | Implies that the download operation has completed with success, partial failure, or failure.
    InstallationApproved=3 | Implies that the download operation was completed earlier and Repair Manager has approved the installation.
    InstallationInProgress=4 | Corresponds to the Executing state of the repair task.
    InstallationCompleted=5 | Implies that installation completed with success, partial success, or failure.
    RestartRequested=6 | Implies that patch installation completed and there is a pending restart action on the node.
    RestartNotNeeded=7 | Implies that a restart was not needed after completion of patch installation.
    RestartCompleted=8 | Implies that the restart completed successfully.
    OperationCompleted=9 | Implies that the Windows Update operation completed successfully.
    OperationAborted=10 | Implies that the Windows Update operation was aborted.

  6. In v1.4.0 and later versions of the application, when an update attempt on a node completes, an event with property "WUOperationStatus-[NodeName]" is posted on the NodeAgentService to announce when the next attempt to download and install updates will start. See the image below:

    Image of Windows update operation status
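The ExecutorData string returned by Get-ServiceFabricRepairTask carries the sub-state as a bare number. The following Python sketch decodes such a string into the sub-state name from the table in step 5. The function name is hypothetical, and the input is the sample ExecutorData value shown in that step.

```python
import json

# Numeric ExecutorSubState values and their names.
EXECUTOR_SUB_STATES = {
    1: "None",
    2: "DownloadCompleted",
    3: "InstallationApproved",
    4: "InstallationInProgress",
    5: "InstallationCompleted",
    6: "RestartRequested",
    7: "RestartNotNeeded",
    8: "RestartCompleted",
    9: "OperationCompleted",
    10: "OperationAborted",
}


def decode_executor_data(raw: str) -> tuple:
    """Return (sub-state name, executor timeout in minutes) from an ExecutorData string."""
    data = json.loads(raw)
    name = EXECUTOR_SUB_STATES.get(data["ExecutorSubState"], "Unknown")
    return name, data["ExecutorTimeoutInMinutes"]


sample_executor_data = (
    '{"ExecutorSubState":2,"ExecutorTimeoutInMinutes":90,'
    '"RestartRequestedTime":"0001-01-01T00:00:00"}'
)
print(decode_executor_data(sample_executor_data))  # ('DownloadCompleted', 90)
```

A value outside 1-10 is reported as "Unknown" rather than raising, which is a design choice of this sketch.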

Diagnostic logs

Patch orchestration app logs are collected as part of Service Fabric runtime logs.

If you want to capture logs through a diagnostic tool or pipeline of your choice, note that the patch orchestration application uses the following fixed provider IDs to log events through EventSource:

  • e39b723c-590c-4090-abb0-11e3e6616346
  • fc0028ff-bfdc-499f-80dc-ed922c52c5e9
  • 24afa313-0d3b-4c7c-b485-1047fd964b60
  • 05dc046c-60e9-4ef7-965e-91660adffa68

Health reports

The patch orchestration app also publishes health reports against the Coordinator Service or the Node Agent Service in the following cases:

The Node Agent NTService is down

If the Node Agent NTService is down on a node, a warning-level health report is generated against the Node Agent Service.

The repair manager service is not enabled

If the repair manager service is not found on the cluster, a warning-level health report is generated for the Coordinator Service.

常见问题Frequently asked questions

问:Q. 为什么在修补业务流程应用运行时,我发现群集处于错误状态? Why do I see my cluster in an error state when the patch orchestration app is running?

A.A. 在安装过程中,修补业务流程应用会禁用或重启节点,这可能会暂时导致群集的运行状况变差。During the installation process, the patch orchestration app disables or restarts nodes, which can temporarily result in the health of the cluster going down.

根据应用程序的策略,执行修补操作期间可以让一个节点关闭,也可以让整个升级域同时关闭。 Based on the policy for the application, either one node can go down during a patching operation or an entire upgrade domain can go down simultaneously.

在 Windows 更新安装结束时,节点会在重启后重新启用。By the end of the Windows Update installation, the nodes are reenabled post restart.

在以下示例中,由于两个节点关闭且违反了 MaxPercentageUnhealthyNodes 策略,群集暂时进入了错误状态。In the following example, the cluster went to an error state temporarily because two nodes were down and the MaxPercentageUnhealthyNodes policy was violated. 这是暂时性错误,在修补操作继续后即可恢复。The error is temporary until the patching operation is ongoing.

不正常群集的图像

如果问题持续出现,请参阅“故障排除”部分。If the issue persists, refer to the Troubleshooting section.

问:Q. 修补业务流程应用处于警告状态 Patch orchestration app is in warning state

A.A. 查看针对应用程序发布的运行状况报告所报告的情况是否是根本原因。Check to see if a health report posted against the application is the root cause. 通常,警告中会包含问题的详细信息。Usually, the warning contains details of the problem. 如果该问题是暂时性的,则应用程序会自动从此状态中恢复。If the issue is transient, the application is expected to auto-recover from this state.

问:Q. 如果群集运行不正常,而我需要进行紧急的操作系统更新,该怎么办? What can I do if my cluster is unhealthy and I need to do an urgent operating system update?

A.A. 群集运行不正常时,修补业务流程应用不会安装更新。The patch orchestration app does not install updates while the cluster is unhealthy. 请尝试将群集恢复正常状态,消除修补业务流程应用工作流的阻碍。Try to bring your cluster to a healthy state to unblock the patch orchestration app workflow.

Q. Should I set TaskApprovalPolicy to 'NodeWise' or 'UpgradeDomainWise' for my cluster?

A. 'UpgradeDomainWise' makes overall cluster patching faster by patching all the nodes that belong to an upgrade domain in parallel. This means that all nodes in an entire upgrade domain are unavailable (in the Disabled state) during the patching process.

In contrast, the 'NodeWise' policy patches only one node at a time, so overall cluster patching takes longer. However, at most one node is unavailable (in the Disabled state) during the patching process.

If your cluster can tolerate running on N-1 upgrade domains during the patching cycle (where N is the total number of upgrade domains in your cluster), you can set the policy to 'UpgradeDomainWise'. Otherwise, set it to 'NodeWise'.
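The decision rule above can be sketched in a few lines of Python. This is only an illustration, not part of the patch orchestration app: the function names and the `max_tolerable_disabled_nodes` input are hypothetical, and the sketch assumes that "tolerating N-1 upgrade domains" can be approximated by how many nodes your workload can lose at once.

```python
def max_nodes_disabled_at_once(nodes_per_upgrade_domain, policy):
    """Worst-case number of nodes in the Disabled state at the same time.

    nodes_per_upgrade_domain: list with the node count of each upgrade domain.
    policy: 'NodeWise' or 'UpgradeDomainWise' (the TaskApprovalPolicy values).
    """
    if policy == "NodeWise":
        return 1  # one node is patched at a time
    if policy == "UpgradeDomainWise":
        return max(nodes_per_upgrade_domain)  # a whole upgrade domain at a time
    raise ValueError(f"unknown TaskApprovalPolicy: {policy!r}")


def suggest_policy(nodes_per_upgrade_domain, max_tolerable_disabled_nodes):
    """Hypothetical helper: pick the faster policy only if the cluster can
    tolerate losing its largest upgrade domain during patching."""
    largest = max_nodes_disabled_at_once(nodes_per_upgrade_domain, "UpgradeDomainWise")
    if largest <= max_tolerable_disabled_nodes:
        return "UpgradeDomainWise"
    return "NodeWise"


if __name__ == "__main__":
    upgrade_domains = [4, 4, 4, 4, 4]  # 5 upgrade domains with 4 nodes each
    print(suggest_policy(upgrade_domains, max_tolerable_disabled_nodes=4))
    print(suggest_policy(upgrade_domains, max_tolerable_disabled_nodes=2))
```

In practice, the tolerable number of disabled nodes depends on replica counts and capacity headroom, which this sketch does not model.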

Q. How much time does it take to patch a node?

A. Patching a node can take from minutes (for example, Windows Defender definition updates) to hours (for example, Windows cumulative updates). The time required to patch a node depends mostly on:

  • The size of the updates.
  • The number of updates that have to be applied in a patching window.
  • The time it takes to install the updates, reboot the node (if required), and finish post-reboot installation steps.
  • The performance of the VM or machine and network conditions.

Q. How long does it take to patch an entire cluster?

A. The time needed to patch an entire cluster depends on the following factors:

  • The time needed to patch a node.
  • The policy of the Coordinator Service. The default policy, NodeWise, patches only one node at a time, which is slower than UpgradeDomainWise. For example, suppose a node takes ~1 hour to patch and the cluster has 20 nodes of the same type in 5 upgrade domains, each containing 4 nodes:
    • With NodeWise, patching the entire cluster should take ~20 hours.
    • With UpgradeDomainWise, it should take ~5 hours.
  • Cluster load. Each patching operation requires relocating the customer workload to other available nodes in the cluster. The node being patched is in the Disabling state during this time. If the cluster is running near peak load, the disabling process takes longer, so the overall patching process may appear slow under such stressed conditions.
  • Cluster health failures during patching. Any degradation in the health of the cluster interrupts the patching process, which adds to the overall time required to patch the entire cluster.
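The arithmetic behind the 20-hour and 5-hour figures can be written as a short Python sketch. It is a rough best-case estimate under stated assumptions: every node takes the same time to patch, nodes in an upgrade domain are patched fully in parallel, and cluster load and health interruptions are ignored. The function name is illustrative and is not part of the application.

```python
def estimate_cluster_patch_hours(nodes_per_upgrade_domain, hours_per_node, policy):
    """Best-case time to patch the whole cluster, in hours.

    nodes_per_upgrade_domain: list with the node count of each upgrade domain.
    hours_per_node: assumed time to patch a single node.
    policy: 'NodeWise' or 'UpgradeDomainWise'.
    """
    if policy == "NodeWise":
        # Nodes are patched strictly one after another.
        return sum(nodes_per_upgrade_domain) * hours_per_node
    if policy == "UpgradeDomainWise":
        # Upgrade domains are patched one after another; the nodes inside
        # each domain are assumed to be patched in parallel.
        return len(nodes_per_upgrade_domain) * hours_per_node
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    upgrade_domains = [4, 4, 4, 4, 4]  # 20 nodes in 5 upgrade domains
    print(estimate_cluster_patch_hours(upgrade_domains, 1.0, "NodeWise"))           # 20.0
    print(estimate_cluster_patch_hours(upgrade_domains, 1.0, "UpgradeDomainWise"))  # 5.0
```

Real patching cycles usually take longer than this estimate, because disabling nodes under load and any health-related pauses add time on top of the per-node figure.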

Q. Why do I see some updates in the Windows Update results obtained via the REST API, but not under the Windows Update history on the machine?

A. Some product updates appear only in their own update or patch history. For example, Windows Defender updates may or may not show up in the Windows Update history on Windows Server 2016.

Q. Can the patch orchestration app be used to patch my dev cluster (one-node cluster)?

A. No, the patch orchestration app cannot be used to patch a one-node cluster. This limitation is by design: Service Fabric system services or customer apps would face downtime, so Repair Manager would never approve a repair job for patching.

Q. Why is the update cycle taking so long?

A. Query for the result JSON, go through the update-cycle entries for all nodes, and use OperationStartTime and OperationTime (OperationCompletionTime) to find out how long update installation took on each node. If there was a long window in which no update was running, the cluster may have been in an error state, in which case Repair Manager did not approve any other POA repair tasks. If update installation took a long time on a node, that node may not have been updated for a long time and had many updates pending installation. Patching on a node can also be blocked because the node is stuck in the Disabling state, which usually happens because disabling the node might lead to quorum or data loss.
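As a rough illustration of that analysis, the Python sketch below computes the installation time per node and the idle gaps between consecutive operations. Only the OperationStartTime and OperationTime field names come from the text above. The flat list layout, the NodeName key, the sample values, and the timestamp format are assumptions made for the example, so adapt the parsing to the JSON your cluster actually returns.

```python
import json
from datetime import datetime

# Hypothetical, simplified result data; real results may be nested differently.
SAMPLE_RESULT = """
[
  {"NodeName": "node0", "OperationStartTime": "2019-05-13T08:00:00Z",
   "OperationTime": "2019-05-13T08:45:00Z"},
  {"NodeName": "node1", "OperationStartTime": "2019-05-13T11:00:00Z",
   "OperationTime": "2019-05-13T11:30:00Z"}
]
"""


def parse_timestamp(text):
    """Parse an assumed UTC ISO 8601 timestamp; fractional seconds are dropped."""
    base = text.rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")


def install_durations(entries):
    """Return (node, start, end, minutes) tuples sorted by start time."""
    rows = []
    for entry in entries:
        start = parse_timestamp(entry["OperationStartTime"])
        end = parse_timestamp(entry["OperationTime"])
        rows.append((entry["NodeName"], start, end, (end - start).total_seconds() / 60))
    return sorted(rows, key=lambda row: row[1])


def idle_gaps(rows):
    """Return (previous_node, next_node, minutes) for gaps with no update running."""
    gaps = []
    for previous, current in zip(rows, rows[1:]):
        minutes = (current[1] - previous[2]).total_seconds() / 60
        if minutes > 0:
            gaps.append((previous[0], current[0], minutes))
    return gaps


if __name__ == "__main__":
    rows = install_durations(json.loads(SAMPLE_RESULT))
    for node, _, _, minutes in rows:
        print(f"{node}: installation took {minutes:.0f} minutes")
    for before, after, minutes in idle_gaps(rows):
        print(f"idle for {minutes:.0f} minutes between {before} and {after}")
```

A long idle gap points toward the cluster-health explanation, while a long duration on a single node points toward a large backlog of pending updates or a node stuck in the Disabling state.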

Q. Why is it required to disable the node when POA is patching it?

A. The patch orchestration application disables the node with the 'restart' intent, which stops or reallocates all the Service Fabric services running on the node. This ensures that applications do not end up using a mix of new and old DLLs, so patching a node without disabling it is not recommended.

Disclaimers

  • The patch orchestration app accepts the End-User License Agreement of Windows Update on behalf of the user. Optionally, the setting can be turned off in the configuration of the application.

  • The patch orchestration app collects telemetry to track usage and performance. The application's telemetry follows the Service Fabric runtime's telemetry setting (which is on by default).

Troubleshooting

A node is not coming back to up state

The node might be stuck in a disabling state because:

A safety check is pending. To remedy this situation, ensure that enough nodes are available in a healthy state.

The node might be stuck in a disabled state because:

  • The node was disabled manually.
  • The node was disabled due to an ongoing Azure infrastructure job.
  • The node was disabled temporarily by the patch orchestration app to patch the node.

The node might be stuck in a down state because:

  • The node was put in a down state manually.
  • The node is undergoing a restart (which might be triggered by the patch orchestration app).
  • The node is down due to a faulty VM or machine, or network connectivity issues.

Updates were skipped on some nodes

The patch orchestration app tries to install a Windows update according to the rescheduling policy. The service tries to recover the node and skip the update according to the application policy.

In such a case, a warning-level health report is generated against the Node Agent Service. The Windows Update result also contains the possible reason for the failure.

The health of the cluster goes to error while the update installs

A faulty Windows update can bring down the health of an application or cluster on a particular node or upgrade domain. The patch orchestration app discontinues any subsequent Windows Update operation until the cluster is healthy again.

An administrator must intervene and determine why the application or cluster became unhealthy due to Windows Update.