Patch the Windows operating system in your Service Fabric cluster

Azure virtual machine scale set automatic OS image upgrades are the best practice for keeping your operating systems patched in Azure. The Patch Orchestration Application (POA) is a wrapper around the Service Fabric RepairManager system service that enables configuration-based OS patch scheduling for non-Azure hosted clusters. POA is not required for non-Azure hosted clusters, but scheduling patch installation by upgrade domain is required to patch Service Fabric cluster hosts without downtime.

POA is an Azure Service Fabric application that automates operating system patching on a Service Fabric cluster without downtime.

The patch orchestration app provides the following features:

  • Automatic operating system update installation. Operating system updates are automatically downloaded and installed. Cluster nodes are rebooted as needed without cluster downtime.

  • Cluster-aware patching and health integration. While applying updates, the patch orchestration app monitors the health of the cluster nodes. Cluster nodes are upgraded one node at a time or one upgrade domain at a time. If the health of the cluster goes down due to the patching process, patching is stopped to prevent aggravating the problem.

Internal details of the app

The patch orchestration app is composed of the following subcomponents:

  • Coordinator Service: This stateful service is responsible for:
    • Coordinating the Windows Update job on the entire cluster.
    • Storing the result of completed Windows Update operations.
  • Node Agent Service: This stateless service runs on all Service Fabric cluster nodes. The service is responsible for:
    • Bootstrapping the Node Agent NTService.
    • Monitoring the Node Agent NTService.
  • Node Agent NTService: This Windows NT service runs at a higher-level privilege (SYSTEM). In contrast, the Node Agent Service and the Coordinator Service run at a lower-level privilege (NETWORK SERVICE). The service is responsible for performing the following Windows Update jobs on all the cluster nodes:
    • Disabling automatic Windows Update on the node.
    • Downloading and installing Windows updates according to the policy the user has provided.
    • Restarting the machine after Windows Update installation.
    • Uploading the results of Windows updates to the Coordinator Service.
    • Publishing health reports when an operation has failed after exhausting all retries.

Note

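The patch orchestration app uses the Service Fabric repair manager system service to disable or enable nodes and to perform health checks. The repair task created by the patch orchestration app tracks the Windows Update progress for each node.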

Prerequisites

Enable the repair manager service (if it's not running already)

The patch orchestration app requires the repair manager system service to be enabled on the cluster.

Azure clusters

Azure clusters in the silver durability tier have the repair manager service enabled by default. Azure clusters in the gold durability tier might or might not have the repair manager service enabled, depending on when those clusters were created. Azure clusters in the bronze durability tier do not have the repair manager service enabled by default. If the service is already enabled, you can see it running in the system services section of Service Fabric Explorer.

Azure portal

You can enable repair manager from the Azure portal when you set up the cluster. Select the Include Repair Manager option under Add-on features during cluster configuration.

Image of enabling Repair Manager from the Azure portal

Azure Resource Manager deployment model

Alternatively, you can use the Azure Resource Manager deployment model to enable the repair manager service on new and existing Service Fabric clusters. Get the template for the cluster that you want to deploy. You can either use the sample templates or create a custom Azure Resource Manager deployment model template.

To enable the repair manager service by using an Azure Resource Manager deployment model template:

  1. First check that the apiVersion for the Microsoft.ServiceFabric/clusters resource is set to 2017-07-01-preview or higher. If it is not, update it:

    {
        "apiVersion": "2017-07-01-preview",
        "type": "Microsoft.ServiceFabric/clusters",
        "name": "[parameters('clusterName')]",
        "location": "[parameters('clusterLocation')]",
        ...
    }
    
  2. Now enable the repair manager service by adding the following addonFeatures section after the fabricSettings section:

    "fabricSettings": [
        ...      
    ],
    "addonFeatures": [
        "RepairManager"
    ],
    
  3. After you have updated your cluster template with these changes, apply them and let the upgrade finish. You can then see the repair manager system service running in your cluster. It is called fabric:/System/RepairManagerService in the system services section of Service Fabric Explorer.
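The edit in step 2 can also be scripted. The following is a minimal Python sketch, not an official tool: it assumes you have already loaded the JSON object that contains fabricSettings into a dictionary, and the function name enable_repair_manager is hypothetical. It adds "RepairManager" to addonFeatures only if it is not already listed.

```python
import json


def enable_repair_manager(properties: dict) -> dict:
    """Return a copy of `properties` with "RepairManager" listed in addonFeatures.

    `properties` is assumed to be the JSON object that holds fabricSettings.
    """
    updated = dict(properties)
    features = list(updated.get("addonFeatures", []))
    if "RepairManager" not in features:
        features.append("RepairManager")
    updated["addonFeatures"] = features
    return updated


# Hypothetical fragment standing in for the object that contains fabricSettings.
fragment = {"fabricSettings": []}
print(json.dumps(enable_repair_manager(fragment), indent=4))
```

Because the function is idempotent, running it against a template that already lists RepairManager leaves the list unchanged.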

Standalone on-premises clusters

You can use the configuration settings for standalone Windows clusters to enable the repair manager service on new and existing Service Fabric clusters.

To enable the repair manager service:

  1. First check that the apiVersion in the general cluster configurations is set to 04-2017 or higher:

    {
        "name": "SampleCluster",
        "clusterConfigurationVersion": "1.0.0",
        "apiVersion": "04-2017",
        ...
    }
    
  2. Now enable the repair manager service by adding the following addonFeatures section after the fabricSettings section:

    "fabricSettings": [
        ...      
    ],
    "addonFeatures": [
        "RepairManager"
    ],
    
  3. After you have updated your cluster manifest with these changes, use the updated manifest to create a new cluster or to upgrade the cluster configuration. Once the cluster is running with the updated cluster manifest, you can see the repair manager system service running in your cluster. It is called fabric:/System/RepairManagerService in the system services section of Service Fabric Explorer.

Configure Windows Updates for all nodes

Automatic Windows Updates might lead to availability loss because multiple cluster nodes can restart at the same time. By default, the patch orchestration app tries to disable automatic Windows Update on each cluster node. However, if the settings are managed by an administrator or Group Policy, we recommend explicitly setting the Windows Update policy to "Notify before Download".

Download the app package

The application, along with installation scripts, can be downloaded from the Archive link.

The application in sfpkg format can be downloaded from the sfpkg link. This format is handy for Azure Resource Manager based application deployment.

Configure the app

The behavior of the patch orchestration app can be configured to meet your needs. Override the default values by passing in application parameters during application creation or update. Application parameters can be provided by specifying ApplicationParameter to the Start-ServiceFabricApplicationUpgrade or New-ServiceFabricApplication cmdlets.

Parameter  Type  Details
MaxResultsToCache  Long  Maximum number of Windows Update results that should be cached.
The default value is 3000, assuming that:
- The number of nodes is 20.
- The number of updates happening on a node per month is five.
- The number of results per operation can be 10.
- Results for the past three months should be stored.
TaskApprovalPolicy  Enum
{ NodeWise, UpgradeDomainWise }
TaskApprovalPolicy indicates the policy that the Coordinator Service uses to install Windows updates across the Service Fabric cluster nodes.
Allowed values are:
NodeWise. Windows Update is installed one node at a time.
UpgradeDomainWise. Windows Update is installed one upgrade domain at a time. (At most, all the nodes belonging to an upgrade domain can undergo Windows Update at the same time.)
Refer to the FAQ section to decide which policy is best suited for your cluster.
LogsDiskQuotaInMB  Long
(Default: 1024)
Maximum size, in MB, of patch orchestration app logs that can be persisted locally on nodes.
WUQuery  String
(Default: "IsInstalled=0")
Query to get Windows updates. For more information, see WuQuery.
InstallWindowsOSOnlyUpdates  Boolean
(Default: false)
Use this flag to control which updates are downloaded and installed. The following values are allowed:
true - Installs only Windows operating system updates.
false - Installs all the available updates on the machine.
WUOperationTimeOutInMinutes  Int
(Default: 90)
Specifies the timeout for any Windows Update operation (search, download, or install). If the operation is not completed within the specified timeout, it is aborted.
WURescheduleCount  Int
(Default: 5)
The maximum number of times the service reschedules the Windows update when an operation fails persistently.
WURescheduleTimeInMinutes  Int
(Default: 30)
The interval, in minutes, at which the service reschedules the Windows update when the failure persists.
WUFrequency  Comma-separated string
(Default: "Weekly, Wednesday, 7:00:00")
The frequency for installing Windows Update. The format and possible values are:
- Monthly, DD, HH:MM:SS, for example, Monthly, 5, 12:22:32.
Permitted values for the field DD (day) are numbers in the range 1-28 and "last".
- Weekly, DAY, HH:MM:SS, for example, Weekly, Tuesday, 12:22:32.
- Daily, HH:MM:SS, for example, Daily, 12:22:32.
- None indicates that Windows Update shouldn't be done.

Note that times are in UTC.
AcceptWindowsUpdateEula  Boolean
(Default: true)
By setting this flag, the application accepts the End-User License Agreement for Windows Update on behalf of the owner of the machine.
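The MaxResultsToCache default of 3000 is the product of the four assumptions listed for that parameter. The short Python sketch below shows the arithmetic so that you can resize the cache for a different cluster; the function and argument names are illustrative and are not part of the application.

```python
def max_results_to_cache(nodes: int,
                         updates_per_node_per_month: int,
                         results_per_operation: int,
                         months_retained: int) -> int:
    """Estimate how many Windows Update results the cache must hold."""
    return (nodes * updates_per_node_per_month
            * results_per_operation * months_retained)


# The documented assumptions: 20 nodes, 5 updates per node per month,
# 10 results per operation, 3 months of history.
print(max_results_to_cache(20, 5, 10, 3))  # 3000
```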

Tip

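To make Windows Update happen immediately, set WUFrequency relative to the application deployment time. For example, suppose that you have a five-node test cluster and plan to deploy the app at around 5:00 PM UTC. If you assume that the application upgrade or deployment takes at most 30 minutes, set WUFrequency to "Daily, 17:30:00".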
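Because WUFrequency is a free-form string, it can help to check a value against the documented formats before deploying. The following Python sketch is one way to do that. It is not the parser the application uses: the function name is hypothetical, and treating keywords and day names as case-insensitive is an assumption.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
TIME_RE = re.compile(r"([0-9]{1,2}):([0-9]{2}):([0-9]{2})")


def _valid_time(text: str) -> bool:
    """Check an HH:MM:SS value (a one-digit hour such as 7:00:00 is accepted)."""
    match = TIME_RE.fullmatch(text)
    if not match:
        return False
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours < 24 and minutes < 60 and seconds < 60


def is_valid_wu_frequency(value: str) -> bool:
    """Return True if `value` matches one of the documented WUFrequency formats."""
    parts = [part.strip() for part in value.split(",")]
    kind = parts[0].lower()
    if kind == "none":
        return len(parts) == 1
    if kind == "daily":
        return len(parts) == 2 and _valid_time(parts[1])
    if kind == "weekly":
        return (len(parts) == 3 and parts[1].lower() in DAYS
                and _valid_time(parts[2]))
    if kind == "monthly":
        if len(parts) != 3 or not _valid_time(parts[2]):
            return False
        day = parts[1].lower()
        if day == "last":
            return True
        return bool(re.fullmatch(r"[0-9]{1,2}", day)) and 1 <= int(day) <= 28
    return False


for candidate in ("Weekly, Wednesday, 7:00:00", "Daily, 17:30:00",
                  "Monthly, 29, 12:00:00"):
    print(candidate, "->", is_valid_wu_frequency(candidate))
```

The last candidate is rejected because the day field only permits 1-28 or "last".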

Deploy the app

  1. Finish all the prerequisite steps to prepare the cluster.

  2. Deploy the patch orchestration app like any other Service Fabric app. You can deploy the app by using PowerShell. Follow the steps in Deploy and remove applications using PowerShell.

  3. To configure the application at the time of deployment, pass the ApplicationParameter to the New-ServiceFabricApplication cmdlet. For your convenience, we've provided the script Deploy.ps1 along with the application. To use the script:

    • Connect to a Service Fabric cluster by using Connect-ServiceFabricCluster.
    • Execute the PowerShell script Deploy.ps1 with the appropriate ApplicationParameter value.

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.

Upgrade the app

To upgrade an existing patch orchestration app by using PowerShell, follow the steps in Service Fabric application upgrade using PowerShell.

Remove the app

To remove the application, follow the steps in Deploy and remove applications using PowerShell.

For your convenience, we've provided the script Undeploy.ps1 along with the application. To use the script:

  • Connect to a Service Fabric cluster by using Connect-ServiceFabricCluster.

  • Execute the PowerShell script Undeploy.ps1.

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.

View the Windows Update results

The patch orchestration app exposes REST APIs to display the historical results to the user. An example of the result JSON:

[
  {
    "NodeName": "_stg1vm_1",
    "WindowsUpdateOperationResults": [
      {
        "OperationResult": 0,
        "NodeName": "_stg1vm_1",
        "OperationTime": "2017-05-21T11:46:52.1953713Z",
        "UpdateDetails": [
          {
            "UpdateId": "7392acaf-6a85-427c-8a8d-058c25beb0d6",
            "Title": "Cumulative Security Update for Internet Explorer 11 for Windows Server 2012 R2 (KB3185319)",
            "Description": "A security issue has been identified in a Microsoft software product that could affect your system. You can help protect your system by installing this update from Microsoft. For a complete listing of the issues that are included in this update, see the associated Microsoft Knowledge Base article. After you install this update, you may have to restart your system.",
            "ResultCode": 0
          }
        ],
        "OperationType": 1,
        "WindowsUpdateQuery": "IsInstalled=0",
        "WindowsUpdateFrequency": "Daily,10:00:00",
        "RebootRequired": false
      }
    ]
  },
  ...
]

The fields of the JSON are described below.

Field  Values  Details
OperationResult  0 - Succeeded
1 - Succeeded With Errors
2 - Failed
3 - Aborted
4 - Aborted With Timeout
Indicates the result of the overall operation (typically involving installation of one or more updates).
ResultCode  Same as OperationResult  Indicates the result of the installation operation for an individual update.
OperationType  1 - Installation
0 - Search and Download
Installation is the only OperationType shown in the results by default.
WindowsUpdateQuery  Default is "IsInstalled=0"  The Windows Update query that was used to search for updates. For more information, see WuQuery.
RebootRequired  true - reboot was required
false - reboot was not required
Indicates whether a reboot was required to complete installation of updates.

If no update is scheduled yet, the result JSON is empty.
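The numeric codes in the result JSON are easier to read once they are mapped back to the names in the field table. The following Python sketch does that for a trimmed copy of the sample result shown earlier. The summarize function and the tuple layout are illustrative choices, not part of the application.

```python
import json

# Code-to-name mappings taken from the field table above.
OPERATION_RESULT = {
    0: "Succeeded",
    1: "Succeeded With Errors",
    2: "Failed",
    3: "Aborted",
    4: "Aborted With Timeout",
}
OPERATION_TYPE = {0: "Search and Download", 1: "Installation"}


def summarize(results_json: str) -> list:
    """Flatten the result JSON into (node, time, type, result, reboot) tuples."""
    rows = []
    for node in json.loads(results_json):
        for operation in node["WindowsUpdateOperationResults"]:
            rows.append((
                operation["NodeName"],
                operation["OperationTime"],
                OPERATION_TYPE.get(operation["OperationType"], "Unknown"),
                OPERATION_RESULT.get(operation["OperationResult"], "Unknown"),
                operation["RebootRequired"],
            ))
    return rows


# Trimmed version of the sample result (UpdateDetails left empty for brevity).
sample = """[{"NodeName": "_stg1vm_1", "WindowsUpdateOperationResults": [
  {"OperationResult": 0, "NodeName": "_stg1vm_1",
   "OperationTime": "2017-05-21T11:46:52.1953713Z", "UpdateDetails": [],
   "OperationType": 1, "WindowsUpdateQuery": "IsInstalled=0",
   "WindowsUpdateFrequency": "Daily,10:00:00", "RebootRequired": false}]}]"""

for row in summarize(sample):
    print(row)
```

An empty result ("[]") produces an empty list, which matches the case where no update has been scheduled yet.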

Log in to the cluster to query Windows Update results. Then find the replica address for the primary of the Coordinator Service, and open the following URL in a browser: http://<REPLICA-IP>:<ApplicationPort>/PatchOrchestrationApplication/v1/GetWindowsUpdateResults.

The REST endpoint for the Coordinator Service has a dynamic port. To check the exact URL, refer to Service Fabric Explorer. For example, the results are available at http://10.0.0.7:20000/PatchOrchestrationApplication/v1/GetWindowsUpdateResults.

Image of REST endpoint

If the reverse proxy is enabled on the cluster, you can access the URL from outside the cluster as well. The endpoint to use is http://<SERVERURL>:<REVERSEPROXYPORT>/PatchOrchestrationApplication/CoordinatorService/v1/GetWindowsUpdateResults.
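The two URL shapes above can be built and queried from a script instead of a browser. The following Python sketch uses only the standard library. The helper names are hypothetical, and decoding the response as UTF-8 JSON is an assumption about the endpoint. The example prints a URL and does not make a network call.

```python
import json
import urllib.request

RESULTS_PATH = "v1/GetWindowsUpdateResults"


def direct_results_url(replica_ip: str, application_port: int) -> str:
    """URL for the Coordinator Service primary, reachable from inside the cluster."""
    return (f"http://{replica_ip}:{application_port}"
            f"/PatchOrchestrationApplication/{RESULTS_PATH}")


def reverse_proxy_results_url(server_url: str, reverse_proxy_port: int) -> str:
    """URL through the reverse proxy, reachable from outside the cluster."""
    return (f"http://{server_url}:{reverse_proxy_port}"
            f"/PatchOrchestrationApplication/CoordinatorService/{RESULTS_PATH}")


def fetch_results(url: str, timeout_seconds: float = 30.0) -> list:
    """Download and decode the result JSON (assumes a UTF-8 JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


# Address and port taken from the example in the text; no request is sent here.
print(direct_results_url("10.0.0.7", 20000))
```

To retrieve results, pass either URL to fetch_results from a machine that can reach the endpoint.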

To enable the reverse proxy on the cluster, follow the steps in Reverse proxy in Azure Service Fabric.

Warning

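After the reverse proxy is configured, all microservices in the cluster that expose an HTTP endpoint are addressable from outside the cluster.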

Diagnostics/health events

Diagnostic logs

Patch orchestration app logs are collected as part of Service Fabric runtime logs.

If you want to capture logs with a diagnostic tool or pipeline of your choice, note that the patch orchestration application uses the following fixed provider IDs to log events through EventSource:

  • e39b723c-590c-4090-abb0-11e3e6616346
  • fc0028ff-bfdc-499f-80dc-ed922c52c5e9
  • 24afa313-0d3b-4c7c-b485-1047fd964b60
  • 05dc046c-60e9-4ef7-965e-91660adffa68

Health reports

The patch orchestration app also publishes health reports against the Coordinator Service or the Node Agent Service in the following cases:

A Windows Update operation failed

If a Windows Update operation fails on a node, a health report is generated against the Node Agent Service. Details of the health report contain the problematic node name.

After patching is successfully completed on the problematic node, the report is automatically cleared.

The Node Agent NTService is down

If the Node Agent NTService is down on a node, a warning-level health report is generated against the Node Agent Service.

The repair manager service is not enabled

If the repair manager service is not found on the cluster, a warning-level health report is generated against the Coordinator Service.

Frequently asked questions

Q. Why do I see my cluster in an error state when the patch orchestration app is running?

A. During the installation process, the patch orchestration app disables or restarts nodes, which can temporarily result in the health of the cluster going down.

Based on the policy for the application, either one node can go down during a patching operation or an entire upgrade domain can go down simultaneously.

By the end of the Windows Update installation, the nodes are re-enabled after the restart.

In the following example, the cluster went to an error state temporarily because two nodes were down and the MaxPercentageUnhealthyNodes policy was violated. The error is temporary and lasts only while the patching operation is ongoing.

Image of unhealthy cluster

If the issue persists, refer to the Troubleshooting section.

Q. What should I do if the patch orchestration app is in a warning state?

A. Check whether a health report posted against the application is the root cause. Usually, the warning contains details of the problem. If the issue is transient, the application is expected to recover from this state automatically.

Q. What can I do if my cluster is unhealthy and I need to do an urgent operating system update?

A. The patch orchestration app does not install updates while the cluster is unhealthy. Try to bring your cluster to a healthy state to unblock the patch orchestration app workflow.

Q. Should I set TaskApprovalPolicy to 'NodeWise' or 'UpgradeDomainWise' for my cluster?

A. 'UpgradeDomainWise' makes overall cluster patching faster by patching all the nodes belonging to an upgrade domain in parallel. This means that the nodes belonging to an entire upgrade domain are unavailable (in Disabled state) during the patching process.

In contrast, the 'NodeWise' policy patches only one node at a time, so overall cluster patching takes longer. However, at most one node is unavailable (in Disabled state) during the patching process.

If your cluster can tolerate running on N-1 upgrade domains during the patching cycle (where N is the total number of upgrade domains in your cluster), you can set the policy to 'UpgradeDomainWise'. Otherwise, set it to 'NodeWise'.

Q. How much time does it take to patch a node?

A. Patching a node may take minutes (for example, Windows Defender definition updates) to hours (for example, Windows cumulative updates). The time required to patch a node depends mostly on:

  • The size of the updates.
  • The number of updates that have to be applied in a patching window.
  • The time it takes to install the updates, reboot the node (if required), and finish post-reboot installation steps.
  • The performance of the VM or machine, and network conditions.

Q. How long does it take to patch an entire cluster?

A. The time needed to patch an entire cluster depends on the following factors:

  • The time needed to patch a node.
  • The policy of the Coordinator Service. The default policy, NodeWise, patches only one node at a time, which is slower than UpgradeDomainWise. For example, suppose a node takes ~1 hour to patch and the cluster has 20 nodes of the same type in 5 upgrade domains, each containing 4 nodes:
    • With the NodeWise policy, it should take ~20 hours to patch the entire cluster.
    • With the UpgradeDomainWise policy, it should take ~5 hours.
  • Cluster load. Each patching operation requires relocating the customer workload to other available nodes in the cluster. A node undergoing patching is in Disabling state during this time. If the cluster is running near peak load, the disabling process takes longer. Hence the overall patching process may appear slow in such stressed conditions.
  • Cluster health failures during patching. Any degradation in the health of the cluster interrupts the patching process. This adds to the overall time required to patch the entire cluster.
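The policy comparison above reduces to simple arithmetic: NodeWise patches nodes one after another, while UpgradeDomainWise patches all nodes of one upgrade domain in parallel. The Python sketch below reproduces the 20-hour and 5-hour figures. It is a best-case estimate only, it ignores the load and health factors listed above, and the function name is hypothetical.

```python
def estimated_patch_hours(upgrade_domains: int,
                          nodes_per_domain: int,
                          hours_per_node: float,
                          policy: str) -> float:
    """Best-case time to patch the whole cluster under a TaskApprovalPolicy."""
    if policy == "NodeWise":
        # One node at a time across the whole cluster.
        return upgrade_domains * nodes_per_domain * hours_per_node
    if policy == "UpgradeDomainWise":
        # All nodes in an upgrade domain are patched in parallel.
        return upgrade_domains * hours_per_node
    raise ValueError(f"Unknown TaskApprovalPolicy: {policy}")


# The example from the text: 5 upgrade domains, 4 nodes each, ~1 hour per node.
for policy in ("NodeWise", "UpgradeDomainWise"):
    print(policy, estimated_patch_hours(5, 4, 1.0, policy))
```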

Q. Why do I see some updates in Windows Update results obtained via REST API, but not under the Windows Update history on the machine?

A. Some product updates appear only in their respective update/patch history. For example, Windows Defender updates may or may not show up in Windows Update history on Windows Server 2016.

Q. Can the patch orchestration app be used to patch my dev cluster (one-node cluster)?

A. No, the patch orchestration app cannot be used to patch a one-node cluster. This limitation is by design: Service Fabric system services or any customer apps would face downtime, so the repair manager would never approve a repair job for patching.

Disclaimers

  • The patch orchestration app accepts the End-User License Agreement of Windows Update on behalf of the user. Optionally, the setting can be turned off in the configuration of the application.

  • The patch orchestration app collects telemetry to track usage and performance. The application's telemetry follows the Service Fabric runtime's telemetry setting (which is on by default).

Troubleshooting

A node is not coming back to up state

The node might be stuck in a disabling state because:

  • A safety check is pending. To remedy this situation, ensure that enough nodes are available in a healthy state.

The node might be stuck in a disabled state because:

  • The node was disabled manually.
  • The node was disabled due to an ongoing Azure infrastructure job.
  • The node was disabled temporarily by the patch orchestration app to patch the node.

The node might be stuck in a down state because:

  • The node was put in a down state manually.
  • The node is undergoing a restart (which might be triggered by the patch orchestration app).
  • The node is down due to a faulty VM or machine, or network connectivity issues.

Updates were skipped on some nodes

The patch orchestration app tries to install a Windows update according to the rescheduling policy. If installation keeps failing, the service tries to recover the node and skips the update, according to the application policy.

In such a case, a warning-level health report is generated against the Node Agent Service. The result for Windows Update also contains the possible reason for the failure.

The health of the cluster goes to error while the update installs

A faulty Windows update can bring down the health of an application or cluster on a particular node or upgrade domain. The patch orchestration app discontinues any subsequent Windows Update operation until the cluster is healthy again.

An administrator must intervene and determine why the application or cluster became unhealthy due to Windows Update.