Patch the Linux operating system in your Service Fabric cluster

The patch orchestration application is an Azure Service Fabric application that automates operating system patching on a Service Fabric cluster without downtime.

The patch orchestration app provides the following features:

  • Automatic operating system update installation. Operating system updates are automatically downloaded and installed. Cluster nodes are rebooted as needed without cluster downtime.

  • Cluster-aware patching and health integration. While applying updates, the patch orchestration app monitors the health of the cluster nodes. Cluster nodes are upgraded one node at a time or one upgrade domain at a time. If the health of the cluster degrades because of the patching process, patching stops to avoid aggravating the problem.
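The cluster-aware behavior described above can be pictured with a small sketch. This is not the app's actual implementation: the function names, the node-to-upgrade-domain mapping, and the health and patch callbacks are all hypothetical, chosen only to illustrate batching by node or by upgrade domain and stopping when the cluster becomes unhealthy.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def plan_batches(node_domains: Dict[str, str], policy: str) -> List[List[str]]:
    """Group nodes into sequential patch batches (hypothetical helper)."""
    if policy == "NodeWise":
        # One node per batch.
        return [[node] for node in sorted(node_domains)]
    if policy == "UpgradeDomainWise":
        # All nodes of one upgrade domain share a batch.
        by_domain = defaultdict(list)
        for node, domain in sorted(node_domains.items()):
            by_domain[domain].append(node)
        return [by_domain[domain] for domain in sorted(by_domain)]
    raise ValueError(f"unknown policy: {policy}")


def run_patching(
    batches: List[List[str]],
    is_cluster_healthy: Callable[[], bool],
    patch_node: Callable[[str], None],
) -> List[str]:
    """Patch batch by batch; stop as soon as the health check fails."""
    patched = []
    for batch in batches:
        if not is_cluster_healthy():
            break  # do not aggravate an unhealthy cluster
        for node in batch:
            patch_node(node)
            patched.append(node)
    return patched
```

With three nodes spread over two upgrade domains, the node-wise plan yields three batches and the upgrade-domain plan yields two; in both cases the health check runs before each batch.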

Internal details of the app

The patch orchestration app is composed of the following subcomponents:

  • Coordinator Service: This stateful service is responsible for:
    • Coordinating the OS update job on the entire cluster.
    • Storing the results of completed OS update operations.
  • Node Agent Service: This stateless service runs on all Service Fabric cluster nodes. The service is responsible for:
    • Bootstrapping the Node Agent daemon on Linux.
    • Monitoring the daemon service.
  • Node Agent daemon: This Linux daemon service runs with elevated privileges (root). In contrast, the Node Agent Service and the Coordinator Service run with lower privileges. The daemon is responsible for performing the following update jobs on all the cluster nodes:
    • Disabling automatic OS updates on the node.
    • Downloading and installing OS updates according to the policy the user has provided.
    • Restarting the machine after OS update installation, if needed.
    • Uploading the results of OS updates to the Coordinator Service.
    • Posting health reports when an operation has failed after exhausting all retries.

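Note

The patch orchestration app uses the Service Fabric repair manager system service to disable or enable nodes and to perform health checks. The repair tasks created by the patch orchestration app track the update progress of each node.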

Prerequisites

Ensure that your Azure VMs are running Ubuntu 16.04

At the time of writing this document, Ubuntu 16.04 (Xenial Xerus) is the only supported version.

Ensure that the Service Fabric Linux cluster is version 6.2.x or later

The patch orchestration app for Linux uses certain runtime features that are available only in Service Fabric runtime version 6.2.x and later.

Enable the repair manager service (if it's not running already)

The patch orchestration app requires the repair manager system service to be enabled on the cluster.

Azure clusters

Azure Linux clusters in the silver and gold durability tiers have the repair manager service enabled by default. Azure clusters in the bronze durability tier do not have the repair manager service enabled by default. If the service is already enabled, you can see it running in the system services section of Service Fabric Explorer.

Azure portal

You can enable the repair manager from the Azure portal when you set up the cluster. Select the Include Repair Manager option under Add-on features during cluster configuration.

Image of enabling Repair Manager from the Azure portal

Azure Resource Manager deployment model

Alternatively, you can use the Azure Resource Manager deployment model to enable the repair manager service on new and existing Service Fabric clusters. Get the template for the cluster that you want to deploy. You can either use the sample templates or create a custom Azure Resource Manager deployment model template.

To enable the repair manager service by using an Azure Resource Manager deployment model template:

  1. First check that the apiVersion is set to 2017-07-01-preview for the Microsoft.ServiceFabric/clusters resource. If it is different, update the apiVersion to 2017-07-01-preview or later:

    {
        "apiVersion": "2017-07-01-preview",
        "type": "Microsoft.ServiceFabric/clusters",
        "name": "[parameters('clusterName')]",
        "location": "[parameters('clusterLocation')]",
        ...
    }
    
  2. Now enable the repair manager service by adding the following addonFeatures section after the fabricSettings section:

    "fabricSettings": [
        ...      
    ],
    "addonFeatures": [
        "RepairManager"
    ],
    
  3. After you update your cluster template with these changes, apply them and let the upgrade finish. You can then see the repair manager system service running in your cluster. It is listed as fabric:/System/RepairManagerService in the system services section of Service Fabric Explorer.
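The template changes in the steps above can also be applied programmatically. The following is a minimal sketch, assuming the template has already been loaded into a Python dictionary and that fabricSettings and addonFeatures sit under the cluster resource's properties object; the function name is hypothetical, and API versions are compared by their date prefix only.

```python
import copy

MIN_API_VERSION = "2017-07-01-preview"


def enable_repair_manager(template: dict) -> dict:
    """Return a copy of the template with RepairManager enabled (sketch)."""
    result = copy.deepcopy(template)
    for resource in result.get("resources", []):
        if resource.get("type") != "Microsoft.ServiceFabric/clusters":
            continue
        # ISO dates compare correctly as strings; only the YYYY-MM-DD
        # prefix is considered here (an assumption of this sketch).
        if resource.get("apiVersion", "")[:10] < MIN_API_VERSION[:10]:
            resource["apiVersion"] = MIN_API_VERSION
        properties = resource.setdefault("properties", {})
        features = properties.setdefault("addonFeatures", [])
        if "RepairManager" not in features:
            features.append("RepairManager")
    return result
```

The function leaves the input untouched and is idempotent, so running it on an already updated template changes nothing.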

Standalone on-premises clusters

Standalone Service Fabric Linux clusters aren't supported at the time of writing this document.

Disable automatic OS updates on all nodes

Automatic OS updates might lead to availability loss or to a change in behavior of the running applications. To prevent such scenarios, the patch orchestration app tries by default to disable automatic OS updates on each cluster node. On Ubuntu, the patch orchestration app disables unattended-upgrades.

Download the app package

The application, along with its installation scripts, can be downloaded from the archive link.

The application in sfpkg format can be downloaded from the sfpkg link. This format is handy for Azure Resource Manager based application deployment.

Configure the app

The behavior of the patch orchestration app can be configured to meet your needs. Override the default values by passing in application parameters during application creation or update. Application parameters can be provided by specifying ApplicationParameter to the Start-ServiceFabricApplicationUpgrade or New-ServiceFabricApplication cmdlets.

Parameter | Type | Details

  • MaxResultsToCache | Long | Maximum number of update results to cache. The default value is 3000, which assumes that:
    • The number of nodes is 20.
    • The number of updates on a node per month is five.
    • The number of results per operation can be 10.
    • Results for the past three months should be stored.
    (20 nodes × 5 updates × 10 results × 3 months = 3000.)
  • TaskApprovalPolicy | Enum { NodeWise, UpgradeDomainWise } | The policy that the Coordinator Service uses to install updates across the Service Fabric cluster nodes. Allowed values are:
    • NodeWise. Updates are installed one node at a time.
    • UpgradeDomainWise. Updates are installed one upgrade domain at a time. (At most, all the nodes in an upgrade domain can be updated at once.)
  • UpdateOperationTimeOutInMinutes | Int (Default: 180) | The timeout for any update operation (download or install). An operation that does not complete within the specified timeout is aborted.
  • RescheduleCount | Int (Default: 5) | The maximum number of times the service reschedules the OS update if an operation fails persistently.
  • RescheduleTimeInMinutes | Int (Default: 30) | The interval at which the service reschedules the OS update if the failure persists.
  • UpdateFrequency | Comma-separated string (Default: "Weekly, Wednesday, 7:00:00") | The frequency for installing OS updates on the cluster. All times are in UTC. The format and possible values are:
    • Monthly, DD, HH:MM:SS, for example, Monthly, 5, 12:22:32.
    • Weekly, DAY, HH:MM:SS, for example, Weekly, Tuesday, 12:22:32.
    • Daily, HH:MM:SS, for example, Daily, 12:22:32.
    • None indicates that updates shouldn't be done.
  • UpdateClassification | Comma-separated string (Default: "securityupdates") | The type of updates to install on the cluster nodes. Acceptable values are securityupdates and all.
    • securityupdates installs only security updates.
    • all installs all available updates from apt.
  • ApprovedPatches | Comma-separated string (Default: "") | The list of approved updates to install on the cluster nodes. The comma-separated list contains approved packages and, optionally, the desired target version. For example: "apt-utils = 1.2.10ubuntu1, python3-jwt, apt-transport-https < 1.2.194, libsystemd0 >= 229-4ubuntu16". This example does the following:
    • Installs apt-utils version 1.2.10ubuntu1 if it is available in apt-cache. If that particular version isn't available, it is a no-op.
    • Upgrades python3-jwt to the latest available version. If the package is not present, it is a no-op.
    • Upgrades apt-transport-https to the highest version that is less than 1.2.194. If no such version is present, it is a no-op.
    • Upgrades libsystemd0 to the highest version that is greater than or equal to 229-4ubuntu16. If no such version exists, it is a no-op.
  • RejectedPatches | Comma-separated string (Default: "") | The list of updates that should not be installed on the cluster nodes. For example: "bash, sudo". This example prevents bash and sudo from receiving any updates.
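To make the ApprovedPatches and RejectedPatches list format concrete, here is a minimal parsing sketch. It is not the app's own parser: the PackageSpec type and parse_patch_list function are hypothetical, and the accepted operator set (=, <, <=, >, >=) is an assumption that extends the three operators shown in the example above.

```python
import re
from typing import List, NamedTuple, Optional


class PackageSpec(NamedTuple):
    name: str
    operator: Optional[str]  # None when no version constraint is given
    version: Optional[str]


# Package name, then an optional "<operator> <version>" constraint.
_SPEC = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9+.\-]*)\s*"
    r"(?:(?P<op><=|>=|=|<|>)\s*(?P<ver>\S+))?$"
)


def parse_patch_list(value: str) -> List[PackageSpec]:
    """Split a comma-separated patch list into package specs."""
    specs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty strings and trailing commas
        match = _SPEC.match(entry)
        if match is None:
            raise ValueError(f"cannot parse entry: {entry!r}")
        specs.append(PackageSpec(match["name"], match["op"], match["ver"]))
    return specs
```

Validating the string this way before passing it as an application parameter catches typos such as a missing comma early, before they reach the cluster.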

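Tip

To make an OS update happen immediately, set UpdateFrequency relative to the application deployment time. For example, suppose that you have a five-node test cluster and plan to deploy the app at around 5:00 PM UTC. If you assume that the application upgrade or deployment takes 30 minutes at most, set UpdateFrequency to "Daily, 17:30:00".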
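The UpdateFrequency formats and the timing calculation in the tip can be sketched as follows. The parser and helper names are hypothetical, and the validation rules (day of month between 1 and 31, English weekday names) are assumptions rather than the app's documented behavior.

```python
from datetime import datetime, time, timedelta
from typing import NamedTuple, Optional, Union

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


class UpdateFrequency(NamedTuple):
    kind: str                      # "Monthly", "Weekly", "Daily", or "None"
    day: Optional[Union[int, str]]
    at: Optional[time]             # UTC time of day


def _parse_time(text: str) -> time:
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return time(hours, minutes, seconds)


def parse_update_frequency(value: str) -> UpdateFrequency:
    """Parse an UpdateFrequency string such as 'Weekly, Wednesday, 7:00:00'."""
    parts = [part.strip() for part in value.split(",")]
    kind = parts[0]
    if kind == "None" and len(parts) == 1:
        return UpdateFrequency("None", None, None)
    if kind == "Daily" and len(parts) == 2:
        return UpdateFrequency("Daily", None, _parse_time(parts[1]))
    if kind == "Weekly" and len(parts) == 3 and parts[1] in WEEKDAYS:
        return UpdateFrequency("Weekly", parts[1], _parse_time(parts[2]))
    if (kind == "Monthly" and len(parts) == 3 and parts[1].isdigit()
            and 1 <= int(parts[1]) <= 31):
        return UpdateFrequency("Monthly", int(parts[1]), _parse_time(parts[2]))
    raise ValueError(f"unsupported UpdateFrequency: {value!r}")


def daily_frequency_after(deploy_utc: datetime, buffer_minutes: int) -> str:
    """Build a Daily value that fires buffer_minutes after deployment."""
    start = deploy_utc + timedelta(minutes=buffer_minutes)
    return f"Daily, {start.hour}:{start.minute:02d}:{start.second:02d}"
```

For a deployment at 17:00 UTC with a 30-minute buffer, daily_frequency_after produces the "Daily, 17:30:00" value from the tip.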

Deploy the app

  1. Prepare the cluster by finishing all the prerequisite steps.

  2. Deploy the patch orchestration app like any other Service Fabric app. You can deploy the app by using PowerShell or Azure Service Fabric CLI. Follow the steps in Deploy and remove applications using PowerShell or Deploy application using Azure Service Fabric CLI.

  3. To configure the application at the time of deployment, pass the ApplicationParameter to the New-ServiceFabricApplication cmdlet or to the provided scripts. For your convenience, PowerShell (Deploy.ps1) and bash (Deploy.sh) scripts are provided along with the application. To use the script:

    • Connect to a Service Fabric cluster.
    • Execute the Deploy script. Optionally pass the application parameter to the script. For example: .\Deploy.ps1 -ApplicationParameter @{ UpdateFrequency = "Daily, 11:00:00"} or ./Deploy.sh "{\"UpdateFrequency\":\"Daily, 11:00:00\"}"

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.
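Because Deploy.sh takes its application parameters as a JSON string, shell quoting is easy to get wrong. The sketch below builds a safely quoted command line; the helper name is hypothetical, and it assumes only what the example above shows, namely that the script accepts a single JSON object argument.

```python
import json
import shlex


def deploy_sh_command(parameters: dict, script: str = "./Deploy.sh") -> str:
    """Build a Deploy.sh command line with a shell-safe JSON argument."""
    # Compact separators keep the JSON free of incidental whitespace.
    payload = json.dumps(parameters, separators=(",", ":"))
    return f"{shlex.quote(script)} {shlex.quote(payload)}"


if __name__ == "__main__":
    print(deploy_sh_command({"UpdateFrequency": "Daily, 11:00:00"}))
```

The helper wraps the JSON in single quotes, which passes the same argument to the script as the backslash-escaped double-quoted form.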

Upgrade the app

To upgrade an existing patch orchestration app, follow the steps in Service Fabric application upgrade using PowerShell or Service Fabric application upgrade using Azure Service Fabric CLI.

Remove the app

To remove the application, follow the steps in Deploy and remove applications using PowerShell or Remove an application using Azure Service Fabric CLI.

For your convenience, PowerShell (Undeploy.ps1) and bash (Undeploy.sh) scripts are provided along with the application. To use the script:

  • Connect to a Service Fabric cluster.
  • Execute the script Undeploy.ps1 or Undeploy.sh.

Note

Keep the script and the application folder PatchOrchestrationApplication in the same directory.

View the update results

The patch orchestration app exposes REST APIs to display the historical results to the user. Following is a sample result:

testadm@bronze000001:~$ curl -X GET http://10.0.0.5:20002/PatchOrchestrationApplication/v1/GetResults

[ 
  { 
    "NodeName": "_bronze_0", 
    "UpdateOperationResults": [ 
      { 
        "OperationResult": "succeeded", 
        "NodeName": "_bronze_0", 
        "OperationTime": "2017-11-21T12:39:29.0435917Z", 
        "UpdateDetails": [ 
          { 
            "UpdateId": "linux-cloud-tools-azure:amd64=4.11.0.1015.15", 
            "ResultCode": "succeeded" 
          }, 
          { 
            "UpdateId": "linux-headers-azure:amd64=4.11.0.1015.15", 
            "ResultCode": "succeeded" 
          }, 
          { 
            "UpdateId": "linux-image-azure:amd64=4.11.0.1015.15", 
            "ResultCode": "succeeded" 
          }, 
          { 
            "UpdateId": "linux-tools-azure:amd64=4.11.0.1015.15", 
            "ResultCode": "succeeded" 
          }, 
          { 
            "UpdateId": "python3-apport:amd64=2.20.1-0ubuntu2.13", 
            "ResultCode": "succeeded" 
          } 
        ], 
        "OperationType": "installation", 
        "UpdateClassification": "securityupdates", 
        "UpdateFrequency": "Daily, 7:00:00", 
        "RebootRequired": true, 
        "ApprovedList": "", 
        "RejectedList": "" 
      } 
    ] 
  } 
] 

The fields of the JSON are described as follows:

Field | Values | Details

  • OperationResult | 0 - Succeeded; 1 - Succeeded With Errors; 2 - Failed; 3 - Aborted; 4 - Aborted With Timeout | Indicates the result of the overall operation (typically involving installation of one or more updates).
  • ResultCode | Same as OperationResult | Indicates the result of the installation operation for an individual update.
  • OperationType | 1 - Installation; 0 - Search and Download | Installation is the only OperationType that is shown in the results by default.
  • UpdateClassification | Default is "securityupdates" | The type of updates installed during the update operation.
  • UpdateFrequency | Default is "Weekly, Wednesday, 7:00:00" | The update frequency configured for this update.
  • RebootRequired | true - reboot was required; false - reboot was not required | Indicates whether a reboot was required to complete the installation of updates.
  • ApprovedList | Default is "" | The list of approved patches for this update.
  • RejectedList | Default is "" | The list of rejected patches for this update.

If no update is scheduled yet, the result JSON is empty.
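A response in the shape shown above can be condensed into a per-node summary. The following sketch relies only on field names that appear in the sample result; the function name and the summary layout are hypothetical, and an empty body is treated the same as an empty list.

```python
import json
from collections import Counter
from typing import Dict


def summarize_results(payload: str) -> Dict[str, dict]:
    """Count per-update result codes and reboot flags for each node."""
    nodes = json.loads(payload) if payload.strip() else []
    summary = {}
    for node in nodes:
        counts = Counter()
        reboot_required = False
        for operation in node.get("UpdateOperationResults", []):
            reboot_required = reboot_required or bool(operation.get("RebootRequired"))
            for detail in operation.get("UpdateDetails", []):
                counts[detail.get("ResultCode", "unknown")] += 1
        summary[node["NodeName"]] = {
            "results": dict(counts),
            "reboot_required": reboot_required,
        }
    return summary
```

Applied to the sample result, this yields one entry for _bronze_0 with five succeeded updates and reboot_required set to true.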

To query the update results, log in to the cluster. Then find the replica address for the primary of the Coordinator Service, and open the following URL in a browser: http://<REPLICA-IP>:<ApplicationPort>/PatchOrchestrationApplication/v1/GetResults.

The REST endpoint for the Coordinator Service has a dynamic port. To check the exact URL, refer to Service Fabric Explorer. For example, the results are available at http://10.0.0.7:20000/PatchOrchestrationApplication/v1/GetResults.

Image of the REST endpoint
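The same query can be scripted instead of using a browser. This is a minimal sketch using only the Python standard library; the function names are hypothetical, the replica IP and port must be looked up in Service Fabric Explorer as described above, and the request has to be issued from a machine that can reach the cluster's internal network.

```python
import json
import urllib.request

RESULTS_PATH = "/PatchOrchestrationApplication/v1/GetResults"


def results_url(replica_ip: str, port: int) -> str:
    """Build the GetResults URL for the Coordinator Service primary."""
    return f"http://{replica_ip}:{port}{RESULTS_PATH}"


def fetch_results(replica_ip: str, port: int, timeout: float = 10.0) -> list:
    """Fetch and decode the update results; an empty body yields []."""
    with urllib.request.urlopen(results_url(replica_ip, port), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body.strip() else []
```

Because the port is dynamic, the URL is built from the two values on every call rather than cached.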

Diagnostics/health events

Diagnostic logs

Patch orchestration app logs are collected as part of Service Fabric runtime logs.

If you want to capture logs through a diagnostic tool or pipeline of your choice, note that the patch orchestration application uses the following fixed provider IDs to log events via eventsource:

  • e39b723c-590c-4090-abb0-11e3e6616346
  • fc0028ff-bfdc-499f-80dc-ed922c52c5e9
  • 24afa313-0d3b-4c7c-b485-1047fd964b60
  • 05dc046c-60e9-4ef7-965e-91660adffa68

Health reports

The patch orchestration app also publishes health reports against the Coordinator Service or the Node Agent Service in the following cases:

An update operation failed

If an update operation fails on a node, a health report is generated against the Node Agent Service. The details of the health report contain the name of the problematic node.

After patching is successfully completed on the problematic node, the report is automatically cleared.

The Node Agent daemon service is down

If the Node Agent daemon service is down on a node, a warning-level health report is generated against the Node Agent Service.

The repair manager service is not enabled

If the repair manager service is not found on the cluster, a warning-level health report is generated against the Coordinator Service.

Frequently asked questions

Q. Why do I see my cluster in an error state when the patch orchestration app is running?

A. During the installation process, the patch orchestration app disables or restarts nodes. This operation can temporarily degrade the health of the cluster.

Based on the policy for the application, either one node can go down during a patching operation or an entire upgrade domain can go down simultaneously.

By the end of the installation, the nodes are reenabled after the restart.

In the following example, the cluster went into an error state temporarily because two nodes were down and the MaxPercentageUnhealthyNodes policy was violated. The error is temporary and lasts only while the patching operation is in progress.

Image of unhealthy cluster

If the issue persists, refer to the Troubleshooting section.
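The policy violation in this answer comes down to simple arithmetic. The sketch below is a simplified model, not Service Fabric's health evaluation code: it compares the share of unhealthy nodes with a threshold and ignores any rounding that the real evaluator applies. The function name and the example thresholds are hypothetical.

```python
def violates_unhealthy_nodes_policy(unhealthy: int, total: int,
                                    max_percent_unhealthy: int) -> bool:
    """True when the unhealthy share exceeds the tolerated percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer cross-multiplication avoids floating-point comparison.
    return unhealthy * 100 > max_percent_unhealthy * total
```

With two of five nodes down (40 percent), a tolerated percentage of 20 is exceeded, which corresponds to the temporary error state described above, while a tolerated percentage of 40 is not.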

Q. Why is the patch orchestration app in a warning state?

A. Check whether a health report posted against the application is the root cause. Usually, the warning contains details of the problem. If the issue is transient, the application is expected to recover from this state automatically.

Q. What can I do if my cluster is unhealthy and I need to do an urgent operating system update?

A. The patch orchestration app does not install updates while the cluster is unhealthy. To unblock the patch orchestration app workflow, bring your cluster to a healthy state.

Q. Why does patching across the cluster take so long to run?

A. The time needed by the patch orchestration app depends mostly on the following factors:

  • The policy of the Coordinator Service.
    • The default policy, NodeWise, patches only one node at a time. For larger clusters in particular, we recommend the UpgradeDomainWise policy to patch the cluster faster.
  • The number of updates available for download and installation.
  • The average time needed to download and install an update, which should not exceed a couple of hours.
  • The performance of the VM and the network bandwidth.

Q. How does the patch orchestration app decide which updates are security updates?

A. The patch orchestration app uses distro-specific logic to determine which of the available updates are security updates. For example, on Ubuntu the app searches for updates from the archives $RELEASE-security and $RELEASE-updates ($RELEASE = xenial or the Linux standard base release version).

Q. How can I lock on to a specific version of a package?

A. Use the ApprovedPatches setting to lock your packages to a particular version.

Q. What happens to automatic updates enabled in Ubuntu?

A. As soon as you install the patch orchestration app on your cluster, unattended-upgrades is disabled on your cluster nodes. The entire periodic update workflow is then driven by the patch orchestration app. To keep the environment consistent across the cluster, we recommend installing updates only through the patch orchestration app.

Q. After an upgrade, does the patch orchestration app clean up unused packages?

A. Yes, cleanup happens as part of the post-installation steps.

Q. Can the patch orchestration app be used to patch my dev cluster (one-node cluster)?

A. No, the patch orchestration app cannot be used to patch a one-node cluster. This limitation is by design: Service Fabric system services and customer apps would face downtime, so the repair manager would never approve a repair job for patching.

Troubleshooting

A node is not coming back to up state

The node might be stuck in a disabling state because:

  • A safety check is pending. To remedy this situation, ensure that enough nodes are available in a healthy state.

The node might be stuck in a disabled state because:

  • The node was disabled manually.
  • The node was disabled due to an ongoing Azure infrastructure job.
  • The node was disabled temporarily by the patch orchestration app to patch the node.

The node might be stuck in a down state because:

  • The node was put in a down state manually.
  • The node is undergoing a restart (which might be triggered by the patch orchestration app).
  • The node is down due to a faulty VM or machine, or due to network connectivity issues.

Updates were skipped on some nodes

The patch orchestration app tries to install an update according to the rescheduling policy. If the installation keeps failing, the service tries to recover the node and skips the update, according to the application policy.

In such a case, a warning-level health report is generated against the Node Agent Service. The update result also contains the possible reason for the failure.
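The rescheduling behavior can be modeled roughly as a bounded retry loop. The sketch below is an illustration under stated assumptions, not the service's implementation: it assumes one initial attempt followed by up to RescheduleCount reschedules spaced RescheduleTimeInMinutes apart, and it records minute offsets instead of actually waiting. The function name and return shape are hypothetical.

```python
from typing import Callable, List, Tuple


def attempt_update(
    try_update: Callable[[], bool],
    reschedule_count: int = 5,      # mirrors the RescheduleCount default
    reschedule_minutes: int = 30,   # mirrors the RescheduleTimeInMinutes default
) -> Tuple[str, List[int]]:
    """Return the outcome and the minute offsets at which attempts ran."""
    offsets = []
    # One initial attempt plus up to reschedule_count reschedules (assumed).
    for attempt in range(reschedule_count + 1):
        offsets.append(attempt * reschedule_minutes)
        if try_update():
            return "succeeded", offsets
    # All attempts failed: the update is skipped on this node.
    return "skipped", offsets
```

With the defaults, an update that never succeeds is given up after six attempts spread over 150 minutes, at which point the node would be recovered and a warning-level health report raised.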

The health of the cluster goes to error while the update installs

A faulty update can degrade the health of an application or cluster on a particular node or upgrade domain. The patch orchestration app discontinues any subsequent update operations until the cluster is healthy again.

An administrator must intervene and determine why the previously installed update made the application or cluster unhealthy.

Disclaimer

The patch orchestration app collects telemetry to track usage and performance. The application's telemetry follows the telemetry setting of the Service Fabric runtime (which is on by default).