Troubleshoot application upgrades

This article covers some of the common issues around upgrading an Azure Service Fabric application and how to resolve them.

Troubleshoot a failed application upgrade

When an upgrade fails, the output of the Get-ServiceFabricApplicationUpgrade command contains additional information for debugging the failure. This information can be used to do the following:

  1. Identify the failure type.
  2. Identify the failure reason.
  3. Isolate one or more failing components for further investigation.

This information is available whenever Service Fabric detects the failure, regardless of whether the FailureAction is to roll back or suspend the upgrade.

Identify the failure type

In the output of Get-ServiceFabricApplicationUpgrade, FailureTimestampUtc identifies the timestamp (in UTC) at which Service Fabric detected the upgrade failure and triggered the FailureAction. FailureReason identifies one of three potential high-level causes of the failure:

  1. UpgradeDomainTimeout - Indicates that a particular upgrade domain took too long to complete and UpgradeDomainTimeout expired.
  2. OverallUpgradeTimeout - Indicates that the overall upgrade took too long to complete and UpgradeTimeout expired.
  3. HealthCheck - Indicates that after upgrading an upgrade domain, the application remained unhealthy according to the specified health policies and HealthCheckRetryTimeout expired.

These entries only show up in the output when the upgrade fails and starts rolling back. Further information is displayed depending on the type of the failure.
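As a minimal sketch, the failure fields can be pulled out of the upgrade status for quick triage. This assumes the session is already connected to the cluster (for example with Connect-ServiceFabricCluster) and that the returned object exposes properties named as they appear in the command output. fabric:/DemoApp is the sample application name used in this article.

```powershell
# Assumes the session is already connected to the cluster.
$upgrade = Get-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp

# Show only the fields used to classify the failure.
# Property names are assumed to match the names shown in the command output.
$upgrade | Format-List UpgradeState, FailureReason, FailureTimestampUtc

# Point to the field that carries the details for each failure reason.
switch ("$($upgrade.FailureReason)") {
    'UpgradeDomainTimeout'  { 'Inspect UpgradeDomainProgressAtFailure for pending safety checks.' }
    'OverallUpgradeTimeout' { 'Inspect UpgradeDomainProgressAtFailure and review UpgradeTimeout.' }
    'HealthCheck'           { 'Inspect UnhealthyEvaluations for the failed health checks.' }
    default                 { 'No failure reason is reported for this upgrade.' }
}
```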

Investigate upgrade timeouts

Upgrade timeout failures are most commonly caused by service availability issues. The output following this paragraph is typical of upgrades where service replicas or instances fail to start in the new code version. The UpgradeDomainProgressAtFailure field captures a snapshot of any pending upgrade work at the time of failure.

Get-ServiceFabricApplicationUpgrade fabric:/DemoApp
ApplicationName                : fabric:/DemoApp
ApplicationTypeName            : DemoAppType
TargetApplicationTypeVersion   : v2
ApplicationParameters          : {}
StartTimestampUtc              : 4/14/2015 9:26:38 PM
FailureTimestampUtc            : 4/14/2015 9:27:05 PM
FailureReason                  : UpgradeDomainTimeout
UpgradeDomainProgressAtFailure : MYUD1

                                 NodeName            : Node4
                                 UpgradePhase        : PostUpgradeSafetyCheck
                                 PendingSafetyChecks :
                                     WaitForPrimaryPlacement - PartitionId: 744c8d9f-1d26-417e-a60e-cd48f5c098f0

                                 NodeName            : Node1
                                 UpgradePhase        : PostUpgradeSafetyCheck
                                 PendingSafetyChecks :
                                     WaitForPrimaryPlacement - PartitionId: 4b43f4d8-b26b-424e-9307-7a7a62e79750
UpgradeState                   : RollingBackCompleted
UpgradeDuration                : 00:00:46
CurrentUpgradeDomainDuration   : 00:00:00
NextUpgradeDomain              :
UpgradeDomainsStatus           : { "MYUD1" = "Completed";
                                 "MYUD2" = "Completed";
                                 "MYUD3" = "Completed" }
UpgradeKind                    : Rolling
RollingUpgradeMode             : UnmonitoredAuto
ForceRestart                   : False
UpgradeReplicaSetCheckTimeout  : 00:00:00

In this example, the upgrade failed at upgrade domain MYUD1 and two partitions (744c8d9f-1d26-417e-a60e-cd48f5c098f0 and 4b43f4d8-b26b-424e-9307-7a7a62e79750) were stuck. The partitions were stuck because the runtime was unable to place primary replicas (WaitForPrimaryPlacement) on target nodes Node1 and Node4.

The Get-ServiceFabricNode command can be used to verify that these two nodes are in upgrade domain MYUD1. The UpgradePhase is PostUpgradeSafetyCheck, which means that these safety checks are occurring after all nodes in the upgrade domain have finished upgrading. All this information points to a potential issue with the new version of the application code. The most common issues are service errors in the open or promotion-to-primary code paths.
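The check described above might look like the following sketch. It assumes a connected session and uses the node names and one partition ID from the sample output; the selected property names (NodeName, UpgradeDomain, NodeStatus) are the ones the node object is expected to expose.

```powershell
# List the nodes named in UpgradeDomainProgressAtFailure with their upgrade domain.
Get-ServiceFabricNode |
    Where-Object { $_.NodeName -in 'Node1', 'Node4' } |
    Select-Object NodeName, UpgradeDomain, NodeStatus

# Inspect the replicas of one stuck partition to see whether a primary was placed.
Get-ServiceFabricReplica -PartitionId 744c8d9f-1d26-417e-a60e-cd48f5c098f0
```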

An UpgradePhase of PreUpgradeSafetyCheck means there were issues preparing the upgrade domain before the upgrade was performed. The most common issues in this case are service errors in the close or demotion-from-primary code paths.

The current UpgradeState is RollingBackCompleted, so the original upgrade must have been performed with a rollback FailureAction, which automatically rolled back the upgrade upon failure. If the original upgrade had been performed with a manual FailureAction, the upgrade would instead be in a suspended state to allow live debugging of the application.

In rare cases, the UpgradeDomainProgressAtFailure field may be empty if the overall upgrade times out just as the system completes all work for the current upgrade domain. If this happens, try increasing the UpgradeTimeout and UpgradeDomainTimeout upgrade parameter values and retry the upgrade.
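A retry with longer time-outs could be sketched as follows. The application name and target version come from the sample output; the time-out values are hypothetical and should be chosen for the application at hand.

```powershell
# Hypothetical time-out values; tune them to how long the application needs to stabilize.
$upgradeArgs = @{
    ApplicationName         = 'fabric:/DemoApp'
    ApplicationTypeVersion  = 'v2'
    FailureAction           = 'Rollback'
    UpgradeDomainTimeoutSec = 1800   # per upgrade domain
    UpgradeTimeoutSec       = 7200   # whole upgrade
}

# Retry the monitored upgrade with the longer time-outs.
Start-ServiceFabricApplicationUpgrade @upgradeArgs -Monitored
```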

Investigate health check failures

Health check failures can be triggered by various issues that can happen after all nodes in an upgrade domain finish upgrading and pass all safety checks. The output following this paragraph is typical of an upgrade failure due to failed health checks. The UnhealthyEvaluations field captures a snapshot of the health checks that failed at the time of the upgrade failure, according to the specified health policy.

Get-ServiceFabricApplicationUpgrade fabric:/DemoApp
ApplicationName                         : fabric:/DemoApp
ApplicationTypeName                     : DemoAppType
TargetApplicationTypeVersion            : v4
ApplicationParameters                   : {}
StartTimestampUtc                       : 4/24/2015 2:42:31 AM
UpgradeState                            : RollingForwardPending
UpgradeDuration                         : 00:00:27
CurrentUpgradeDomainDuration            : 00:00:27
NextUpgradeDomain                       : MYUD2
UpgradeDomainsStatus                    : { "MYUD1" = "Completed";
                                          "MYUD2" = "Pending";
                                          "MYUD3" = "Pending" }
UnhealthyEvaluations                    :
                                          Unhealthy services: 50% (2/4), ServiceType='PersistedServiceType', MaxPercentUnhealthyServices=0%.

                                          Unhealthy service: ServiceName='fabric:/DemoApp/Svc3', AggregatedHealthState='Error'.

                                              Unhealthy partitions: 100% (1/1), MaxPercentUnhealthyPartitionsPerService=0%.

                                              Unhealthy partition: PartitionId='3a9911f6-a2e5-452d-89a8-09271e7e49a8', AggregatedHealthState='Error'.

                                                  Error event: SourceId='Replica', Property='InjectedFault'.

                                          Unhealthy service: ServiceName='fabric:/DemoApp/Svc2', AggregatedHealthState='Error'.

                                              Unhealthy partitions: 100% (1/1), MaxPercentUnhealthyPartitionsPerService=0%.

                                              Unhealthy partition: PartitionId='744c8d9f-1d26-417e-a60e-cd48f5c098f0', AggregatedHealthState='Error'.

                                                  Error event: SourceId='Replica', Property='InjectedFault'.

UpgradeKind                             : Rolling
RollingUpgradeMode                      : Monitored
FailureAction                           : Manual
ForceRestart                            : False
UpgradeReplicaSetCheckTimeout           : 49710.06:28:15
HealthCheckWaitDuration                 : 00:00:00
HealthCheckStableDuration               : 00:00:10
HealthCheckRetryTimeout                 : 00:00:10
UpgradeDomainTimeout                    : 10675199.02:48:05.4775807
UpgradeTimeout                          : 10675199.02:48:05.4775807
ConsiderWarningAsError                  :
MaxPercentUnhealthyPartitionsPerService :
MaxPercentUnhealthyReplicasPerPartition :
MaxPercentUnhealthyServices             :
MaxPercentUnhealthyDeployedApplications :
ServiceTypeHealthPolicyMap              :

Investigating health check failures first requires an understanding of the Service Fabric health model. But even without such an in-depth understanding, we can see that two services are unhealthy: fabric:/DemoApp/Svc3 and fabric:/DemoApp/Svc2, along with the error health reports ("InjectedFault" in this case). In this example, two out of four services (50%) are unhealthy, which exceeds the default maximum of 0% unhealthy (MaxPercentUnhealthyServices).

The upgrade was suspended upon failing because a FailureAction of Manual was specified when starting the upgrade. This mode allows us to investigate the live system in the failed state before taking any further action.
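While the upgrade is suspended, the entities named in UnhealthyEvaluations can be queried directly. The sketch below assumes a connected session and uses a service name and partition ID from the sample output.

```powershell
# Health of one service reported as unhealthy, including its health events.
Get-ServiceFabricServiceHealth -ServiceName fabric:/DemoApp/Svc3

# Health of the partition that carries the error event.
Get-ServiceFabricPartitionHealth -PartitionId 3a9911f6-a2e5-452d-89a8-09271e7e49a8
```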

Recover from a suspended upgrade

With a rollback FailureAction, no recovery is needed because the upgrade automatically rolls back upon failing. With a manual FailureAction, there are several recovery options:

  1. Trigger a rollback
  2. Proceed through the remainder of the upgrade manually
  3. Resume the monitored upgrade

The Start-ServiceFabricApplicationRollback command can be used at any time to start rolling back the application. Once the command returns successfully, the rollback request has been registered in the system and starts shortly thereafter.
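For the sample application, triggering and then observing a rollback might look like this (assuming a connected session):

```powershell
# Register the rollback request for the sample application.
Start-ServiceFabricApplicationRollback -ApplicationName fabric:/DemoApp

# Check progress; UpgradeState is expected to move through the rolling-back states.
Get-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp
```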

The Resume-ServiceFabricApplicationUpgrade command can be used to proceed through the remainder of the upgrade manually, one upgrade domain at a time. In this mode, the system performs only safety checks. No more health checks are performed. This command can only be used when the UpgradeState shows RollingForwardPending, which means that the current upgrade domain has finished upgrading but the next one has not started (pending).
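A sketch of stepping through one upgrade domain manually follows. It assumes a connected session and that the status object exposes UpgradeState and NextUpgradeDomain under the names shown in the command output.

```powershell
$upgrade = Get-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp

# Resume only when the upgrade is waiting between upgrade domains.
if ($upgrade.UpgradeState -eq 'RollingForwardPending') {
    Resume-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp `
        -UpgradeDomainName $upgrade.NextUpgradeDomain
}
else {
    "Upgrade is in state $($upgrade.UpgradeState); cannot resume manually yet."
}
```

Repeat the step for each remaining upgrade domain, checking the application between steps because no health checks run in this mode.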

The Update-ServiceFabricApplicationUpgrade command can be used to resume the monitored upgrade with both safety and health checks being performed.

Update-ServiceFabricApplicationUpgrade fabric:/DemoApp -UpgradeMode Monitored
UpgradeMode                             : Monitored
ForceRestart                            :
UpgradeReplicaSetCheckTimeout           :
FailureAction                           :
HealthCheckWaitDuration                 :
HealthCheckStableDuration               :
HealthCheckRetryTimeout                 :
UpgradeTimeout                          :
UpgradeDomainTimeout                    :
ConsiderWarningAsError                  :
MaxPercentUnhealthyPartitionsPerService :
MaxPercentUnhealthyReplicasPerPartition :
MaxPercentUnhealthyServices             :
MaxPercentUnhealthyDeployedApplications :
ServiceTypeHealthPolicyMap              :

The upgrade continues from the upgrade domain where it was last suspended and uses the same upgrade parameters and health policies as before. If needed, any of the upgrade parameters and health policies shown in the preceding output can be changed in the same command when the upgrade resumes. In this example, the upgrade was resumed in Monitored mode, with the parameters and the health policies unchanged.
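To change a parameter while resuming, pass it in the same call. The sketch below lengthens the health check retry window and switches the failure action; the values are hypothetical.

```powershell
# Resume in Monitored mode and override two parameters (hypothetical values).
Update-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp `
    -UpgradeMode Monitored `
    -HealthCheckRetryTimeoutSec 600 `
    -FailureAction Rollback
```

Parameters that are not passed keep the values from the original upgrade request.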

Further troubleshooting

Service Fabric is not following the specified health policies

Possible Cause 1:

Service Fabric translates all percentages into actual numbers of entities (for example, replicas, partitions, and services) for health evaluation and always rounds up to whole entities. For example, if MaxPercentUnhealthyReplicasPerPartition is 21% and there are five replicas, then Service Fabric allows up to two unhealthy replicas (that is, Math.Ceiling(5 * 0.21)). Thus, health policies should be set accordingly.
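The rounding can be reproduced with a small helper. Get-MaxUnhealthyEntities is a hypothetical name introduced here for illustration; it only mirrors the Math.Ceiling calculation described above and is not a Service Fabric cmdlet.

```powershell
# Hypothetical helper: converts a percentage policy into a whole number of entities.
function Get-MaxUnhealthyEntities {
    param(
        [int]$TotalEntities,
        [int]$MaxPercentUnhealthy
    )
    # Round up to whole entities, as described above.
    [int][Math]::Ceiling($TotalEntities * $MaxPercentUnhealthy / 100.0)
}

Get-MaxUnhealthyEntities -TotalEntities 5 -MaxPercentUnhealthy 21   # 2
Get-MaxUnhealthyEntities -TotalEntities 5 -MaxPercentUnhealthy 20   # 1
```

With five replicas, any percentage above 20% therefore tolerates a second unhealthy replica.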

Possible Cause 2:

Health policies are specified in terms of percentages of total services and not specific service instances. For example, suppose that before an upgrade an application has four service instances A, B, C, and D, where service D is unhealthy but has little impact on the application. We want to ignore the known unhealthy service D during the upgrade, so we set the parameter MaxPercentUnhealthyServices to 25%, assuming only A, B, and C need to be healthy.

However, during the upgrade, D may become healthy while C becomes unhealthy. The upgrade would still succeed because only 25% of the services are unhealthy. However, it might result in unanticipated errors because C is unexpectedly unhealthy instead of D. In this situation, D should be modeled as a different service type from A, B, and C. Since health policies are specified per service type, different unhealthy percentage thresholds can be applied to different services.
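A per-service-type policy could be passed as sketched below. This assumes the -ServiceTypeHealthPolicyMap parameter accepts a hash table keyed by service type name, with each value being the string "MaxPercentUnhealthyPartitionsPerService,MaxPercentUnhealthyReplicasPerPartition,MaxPercentUnhealthyServices"; verify the format against the cmdlet help before use. The service type names and the version are hypothetical.

```powershell
# Hypothetical service types: A, B, and C share CriticalServiceType; D is BestEffortServiceType.
# Assumed value format: "partitions-per-service,replicas-per-partition,services" (percentages).
$policyMap = @{
    'CriticalServiceType'   = '0,0,0'     # no unhealthy services tolerated
    'BestEffortServiceType' = '0,0,100'   # service D may be unhealthy
}

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/DemoApp `
    -ApplicationTypeVersion v2 `
    -Monitored `
    -FailureAction Rollback `
    -ServiceTypeHealthPolicyMap $policyMap
```

With this split, C becoming unhealthy fails the health check even when D recovers.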

I did not specify a health policy for the application upgrade, but the upgrade still fails for some time-outs that I never specified

When health policies aren't provided in the upgrade request, they are taken from the ApplicationManifest.xml of the current application version. For example, if you're upgrading Application X from version 1.0 to version 2.0, the application health policies specified in version 1.0 are used. If a different health policy should be used for the upgrade, then the policy needs to be specified as part of the application upgrade API call. The policies specified as part of the API call only apply during the upgrade. Once the upgrade is complete, the policies specified in the ApplicationManifest.xml are used.

Incorrect time-outs are specified

You may have wondered what happens when time-outs are set inconsistently. For example, you may have an UpgradeTimeout that's less than the UpgradeDomainTimeout. The answer is that an error is returned. Errors are also returned if the UpgradeDomainTimeout is less than the sum of HealthCheckWaitDuration and HealthCheckRetryTimeout, or if UpgradeDomainTimeout is less than the sum of HealthCheckWaitDuration and HealthCheckStableDuration.
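The three consistency rules can be checked locally before starting an upgrade. Test-UpgradeTimeouts is a hypothetical helper written for this article; it mirrors the rules above and does not call Service Fabric.

```powershell
# Hypothetical helper: emits one message per violated rule, nothing when consistent.
function Test-UpgradeTimeouts {
    param(
        [TimeSpan]$UpgradeTimeout,
        [TimeSpan]$UpgradeDomainTimeout,
        [TimeSpan]$HealthCheckWaitDuration,
        [TimeSpan]$HealthCheckStableDuration,
        [TimeSpan]$HealthCheckRetryTimeout
    )
    if ($UpgradeTimeout -lt $UpgradeDomainTimeout) {
        'UpgradeTimeout is less than UpgradeDomainTimeout.'
    }
    if ($UpgradeDomainTimeout -lt ($HealthCheckWaitDuration + $HealthCheckRetryTimeout)) {
        'UpgradeDomainTimeout is less than HealthCheckWaitDuration + HealthCheckRetryTimeout.'
    }
    if ($UpgradeDomainTimeout -lt ($HealthCheckWaitDuration + $HealthCheckStableDuration)) {
        'UpgradeDomainTimeout is less than HealthCheckWaitDuration + HealthCheckStableDuration.'
    }
}

# Example: the overall time-out is shorter than the per-domain time-out.
$problems = @(Test-UpgradeTimeouts `
    -UpgradeTimeout ([TimeSpan]::FromMinutes(30)) `
    -UpgradeDomainTimeout ([TimeSpan]::FromMinutes(60)) `
    -HealthCheckWaitDuration ([TimeSpan]::FromSeconds(30)) `
    -HealthCheckStableDuration ([TimeSpan]::FromSeconds(60)) `
    -HealthCheckRetryTimeout ([TimeSpan]::FromSeconds(120)))
$problems
```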

My upgrades are taking too long

The time for an upgrade to complete depends on the health checks and time-outs specified. Health checks and time-outs depend on how long it takes to copy, deploy, and stabilize the application. Being too aggressive with time-outs might mean more failed upgrades, so we recommend starting conservatively with longer time-outs.

Here's a quick refresher on how the time-outs interact with the upgrade times:

Upgrades for an upgrade domain cannot complete faster than HealthCheckWaitDuration + HealthCheckStableDuration.

Upgrade failure cannot occur faster than HealthCheckWaitDuration + HealthCheckRetryTimeout.

The upgrade time for an upgrade domain is limited by UpgradeDomainTimeout. If HealthCheckRetryTimeout and HealthCheckStableDuration are both non-zero and the health of the application keeps switching back and forth, then the upgrade eventually times out on UpgradeDomainTimeout. UpgradeDomainTimeout starts counting down once the upgrade for the current upgrade domain begins.
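The lower bounds can be worked out with plain TimeSpan arithmetic. The durations below are the ones shown in the health check failure sample earlier in this article. The whole-upgrade bound is an inference that assumes the three upgrade domains are processed one after another.

```powershell
# Durations from the health check failure sample output.
$wait   = [TimeSpan]'00:00:00'   # HealthCheckWaitDuration
$stable = [TimeSpan]'00:00:10'   # HealthCheckStableDuration
$retry  = [TimeSpan]'00:00:10'   # HealthCheckRetryTimeout
$upgradeDomainCount = 3

# Per-domain lower bounds, as stated above.
$fastestDomainSuccess = $wait + $stable
$fastestDomainFailure = $wait + $retry

# Assumption: domains are upgraded sequentially, so the bounds add up.
$fastestFullUpgrade = [TimeSpan]::FromTicks($fastestDomainSuccess.Ticks * $upgradeDomainCount)

"Fastest domain success: $fastestDomainSuccess"   # 00:00:10
"Fastest domain failure: $fastestDomainFailure"   # 00:00:10
"Fastest full upgrade:   $fastestFullUpgrade"     # 00:00:30
```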

Next steps

Upgrading your Application Using Visual Studio walks you through an application upgrade using Visual Studio.

Upgrading your Application Using PowerShell walks you through an application upgrade using PowerShell.

Control how your application upgrades by using Upgrade Parameters.

Make your application upgrades compatible by learning how to use Data Serialization.

Learn how to use advanced functionality while upgrading your application by referring to Advanced Topics.