Use system health reports to troubleshoot

Azure Service Fabric components provide system health reports on all entities in the cluster right out of the box. The health store creates and deletes entities based on the system reports. It also organizes them in a hierarchy that captures entity interactions.

Note

To understand health-related concepts, read more at Service Fabric health model.

System health reports provide visibility into cluster and application functionality, and flag problems. For applications and services, system health reports verify that entities are implemented and are behaving correctly from the Service Fabric perspective. The reports don't monitor the health of the service's business logic or detect processes that have stopped responding. User services can enrich the health data with information specific to their logic.

Note

Health reports sent by user watchdogs are visible only after the system components create an entity. When an entity is deleted, the health store automatically deletes all the health reports associated with it. The same is true when a new instance of the entity is created, for example, when a new stateful persisted service replica instance is created. All reports associated with the old instance are deleted and cleaned up from the store.

The system component reports are identified by the source, which starts with the "System." prefix. Watchdogs can't use the same prefix for their sources, because reports with invalid parameters are rejected.
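
As a minimal sketch of the naming rule above, a watchdog could check its own source ID before it reports. The helper name and the sample IDs are hypothetical. The only rule taken from the text is that the "System." prefix is reserved for system components. The text doesn't say whether the real check is case-sensitive, so this sketch compares case-insensitively to stay on the safe side.

```python
RESERVED_PREFIX = "System."  # prefix reserved for system component sources


def is_valid_watchdog_source(source_id: str) -> bool:
    """Return True if a watchdog may use this source ID (hypothetical helper)."""
    # An empty source ID is treated as invalid in this sketch.
    if not source_id:
        return False
    # Case-insensitive comparison is an assumption, not documented behavior.
    return not source_id.lower().startswith(RESERVED_PREFIX.lower())


print(is_valid_watchdog_source("MyWatchdog"))  # True
print(is_valid_watchdog_source("System.FM"))   # False
```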

Let's look at some system reports to understand what triggers them and to learn how to correct the potential problems they represent.

Note

Service Fabric continues to add reports on conditions of interest that improve visibility into what is happening in the cluster and the applications. Existing reports can be enhanced with more details to help troubleshoot the problem faster.

Cluster system health reports

The cluster health entity is created automatically in the health store. If everything works properly, it doesn't have a system report.

Neighborhood loss

System.Federation reports an error when it detects a neighborhood loss. The report is from individual nodes, and the node ID is included in the property name. If one neighborhood is lost in the entire Service Fabric ring, you can typically expect two events, one for each side of the gap. If more neighborhoods are lost, there are more events.

The report specifies the global-lease timeout as the time-to-live (TTL). The report is resent every half of the TTL duration for as long as the condition remains active. The event is automatically removed when it expires. Remove-when-expired behavior ensures that the report is cleaned up from the health store correctly, even if the reporting node is down.
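
To make the timing concrete, here is a small sketch of the resend and expiry arithmetic. The 30-second lease timeout is only an illustrative value, not a documented default, and the function names are hypothetical. Expiry is approximated as one TTL after the last report.

```python
from datetime import datetime, timedelta


def resend_interval(ttl: timedelta) -> timedelta:
    """The report is resent every half of the TTL while the condition is active."""
    return ttl / 2


def expires_at(last_report: datetime, ttl: timedelta) -> datetime:
    """If no further report arrives, the event expires roughly one TTL later."""
    return last_report + ttl


ttl = timedelta(seconds=30)  # assumed global-lease timeout, for illustration only
last_report = datetime(2017, 7, 14, 16, 55, 0)

print(resend_interval(ttl))          # 0:00:15
print(expires_at(last_report, ttl))  # 2017-07-14 16:55:30
```

Because the resend interval is half the TTL, a healthy reporter refreshes the event before it can expire. If the reporting node goes down, the event ages out on its own.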

  • SourceId: System.Federation
  • Property: Starts with Neighborhood and includes node information.
  • Next steps: Investigate why the neighborhood is lost. For example, check the communication between cluster nodes.

Rebuild

The Failover Manager (FM) service manages information about the cluster nodes. When FM loses its data and goes into data loss, it can't guarantee that it has the most updated information about the cluster nodes. In this case, the system goes through a rebuild, and System.FM gathers data from all nodes in the cluster in order to rebuild its state. Sometimes, due to networking or node issues, rebuild can get stuck or stalled. The same can happen with the Failover Manager Master (FMM) service. The FMM is a stateless system service that keeps track of where all the FMs are in the cluster. The FMM's primary is always the node with the ID closest to 0. If that node gets dropped, a rebuild is triggered. When one of the previous conditions happens, System.FM or System.FMM flags it through an error report. Rebuild might be stuck in one of two phases:

  • Waiting for broadcast: FM/FMM waits for the broadcast message reply from the other nodes.

    • Next steps: Investigate whether there is a network connection issue between nodes.
  • Waiting for nodes: FM/FMM already received a broadcast reply from the other nodes and is waiting for a reply from specific nodes. The health report lists the nodes for which the FM/FMM is waiting for a response.

    • Next steps: Investigate the network connection between the FM/FMM and the listed nodes. Investigate each listed node for other possible issues.
  • SourceId: System.FM or System.FMM

  • Property: Rebuild.

  • Next steps: Investigate the network connection between the nodes, as well as the state of any specific nodes that are listed in the description of the health report.

Seed Node Status

System.FM reports a cluster-level warning if some seed nodes are unhealthy. Seed nodes are the nodes that maintain the availability of the underlying cluster. These nodes help to ensure the cluster remains up by establishing leases with other nodes and serving as tiebreakers during certain kinds of network failures. If a majority of the seed nodes in the cluster are down and are not brought back, the cluster automatically shuts down.

A seed node is unhealthy if its node status is Down, Removed, or Unknown. The warning report for seed node status lists all the unhealthy seed nodes with detailed information.

  • SourceId: System.FM

  • Property: SeedNodeStatus

  • Next steps: If this warning shows in the cluster, follow the instructions below to fix it. For clusters running Service Fabric version 6.5 or higher: For a Service Fabric cluster on Azure, after a seed node goes down, Service Fabric tries to change it to a non-seed node automatically. To make this happen, make sure the number of non-seed nodes in the primary node type is greater than or equal to the number of Down seed nodes. If necessary, add more nodes to the primary node type. Depending on the cluster status, it may take some time to fix the issue. Once this is done, the warning report is automatically cleared.

    For a Service Fabric standalone cluster, all the seed nodes need to become healthy before the warning report is cleared. Depending on why seed nodes are unhealthy, different actions need to be taken: if the seed node is Down, users need to bring that seed node up; if the seed node is Removed or Unknown, that seed node needs to be removed from the cluster. The warning report is automatically cleared when all seed nodes become healthy.

    For clusters running a Service Fabric version older than 6.5: In this case, the warning report needs to be cleared manually. Users should make sure all the seed nodes become healthy before clearing the report: if the seed node is Down, users need to bring that seed node up; if the seed node is Removed or Unknown, that seed node needs to be removed from the cluster. After all the seed nodes become healthy, use the following PowerShell command to clear the warning report:

    PS C:\> Send-ServiceFabricClusterHealthReport -SourceId "System.FM" -HealthProperty "SeedNodeStatus" -HealthState OK
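
The seed node rules in this section can be sketched as two small checks. This is an illustration of the stated conditions only, not Service Fabric code. The function names, the dictionary shape, and the sample node names are hypothetical.

```python
from typing import Dict, List

# Node statuses that make a seed node unhealthy, as listed in this section.
UNHEALTHY_SEED_STATUSES = {"Down", "Removed", "Unknown"}


def unhealthy_seed_nodes(seed_status: Dict[str, str]) -> List[str]:
    """Return the names of seed nodes whose status counts as unhealthy."""
    return sorted(
        name for name, status in seed_status.items()
        if status in UNHEALTHY_SEED_STATUSES
    )


def can_auto_replace(non_seed_count: int, seed_status: Dict[str, str]) -> bool:
    """Azure clusters on 6.5+: enough non-seed nodes in the primary node type
    must exist to stand in for every Down seed node."""
    down_seeds = sum(1 for status in seed_status.values() if status == "Down")
    return non_seed_count >= down_seeds


seeds = {"_Node_0": "Up", "_Node_1": "Down", "_Node_2": "Unknown"}
print(unhealthy_seed_nodes(seeds))  # ['_Node_1', '_Node_2']
print(can_auto_replace(1, seeds))   # True
print(can_auto_replace(0, seeds))   # False
```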
    

Node system health reports

System.FM, which represents the Failover Manager service, is the authority that manages information about cluster nodes. Each node should have one report from System.FM showing its state. The node entities are removed when the node state is removed. For more information, see RemoveNodeStateAsync.

Node up/down

System.FM reports as OK when the node joins the ring (it's up and running). It reports an error when the node departs the ring (it's down, either for upgrading or simply because it has failed). The health hierarchy built by the health store acts on deployed entities in correlation with System.FM node reports. It considers the node a virtual parent of all deployed entities. The deployed entities on that node are exposed through queries if the node is reported as up by System.FM, with the same instance as the instance associated with the entities. When System.FM reports that the node is down or restarted, as a new instance, the health store automatically cleans up the deployed entities that can exist only on the down node or on the previous instance of the node.

  • SourceId: System.FM
  • Property: State.
  • Next steps: If the node is down for an upgrade, it should come back up after it's been upgraded. In this case, the health state should switch back to OK. If the node doesn't come back or it fails, the problem needs more investigation.
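
The visibility rule for deployed entities can be pictured with a small model. This is a sketch of the rule as described here, not the health store's implementation. The types, field names, and sample values are hypothetical.

```python
from typing import List, NamedTuple


class NodeReport(NamedTuple):
    is_up: bool       # state reported by System.FM
    instance_id: int  # node instance that System.FM reported


class DeployedEntity(NamedTuple):
    name: str
    node_instance_id: int  # node instance the entity was reported against


def visible_entities(node: NodeReport, entities: List[DeployedEntity]) -> List[str]:
    """Expose deployed entities only if the node is up and the instances match."""
    if not node.is_up:
        return []
    return [e.name for e in entities if e.node_instance_id == node.instance_id]


entities = [
    DeployedEntity("fabric:/WordCount", 2),
    DeployedEntity("fabric:/OldApp", 1),  # reported against the previous node instance
]
print(visible_entities(NodeReport(is_up=True, instance_id=2), entities))   # ['fabric:/WordCount']
print(visible_entities(NodeReport(is_up=False, instance_id=2), entities))  # []
```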

The following example shows the System.FM event with a health state of OK for node up:

PS C:\> Get-ServiceFabricNodeHealth  _Node_0

NodeName              : _Node_0
AggregatedHealthState : Ok
HealthEvents          : 
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 8
                        SentAt                : 7/14/2017 4:54:51 PM
                        ReceivedAt            : 7/14/2017 4:55:14 PM
                        TTL                   : Infinite
                        Description           : Fabric node is up.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Ok = 7/14/2017 4:55:14 PM, LastWarning = 1/1/0001 12:00:00 AM

Certificate expiration

System.FabricNode reports a warning when certificates used by the node are near expiration. There are three certificates per node: Certificate_cluster, Certificate_server, and Certificate_default_client. When the expiration is at least two weeks away, the report health state is OK. When the expiration is within two weeks, the report type is a warning. The TTL of these events is infinite, and they are removed when a node leaves the cluster.

  • SourceId: System.FabricNode
  • Property: Starts with Certificate and contains more information about the certificate type.
  • Next steps: Update the certificates if they are near expiration.
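
The two-week threshold can be expressed as a simple comparison. This sketch only mirrors the rule stated above. The function name and sample dates are hypothetical, and the text doesn't say what is reported once a certificate has already expired, so that case isn't modeled.

```python
from datetime import datetime, timedelta

WARNING_WINDOW = timedelta(weeks=2)  # threshold stated in this section


def certificate_health_state(expires_on: datetime, now: datetime) -> str:
    """OK when expiration is at least two weeks away, otherwise Warning."""
    return "Ok" if expires_on - now >= WARNING_WINDOW else "Warning"


now = datetime(2017, 7, 14)
print(certificate_health_state(datetime(2017, 8, 14), now))  # Ok
print(certificate_health_state(datetime(2017, 7, 20), now))  # Warning
```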

Load capacity violation

The Service Fabric Load Balancer reports a warning when it detects a node capacity violation.

  • SourceId: System.PLB
  • Property: Starts with Capacity.
  • Next steps: Check the provided metrics and view the current capacity on the node.

Node capacity mismatch for resource governance metrics

System.Hosting reports a warning if defined node capacities in the cluster manifest are larger than the real node capacities for resource governance metrics (memory and CPU cores). A health report appears when the first service package that uses resource governance registers on a specified node.

  • SourceId: System.Hosting
  • Property: ResourceGovernance.
  • Next steps: This mismatch can be a problem because limits on governed service packages aren't enforced as expected and resource governance doesn't work properly. Update the cluster manifest with the correct node capacities for these metrics, or don't specify them and let Service Fabric automatically detect available resources.
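
As a rough illustration of the mismatch condition, the check below flags any metric whose capacity in the manifest exceeds what the node really has. The metric keys and numbers are made up for the example and are not the names Service Fabric uses internally.

```python
from typing import Dict, List


def mismatched_metrics(defined: Dict[str, int], real: Dict[str, int]) -> List[str]:
    """Return metrics whose manifest capacity is larger than the real capacity."""
    return sorted(
        metric for metric, capacity in defined.items()
        if metric in real and capacity > real[metric]
    )


# Hypothetical metric keys: CPU cores and memory in MB.
defined_in_manifest = {"cpu_cores": 8, "memory_mb": 16384}
detected_on_node = {"cpu_cores": 4, "memory_mb": 16384}

print(mismatched_metrics(defined_in_manifest, detected_on_node))  # ['cpu_cores']
```

Metrics that are left out of the manifest never appear in the result, which matches the option of letting Service Fabric detect available resources on its own.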

Application system health reports

System.CM, which represents the Cluster Manager service, is the authority that manages information about an application.

State

System.CM reports as OK when the application has been created or updated. It informs the health store when the application is deleted so that it can be removed from the store.

  • SourceId: System.CM
  • Property: State.
  • Next steps: If the application has been created or updated, it should include the Cluster Manager health report. Otherwise, check the state of the application by issuing a query. For example, use the PowerShell cmdlet Get-ServiceFabricApplication -ApplicationName applicationName.

The following example shows the state event on the fabric:/WordCount application:

PS C:\> Get-ServiceFabricApplicationHealth fabric:/WordCount -ServicesFilter None -DeployedApplicationsFilter None -ExcludeHealthStatistics

ApplicationName                 : fabric:/WordCount
AggregatedHealthState           : Ok
ServiceHealthStates             : None
DeployedApplicationHealthStates : None
HealthEvents                    : 
                                  SourceId              : System.CM
                                  Property              : State
                                  HealthState           : Ok
                                  SequenceNumber        : 282
                                  SentAt                : 7/13/2017 5:57:05 PM
                                  ReceivedAt            : 7/14/2017 4:55:10 PM
                                  TTL                   : Infinite
                                  Description           : Application has been created.
                                  RemoveWhenExpired     : False
                                  IsExpired             : False
                                  Transitions           : Error->Ok = 7/13/2017 5:57:05 PM, LastWarning = 1/1/0001 12:00:00 AM

Service system health reports

System.FM, which represents the Failover Manager service, is the authority that manages information about services.

State

System.FM reports as OK when the service has been created. It deletes the entity from the health store when the service is deleted.

  • SourceId: System.FM
  • Property: State.

The following example shows the state event on the service fabric:/WordCount/WordCountWebService:

PS C:\> Get-ServiceFabricServiceHealth fabric:/WordCount/WordCountWebService -ExcludeHealthStatistics

ServiceName           : fabric:/WordCount/WordCountWebService
AggregatedHealthState : Ok
PartitionHealthStates : 
                        PartitionId           : 8bbcd03a-3a53-47ec-a5f1-9b77f73c53b2
                        AggregatedHealthState : Ok

HealthEvents          : 
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 14
                        SentAt                : 7/13/2017 5:57:05 PM
                        ReceivedAt            : 7/14/2017 4:55:10 PM
                        TTL                   : Infinite
                        Description           : Service has been created.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Ok = 7/13/2017 5:57:18 PM, LastWarning = 1/1/0001 12:00:00 AM

Service correlation error

System.PLB reports an error when it detects that updating a service is correlated with another service that creates an affinity chain. The report is cleared when a successful update happens.

  • SourceId: System.PLB
  • Property: ServiceDescription.
  • Next steps: Check the correlated service descriptions.

Partition system health reports

System.FM, which represents the Failover Manager service, is the authority that manages information about service partitions.

State

System.FM reports as OK when the partition has been created and is healthy. It deletes the entity from the health store when the partition is deleted.

If the partition is below the minimum replica count, it reports an error. If the partition is not below the minimum replica count, but it's below the target replica count, it reports a warning. If the partition is in quorum loss, System.FM reports an error.
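
The replica-count rules above reduce to a short decision function. This is a sketch of the stated thresholds only. The function and parameter names are hypothetical, and the real Failover Manager weighs more conditions than these.

```python
def partition_health_state(
    replica_count: int,
    min_replica_set_size: int,
    target_replica_set_size: int,
    in_quorum_loss: bool = False,
) -> str:
    """Map replica counts to the health state described in this section."""
    # Quorum loss or falling below the minimum replica count is an error.
    if in_quorum_loss or replica_count < min_replica_set_size:
        return "Error"
    # At or above the minimum but below the target is a warning.
    if replica_count < target_replica_set_size:
        return "Warning"
    return "Ok"


print(partition_health_state(5, 2, 7))  # Warning
print(partition_health_state(1, 2, 7))  # Error
print(partition_health_state(7, 2, 7))  # Ok
```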

Other notable events include a warning when the reconfiguration takes longer than expected and when the build takes longer than expected. The expected times for the build and reconfiguration are configurable based on the service scenarios. For example, if a service has a terabyte of state, such as Azure SQL Database, the build takes longer than for a service with a small amount of state.

  • SourceId: System.FM
  • Property: State.
  • Next steps: If the health state is not OK, it's possible that some replicas have not been created, opened, or promoted to primary or secondary correctly.

If the description describes quorum loss, examine the detailed health report for replicas that are down. Bringing those replicas back up helps to bring the partition back online.

If the description describes a partition stuck in reconfiguration, the health report on the primary replica provides additional information.

For other System.FM health reports, look for reports on the replicas, the partition, or the service from other system components.

The following examples describe some of these reports.

The following example shows a healthy partition:

PS C:\> Get-ServiceFabricPartition fabric:/WordCount/WordCountWebService | Get-ServiceFabricPartitionHealth -ExcludeHealthStatistics -ReplicasFilter None

PartitionId           : 8bbcd03a-3a53-47ec-a5f1-9b77f73c53b2
AggregatedHealthState : Ok
ReplicaHealthStates   : None
HealthEvents          : 
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 70
                        SentAt                : 7/13/2017 5:57:05 PM
                        ReceivedAt            : 7/14/2017 4:55:10 PM
                        TTL                   : Infinite
                        Description           : Partition is healthy.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Ok = 7/13/2017 5:57:18 PM, LastWarning = 1/1/0001 12:00:00 AM

The following example shows the health of a partition that's below target replica count. The next step is to get the partition description, which shows how it's configured: MinReplicaSetSize is two and TargetReplicaSetSize is seven. Then get the number of nodes in the cluster, which in this case is five. So, in this case, two replicas can't be placed, because the target number of replicas is higher than the number of nodes available.

PS C:\> Get-ServiceFabricPartition fabric:/WordCount/WordCountService | Get-ServiceFabricPartitionHealth -ReplicasFilter None -ExcludeHealthStatistics

PartitionId           : af2e3e44-a8f8-45ac-9f31-4093eb897600
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.

ReplicaHealthStates   : None
HealthEvents          : 
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Warning
                        SequenceNumber        : 123
                        SentAt                : 7/14/2017 4:55:39 PM
                        ReceivedAt            : 7/14/2017 4:55:44 PM
                        TTL                   : Infinite
                        Description           : Partition is below target replica or instance count.
                        fabric:/WordCount/WordCountService 7 2 af2e3e44-a8f8-45ac-9f31-4093eb897600
                          N/S Ready _Node_2 131444422260002646
                          N/S Ready _Node_4 131444422293113678
                          N/S Ready _Node_3 131444422293113679
                          N/S Ready _Node_1 131444422293118720
                          N/P Ready _Node_0 131444422293118721
                          (Showing 5 out of 5 replicas. Total available replicas: 5)

                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 7/14/2017 4:55:44 PM, LastOk = 1/1/0001 12:00:00 AM

                        SourceId              : System.PLB
                        Property              : ServiceReplicaUnplacedHealth_Secondary_af2e3e44-a8f8-45ac-9f31-4093eb897600
                        HealthState           : Warning
                        SequenceNumber        : 131445250939703027
                        SentAt                : 7/14/2017 4:58:13 PM
                        ReceivedAt            : 7/14/2017 4:58:14 PM
                        TTL                   : 00:01:05
                        Description           : The Load Balancer was unable to find a placement for one or more of the Service's Replicas:
                        Secondary replica could not be placed due to the following constraints and properties:  
                        TargetReplicaSetSize: 7
                        Placement Constraint: N/A
                        Parent Service: N/A

                        Constraint Elimination Sequence:
                        Existing Secondary Replicas eliminated 4 possible node(s) for placement -- 1/5 node(s) remain.
                        Existing Primary Replica eliminated 1 possible node(s) for placement -- 0/5 node(s) remain.

                        Nodes Eliminated By Constraints:

                        Existing Secondary Replicas -- Nodes with Partition's Existing Secondary Replicas/Instances:
                        --
                        FaultDomain:fd:/4 NodeName:_Node_4 NodeType:NodeType4 UpgradeDomain:4 UpgradeDomain: ud:/4 Deactivation Intent/Status: None/None
                        FaultDomain:fd:/3 NodeName:_Node_3 NodeType:NodeType3 UpgradeDomain:3 UpgradeDomain: ud:/3 Deactivation Intent/Status: None/None
                        FaultDomain:fd:/2 NodeName:_Node_2 NodeType:NodeType2 UpgradeDomain:2 UpgradeDomain: ud:/2 Deactivation Intent/Status: None/None
                        FaultDomain:fd:/1 NodeName:_Node_1 NodeType:NodeType1 UpgradeDomain:1 UpgradeDomain: ud:/1 Deactivation Intent/Status: None/None

                        Existing Primary Replica -- Nodes with Partition's Existing Primary Replica or Secondary Replicas:
                        --
                        FaultDomain:fd:/0 NodeName:_Node_0 NodeType:NodeType0 UpgradeDomain:0 UpgradeDomain: ud:/0 Deactivation Intent/Status: None/None

                        RemoveWhenExpired     : True
                        IsExpired             : False
                        Transitions           : Error->Warning = 7/14/2017 4:56:14 PM, LastOk = 1/1/0001 12:00:00 AM

PS C:\> Get-ServiceFabricPartition fabric:/WordCount/WordCountService | select MinReplicaSetSize,TargetReplicaSetSize

MinReplicaSetSize TargetReplicaSetSize
----------------- --------------------
                2                    7                        

PS C:\> @(Get-ServiceFabricNode).Count
5
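
The arithmetic behind this example is small enough to write down. The sketch assumes that each node hosts at most one replica of a given partition and that every node is eligible for placement (the report above lists no placement constraint). The function name is hypothetical.

```python
def unplaced_replicas(target_replica_set_size: int, node_count: int) -> int:
    """Replicas that cannot be placed when each node holds at most one replica."""
    return max(0, target_replica_set_size - node_count)


target_replica_set_size = 7  # from the partition description
min_replica_set_size = 2     # from the partition description
node_count = 5               # from @(Get-ServiceFabricNode).Count

placed = min(target_replica_set_size, node_count)
print(unplaced_replicas(target_replica_set_size, node_count))  # 2
# Five placed replicas is above the minimum of two, so this is a warning, not an error.
print(placed >= min_replica_set_size)  # True
```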

The following example shows the health of a partition that's stuck in reconfiguration because the user code doesn't honor the cancellation token in the RunAsync method. Investigating the health report of any replica marked as primary (P) can help to drill down further into the problem.

PS C:\utilities\ServiceFabricExplorer\ClientPackage\lib> Get-ServiceFabricPartitionHealth 0e40fd81-284d-4be4-a665-13bc5a6607ec -ExcludeHealthStatistics 

PartitionId           : 0e40fd81-284d-4be4-a665-13bc5a6607ec
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', 
                        ConsiderWarningAsError=false.

HealthEvents          : 
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Warning
                        SequenceNumber        : 7
                        SentAt                : 8/27/2017 3:43:09 AM
                        ReceivedAt            : 8/27/2017 3:43:32 AM
                        TTL                   : Infinite
                        Description           : Partition reconfiguration is taking longer than expected.
                        fabric:/app/test1 3 1 0e40fd81-284d-4be4-a665-13bc5a6607ec
                          P/S Ready Node1 131482789658160654
                          S/P Ready Node2 131482789688598467
                          S/S Ready Node3 131482789688598468
                          (Showing 3 out of 3 replicas. Total available replicas: 3)                        

                        For more information see: https://aka.ms/sfhealth
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Ok->Warning = 8/27/2017 3:43:32 AM, LastError = 1/1/0001 12:00:00 AM

This health report shows the state of the replicas of the partition undergoing reconfiguration:

  P/S Ready Node1 131482789658160654
  S/P Ready Node2 131482789688598467
  S/S Ready Node3 131482789688598468

For each replica, the health report contains:

  • Previous configuration role
  • Current configuration role
  • Replica state
  • Node on which the replica is running
  • Replica ID

In a case like the example, further investigation is needed. Investigate the health of each individual replica starting with the replicas marked as Primary and Secondary (131482789658160654 and 131482789688598467) in the previous example.
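
The replica lines can be split into the five fields listed above with a few lines of Python. This is a sketch that assumes the whitespace-separated layout shown in the sample output. The type and function names are hypothetical.

```python
from typing import List, NamedTuple


class ReplicaStatus(NamedTuple):
    previous_role: str  # role in the previous configuration, for example P or S
    current_role: str   # role in the current configuration
    state: str          # replica state, for example Ready
    node: str           # node on which the replica is running
    replica_id: int


def parse_replica_line(line: str) -> ReplicaStatus:
    """Parse one line such as 'P/S Ready Node1 131482789658160654'."""
    roles, state, node, replica_id = line.split()
    previous_role, current_role = roles.split("/")
    return ReplicaStatus(previous_role, current_role, state, node, int(replica_id))


lines: List[str] = [
    "P/S Ready Node1 131482789658160654",
    "S/P Ready Node2 131482789688598467",
    "S/S Ready Node3 131482789688598468",
]
replicas = [parse_replica_line(line) for line in lines]

# Replicas whose role changes in this reconfiguration are the ones to look at first.
changing = [r.replica_id for r in replicas if r.previous_role != r.current_role]
print(changing)  # [131482789658160654, 131482789688598467]
```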

Replica constraint violation

System.PLB reports a warning if it detects a replica constraint violation and can't place all partition replicas. The report details show which constraints and properties prevent the replica placement.

  • SourceId: System.PLB
  • Property: Starts with ReplicaConstraintViolation.

Replica system health reports

System.RA, which represents the reconfiguration agent component, is the authority for the replica state.

State

System.RA reports OK when the replica has been created.

  • SourceId: System.RA
  • Property: State.

The following example shows a healthy replica:

PS C:\> Get-ServiceFabricPartition fabric:/WordCount/WordCountService | Get-ServiceFabricReplica | where {$_.ReplicaRole -eq "Primary"} | Get-ServiceFabricReplicaHealth

PartitionId           : af2e3e44-a8f8-45ac-9f31-4093eb897600
ReplicaId             : 131444422293118721
AggregatedHealthState : Ok
HealthEvents          : 
                        SourceId              : System.RA
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 131445248920273536
                        SentAt                : 7/14/2017 4:54:52 PM
                        ReceivedAt            : 7/14/2017 4:55:13 PM
                        TTL                   : Infinite
                        Description           : Replica has been created._Node_0
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Ok = 7/14/2017 4:55:13 PM, LastWarning = 1/1/0001 12:00:00 AM

ReplicaOpenStatus, ReplicaCloseStatus, ReplicaChangeRoleStatus

These properties indicate warnings or failures when attempting to open a replica, close a replica, or transition a replica from one role to another. For more information, see Replica lifecycle. The failures might be exceptions thrown from the API calls or crashes of the service host process during this time. For failures due to API calls from C# code, Service Fabric adds the exception and stack trace to the health report.

These health warnings are raised after the action has been retried locally some number of times, depending on policy. Service Fabric retries the action up to a maximum threshold. After the maximum threshold is reached, it might try to correct the situation. This attempt can cause the warnings to be cleared, because Service Fabric gives up on the action on this node. For example, if a replica is failing to open on a node, Service Fabric raises a health warning. If the replica continues to fail to open, Service Fabric acts to self-repair. This action might involve trying the same operation on another node, which causes the warning raised for this replica to be cleared.

  • SourceId: System.RA
  • Property: ReplicaOpenStatus, ReplicaCloseStatus, and ReplicaChangeRoleStatus.
  • Next steps: Investigate the service code or crash dumps to identify why the operation is failing.

The following example shows the health of a replica that's throwing TargetInvocationException from its open method. The description contains the point of failure (IStatefulServiceReplica.Open), the exception type (TargetInvocationException), and the stack trace.

PS C:\> Get-ServiceFabricReplicaHealth -PartitionId 337cf1df-6cab-4825-99a9-7595090c0b1b -ReplicaOrInstanceId 131483509874784794

PartitionId           : 337cf1df-6cab-4825-99a9-7595090c0b1b
ReplicaId             : 131483509874784794
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.RA', Property='ReplicaOpenStatus', HealthState='Warning', 
                        ConsiderWarningAsError=false.

HealthEvents          : 
                        SourceId              : System.RA
                        Property              : ReplicaOpenStatus
                        HealthState           : Warning
                        SequenceNumber        : 131483510001453159
                        SentAt                : 8/27/2017 11:43:20 PM
                        ReceivedAt            : 8/27/2017 11:43:21 PM
                        TTL                   : Infinite
                        Description           : Replica had multiple failures during open on _Node_0 API call: IStatefulServiceReplica.Open(); Error = System.Reflection.TargetInvocationException (-2146232828)
                                                Exception has been thrown by the target of an invocation.
                                                   at Microsoft.ServiceFabric.Replicator.RecoveryManager.d__31.MoveNext()
                                                --- End of stack trace from previous location where exception was thrown ---
                                                   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
                                                   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                                                   at Microsoft.ServiceFabric.Replicator.LoggingReplicator.d__137.MoveNext()
                                                --- End of stack trace from previous location where exception was thrown ---
                                                   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
                                                   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                                                   at Microsoft.ServiceFabric.Replicator.DynamicStateManager.d__109.MoveNext()
                                                --- End of stack trace from previous location where exception was thrown ---
                                                   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
                                                   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                                                   at Microsoft.ServiceFabric.Replicator.TransactionalReplicator.d__79.MoveNext()
                                                --- End of stack trace from previous location where exception was thrown ---
                                                   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
                                                   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                                                   at Microsoft.ServiceFabric.Replicator.StatefulServiceReplica.d__21.MoveNext()
                                                --- End of stack trace from previous location where exception was thrown ---
                                                   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
                                                   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                                                   at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.d__0.MoveNext()
    
                                                    For more information see: https://aka.ms/sfhealth
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 8/27/2017 11:43:21 PM, LastOk = 1/1/0001 12:00:00 AM                        

The following example shows a replica that's constantly crashing during close:

PS C:\> Get-ServiceFabricReplicaHealth -PartitionId dcafb6b7-9446-425c-8b90-b3fdf3859e64 -ReplicaOrInstanceId 131483565548493142

PartitionId           : dcafb6b7-9446-425c-8b90-b3fdf3859e64
ReplicaId             : 131483565548493142
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.RA', Property='ReplicaCloseStatus', HealthState='Warning', 
                        ConsiderWarningAsError=false.

HealthEvents          : 
                        SourceId              : System.RA
                        Property              : ReplicaCloseStatus
                        HealthState           : Warning
                        SequenceNumber        : 131483565611258984
                        SentAt                : 8/28/2017 1:16:01 AM
                        ReceivedAt            : 8/28/2017 1:16:03 AM
                        TTL                   : Infinite
                        Description           : Replica had multiple failures during close on _Node_1. The application 
                        host has crashed.

                        For more information see: https://aka.ms/sfhealth
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 8/28/2017 1:16:03 AM, LastOk = 1/1/0001 12:00:00 AM

Reconfiguration

This property indicates when a replica performing a reconfiguration detects that the reconfiguration is stalled or stuck. The health report is usually on the replica whose current role is primary. The exception is a swap primary reconfiguration, where the report might be on the replica that's being demoted from primary to active secondary.

The reconfiguration can be stuck for one of the following reasons:

  • An action on the local replica (the same replica as the one performing the reconfiguration) is not completing. In this case, investigating the health reports on this replica from other components, System.RAP or System.RE, might provide additional information.

  • An action is not completing on a remote replica. Replicas for which actions are pending are listed in the health report. Investigate the health reports for those remote replicas further. There might also be communication problems between this node and the remote node.

In rare cases, the reconfiguration can be stuck due to communication or other problems between this node and the Failover Manager service.

  • SourceId: System.RA
  • Property: Reconfiguration.
  • Next steps: Investigate local or remote replicas, depending on the description of the health report.

The following example shows a health report where a reconfiguration is stuck on the local replica. In this example, the cause is a service that doesn't honor the cancellation token.

PS C:\> Get-ServiceFabricReplicaHealth -PartitionId 9a0cedee-464c-4603-abbc-1cf57c4454f3 -ReplicaOrInstanceId 131483600074836703

PartitionId           : 9a0cedee-464c-4603-abbc-1cf57c4454f3
ReplicaId             : 131483600074836703
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.RA', Property='Reconfiguration', HealthState='Warning', 
                        ConsiderWarningAsError=false.

HealthEvents          : 
                        SourceId              : System.RA
                        Property              : Reconfiguration
                        HealthState           : Warning
                        SequenceNumber        : 131483600309264482
                        SentAt                : 8/28/2017 2:13:50 AM
                        ReceivedAt            : 8/28/2017 2:13:57 AM
                        TTL                   : Infinite
                        Description           : Reconfiguration is stuck. Waiting for response from the local replica

                        For more information see: https://aka.ms/sfhealth
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 8/28/2017 2:13:57 AM, LastOk = 1/1/0001 12:00:00 AM

The following example shows a health report where a reconfiguration is stuck waiting for a response from two remote replicas. In this example, there are three replicas in the partition, including the current primary.

PS C:\> Get-ServiceFabricReplicaHealth -PartitionId  579d50c6-d670-4d25-af70-d706e4bc19a2 -ReplicaOrInstanceId 131483956274977415

PartitionId           : 579d50c6-d670-4d25-af70-d706e4bc19a2
ReplicaId             : 131483956274977415
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.RA', Property='Reconfiguration', HealthState='Warning', ConsiderWarningAsError=false.

HealthEvents          : 
                        SourceId              : System.RA
                        Property              : Reconfiguration
                        HealthState           : Warning
                        SequenceNumber        : 131483960376212469
                        SentAt                : 8/28/2017 12:13:57 PM
                        ReceivedAt            : 8/28/2017 12:14:07 PM
                        TTL                   : Infinite
                        Description           : Reconfiguration is stuck. Waiting for response from 2 replicas

                        Pending Replicas: 
                        P/I Down 40 131483956244554282
                        S/S Down 20 131483956274972403

                        For more information see: https://aka.ms/sfhealth
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 8/28/2017 12:07:37 PM, LastOk = 1/1/0001 12:00:00 AM

This health report shows that the reconfiguration is stuck waiting for a response from two replicas:

    P/I Down 40 131483956244554282
    S/S Down 20 131483956274972403

For each replica, the following information is given, in this order:

  • Previous configuration role
  • Current configuration role
  • Replica state
  • Node ID
  • Replica ID

To unblock the reconfiguration:

  • The down replicas should be brought up.
  • The inbuild replicas should complete the build and transition to ready.
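
When a report lists many pending replicas, it can help to split each line into the five fields described above. The following Python sketch is illustrative only: the PendingReplica type and parse_pending_replica function are hypothetical names, and the sketch assumes the whitespace-separated layout seen in the sample report, which might differ in other Service Fabric versions.

```python
from typing import NamedTuple


class PendingReplica(NamedTuple):
    previous_role: str  # role in the previous configuration, for example "P"
    current_role: str   # role in the current configuration, for example "I"
    state: str          # replica state, for example "Down"
    node_id: str
    replica_id: int


def parse_pending_replica(line: str) -> PendingReplica:
    """Split one 'Pending Replicas' line into its five documented fields."""
    roles, state, node_id, replica_id = line.split()
    previous_role, current_role = roles.split("/")
    return PendingReplica(previous_role, current_role, state, node_id, int(replica_id))


# The two lines from the sample health report.
lines = [
    "P/I Down 40 131483956244554282",
    "S/S Down 20 131483956274972403",
]
pending = [parse_pending_replica(line) for line in lines]

# Replicas that must be brought up before the reconfiguration can proceed.
down = [p.replica_id for p in pending if p.state == "Down"]
```

Here, down contains both replica IDs from the report, which matches the guidance that the down replicas should be brought up.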

Slow service API call

System.RAP and System.Replicator report a warning if a call to the user service code takes longer than the configured time. The warning is cleared when the call completes.

  • SourceId: System.RAP or System.Replicator
  • Property: The name of the slow API. The description provides more details about how long the API has been pending.
  • Next steps: Investigate why the call takes longer than expected.

The following example shows the health event from System.RAP for a reliable service that's not honoring the cancellation token in RunAsync:

PS C:\> Get-ServiceFabricReplicaHealth -PartitionId 5f6060fb-096f-45e4-8c3d-c26444d8dd10 -ReplicaOrInstanceId 131483966141404693

PartitionId           : 5f6060fb-096f-45e4-8c3d-c26444d8dd10
ReplicaId             : 131483966141404693
AggregatedHealthState : Warning
UnhealthyEvaluations  : 
                        Unhealthy event: SourceId='System.RA', Property='Reconfiguration', HealthState='Warning', ConsiderWarningAsError=false.

HealthEvents          :                         
                        SourceId              : System.RAP
                        Property              : IStatefulServiceReplica.ChangeRole(S)Duration
                        HealthState           : Warning
                        SequenceNumber        : 131483966663476570
                        SentAt                : 8/28/2017 12:24:26 PM
                        ReceivedAt            : 8/28/2017 12:24:56 PM
                        TTL                   : Infinite
                        Description           : The api IStatefulServiceReplica.ChangeRole(S) on _Node_1 is stuck. Start Time (UTC): 2017-08-28 12:23:56.347.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Warning = 8/28/2017 12:24:56 PM, LastOk = 1/1/0001 12:00:00 AM
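
The description in the output above names the stuck API, the node, and the UTC start time of the call. As a rough sketch, the following Python code extracts those values and computes how long the call has been pending. The function names are hypothetical, and the regular expression assumes the description wording shown in this sample, which is not a documented contract.

```python
import re
from datetime import datetime, timezone

# Assumed pattern, derived only from the sample description text.
_STUCK = re.compile(
    r"The api (?P<api>\S+) on (?P<node>\S+) is stuck\. "
    r"Start Time \(UTC\): (?P<start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def parse_stuck_api(description: str):
    """Return (api, node, start_time_utc), or None if the text doesn't match."""
    match = _STUCK.search(description)
    if match is None:
        return None
    start = datetime.strptime(match["start"], "%Y-%m-%d %H:%M:%S.%f")
    return match["api"], match["node"], start.replace(tzinfo=timezone.utc)


def pending_seconds(start: datetime, now: datetime) -> float:
    """Seconds elapsed between the API start time and 'now' (both timezone-aware)."""
    return (now - start).total_seconds()


description = (
    "The api IStatefulServiceReplica.ChangeRole(S) on _Node_1 is stuck. "
    "Start Time (UTC): 2017-08-28 12:23:56.347."
)
api, node, start = parse_stuck_api(description)
```

In practice, pending_seconds(start, datetime.now(timezone.utc)) gives the current pending time for a call that is still stuck.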

The property and text indicate which API got stuck. The next steps differ depending on which API is stuck. A stuck API on IStatefulServiceReplica or IStatelessServiceInstance usually indicates a bug in the service code. The following list describes how these APIs translate to the Reliable Services model:

  • IStatefulServiceReplica.Open: This warning indicates that a call to CreateServiceInstanceListeners, ICommunicationListener.OpenAsync, or if overridden, OnOpenAsync is stuck.

  • IStatefulServiceReplica.Close and IStatefulServiceReplica.Abort: The most common case is a service not honoring the cancellation token passed in to RunAsync. It might also be that ICommunicationListener.CloseAsync, or if overridden, OnCloseAsync is stuck.

  • IStatefulServiceReplica.ChangeRole(S) and IStatefulServiceReplica.ChangeRole(N): The most common case is a service not honoring the cancellation token passed in to RunAsync. In this scenario, the best solution is to restart the replica.

  • IStatefulServiceReplica.ChangeRole(P): The most common case is that the service has not returned a task from RunAsync.

Other API calls that can get stuck are on the IReplicator interface. For example:

  • IReplicator.CatchupReplicaSet: This warning indicates one of two things. Either there are insufficient up replicas, or the replicas are not acknowledging operations. To check for insufficient up replicas, look at the replica status of the replicas in the partition or at the System.FM health report for a stuck reconfiguration. To check whether replicas are acknowledging operations, use the PowerShell cmdlet Get-ServiceFabricDeployedReplicaDetail to determine the progress of all the replicas. The problem lies with replicas whose LastAppliedReplicationSequenceNumber value is behind the primary's CommittedSequenceNumber value.

  • IReplicator.BuildReplica(<Remote ReplicaId>): This warning indicates a problem in the build process. For more information, see Replica lifecycle. It might be due to a misconfiguration of the replicator address. For more information, see Configure stateful Reliable Services and Specify resources in a service manifest. It might also be a problem on the remote node.
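
The sequence-number comparison described for IReplicator.CatchupReplicaSet can be sketched in a few lines of Python. This is only an illustration: the lagging_replicas function is a hypothetical helper, and the numbers and replica labels are made-up values standing in for what you would copy from the Get-ServiceFabricDeployedReplicaDetail output of each replica.

```python
def lagging_replicas(primary_committed_lsn: int, last_applied_by_replica: dict) -> dict:
    """Return {replica: gap} for replicas behind the primary, largest gap first."""
    gaps = {
        replica: primary_committed_lsn - last_applied
        for replica, last_applied in last_applied_by_replica.items()
        if last_applied < primary_committed_lsn
    }
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


# Hypothetical values: the primary's CommittedSequenceNumber and each
# replica's LastAppliedReplicationSequenceNumber.
primary_committed = 1500
last_applied = {"replica-A": 1500, "replica-B": 1420, "replica-C": 900}

behind = lagging_replicas(primary_committed, last_applied)
```

With these sample values, replica-C has the largest gap, so it is the first candidate to investigate.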

Replicator system health reports

Replication queue full: System.Replicator reports a warning when the replication queue is full. On the primary, the replication queue usually becomes full because one or more secondary replicas are slow to acknowledge operations. On the secondary, this usually happens when the service is slow to apply the operations. The warning is cleared when the queue is no longer full.

  • SourceId: System.Replicator
  • Property: PrimaryReplicationQueueStatus or SecondaryReplicationQueueStatus, depending on the replica role.
  • Next steps: If the report is on the primary, check the connection between the nodes in the cluster. If all connections are healthy, there might be at least one slow secondary with a high disk latency to apply operations. If the report is on the secondary, check the disk usage and performance on the node first. Then check the outgoing connection from the slow node to the primary.

RemoteReplicatorConnectionStatus: System.Replicator on the primary replica reports a warning when the connection to a secondary (remote) replicator is not healthy. The remote replicator's address is shown in the report's message, which makes it easier to detect whether the wrong configuration was passed in or whether there are network issues between the replicators.

  • SourceId: System.Replicator
  • Property: RemoteReplicatorConnectionStatus.
  • Next steps: Check the error message and make sure the remote replicator address is configured correctly. For example, if the remote replicator is opened with the "localhost" listen address, it isn't reachable from the outside. If the address looks correct, check the connection between the primary node and the remote address to find any potential network issues.

Slow Naming operations

System.NamingService reports the health on its primary replica when a Naming operation takes longer than acceptable. Examples of Naming operations are CreateServiceAsync and DeleteServiceAsync. More methods can be found under FabricClient, for example, under service management methods or property management methods.

Note

The Naming service resolves service names to a location in the cluster. Users can use it to manage service names and properties. It's a Service Fabric partitioned-persisted service. One of the partitions represents the Authority Owner, which contains metadata about all Service Fabric names and services. The Service Fabric names are mapped to different partitions, called Name Owner partitions, so the service is extensible. Read more about the Naming service.

When a Naming operation takes longer than expected, the operation is flagged with a warning report on the primary replica of the Naming service partition that serves the operation. If the operation completes successfully, the warning is cleared. If the operation completes with an error, the health report includes details about the error.

  • SourceId: System.NamingService
  • Property: Starts with the prefix "Duration_" and identifies the slow operation and the Service Fabric name on which the operation is applied. For example, if create service at name fabric:/MyApp/MyService takes too long, the property is Duration_AOCreateService.fabric:/MyApp/MyService. "AO" points to the role of the Naming partition for this name and operation.
  • Next steps: Check to see why the Naming operation fails. Each operation can have different root causes. For example, the delete service might be stuck. The service might be stuck because the application host keeps crashing on a node due to a user bug in the service code.
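
The property naming scheme above can be taken apart mechanically, which is handy when filtering many health events. The following Python sketch is an assumption-laden illustration: parse_naming_property is a hypothetical helper, and it assumes that the role tag is always "AO" or "NO" (the only two values that appear in this article) and that the operation name contains no period.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: Duration_<role><operation>.<Service Fabric name>
_PATTERN = re.compile(r"^Duration_(AO|NO)([^.]+)\.(.+)$")


class NamingDuration(NamedTuple):
    role: str       # "AO" or "NO", the Naming partition role
    operation: str  # for example "CreateService"
    name: str       # the Service Fabric name the operation applies to


def parse_naming_property(prop: str) -> Optional[NamingDuration]:
    """Split a System.NamingService property; return None for other properties."""
    match = _PATTERN.match(prop)
    return NamingDuration(*match.groups()) if match else None


parsed = parse_naming_property("Duration_AOCreateService.fabric:/MyApp/MyService")
```

For the sample property, parsed.role is "AO", parsed.operation is "CreateService", and parsed.name is "fabric:/MyApp/MyService".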

The following example shows a create service operation that took longer than the configured duration. "AO" retries and sends work to "NO." "NO" completed the last operation with TIMEOUT. In this case, the same replica is primary for both the "AO" and "NO" roles.

PartitionId           : 00000000-0000-0000-0000-000000001000
ReplicaId             : 131064359253133577
AggregatedHealthState : Warning
UnhealthyEvaluations  :
                        Unhealthy event: SourceId='System.NamingService', Property='Duration_AOCreateService.fabric:/MyApp/MyService', HealthState='Warning', ConsiderWarningAsError=false.

HealthEvents          :
                        SourceId              : System.RA
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 131064359308715535
                        SentAt                : 4/29/2016 8:38:50 PM
                        ReceivedAt            : 4/29/2016 8:39:08 PM
                        TTL                   : Infinite
                        Description           : Replica has been created.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Error->Ok = 4/29/2016 8:39:08 PM, LastWarning = 1/1/0001 12:00:00 AM

                        SourceId              : System.NamingService
                        Property              : Duration_AOCreateService.fabric:/MyApp/MyService
                        HealthState           : Warning
                        SequenceNumber        : 131064359526778775
                        SentAt                : 4/29/2016 8:39:12 PM
                        ReceivedAt            : 4/29/2016 8:39:38 PM
                        TTL                   : 00:05:00
                        Description           : The AOCreateService started at 2016-04-29 20:39:08.677 is taking longer than 30.000.
                        RemoveWhenExpired     : True
                        IsExpired             : False
                        Transitions           : Error->Warning = 4/29/2016 8:39:38 PM, LastOk = 1/1/0001 12:00:00 AM

                        SourceId              : System.NamingService
                        Property              : Duration_NOCreateService.fabric:/MyApp/MyService
                        HealthState           : Warning
                        SequenceNumber        : 131064360657607311
                        SentAt                : 4/29/2016 8:41:05 PM
                        ReceivedAt            : 4/29/2016 8:41:08 PM
                        TTL                   : 00:00:15
                        Description           : The NOCreateService started at 2016-04-29 20:39:08.689 completed with FABRIC_E_TIMEOUT in more than 30.000.
                        RemoveWhenExpired     : True
                        IsExpired             : False
                        Transitions           : Error->Warning = 4/29/2016 8:39:38 PM, LastOk = 1/1/0001 12:00:00 AM

DeployedApplication system health reports

System.Hosting is the authority on deployed entities.

Activation

System.Hosting reports OK when an application has been successfully activated on the node. Otherwise, it reports an error.

  • SourceId: System.Hosting
  • Property: Activation, including the rollout version.
  • Next steps: If the application is unhealthy, investigate why the activation failed.

The following example shows a successful activation:

PS C:\> Get-ServiceFabricDeployedApplicationHealth -NodeName _Node_1 -ApplicationName fabric:/WordCount -ExcludeHealthStatistics

ApplicationName                    : fabric:/WordCount
NodeName                           : _Node_1
AggregatedHealthState              : Ok
DeployedServicePackageHealthStates : 
                                     ServiceManifestName   : WordCountServicePkg
                                     ServicePackageActivationId : 
                                     NodeName              : _Node_1
                                     AggregatedHealthState : Ok

HealthEvents                       : 
                                     SourceId              : System.Hosting
                                     Property              : Activation
                                     HealthState           : Ok
                                     SequenceNumber        : 131445249083836329
                                     SentAt                : 7/14/2017 4:55:08 PM
                                     ReceivedAt            : 7/14/2017 4:55:14 PM
                                     TTL                   : Infinite
                                     Description           : The application was activated successfully.
                                     RemoveWhenExpired     : False
                                     IsExpired             : False
                                     Transitions           : Error->Ok = 7/14/2017 4:55:14 PM, LastWarning = 1/1/0001 12:00:00 AM

Download

System.Hosting reports an error if the application package download fails.

  • SourceId: System.Hosting
  • Property: Download, including the rollout version.
  • Next steps: Investigate why the download failed on the node.

DeployedServicePackage system health reports

System.Hosting is the authority on deployed entities.

Service package activation

System.Hosting reports OK if the service package activation on the node is successful. Otherwise, it reports an error.

  • SourceId: System.Hosting
  • Property: Activation.
  • Next steps: Investigate why the activation failed.

Code package activation

System.Hosting reports OK for each code package if the activation is successful. If the activation fails, it reports a warning as configured. If a code package fails to activate or terminates with an error count greater than the configured CodePackageHealthErrorThreshold, Hosting reports an error. If a service package contains multiple code packages, an activation report is generated for each one.

  • SourceId: System.Hosting
  • Property: Uses the prefix CodePackageActivation and contains the name of the code package and the entry point as CodePackageActivation:CodePackageName:SetupEntryPoint/EntryPoint. For example, CodePackageActivation:Code:SetupEntryPoint.

Service type registration

System.Hosting reports as OK if the service type has been registered successfully. It reports an error if the registration wasn't done in time, as configured by using ServiceTypeRegistrationTimeout. If the runtime is closed, the service type is unregistered from the node and System.Hosting reports a warning.

  • SourceId: System.Hosting
  • Property: Uses the prefix ServiceTypeRegistration and contains the service type name. For example, ServiceTypeRegistration:FileStoreServiceType.
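
The service type name can be recovered from the property in the same way. This is a small illustrative sketch in Python; the helper name is hypothetical, and the only thing it relies on is the ServiceTypeRegistration prefix described above.

```python
from typing import Optional

PREFIX = "ServiceTypeRegistration:"


def service_type_from_property(prop: str) -> Optional[str]:
    """Return the service type name from 'ServiceTypeRegistration:<name>'.

    Returns None when the property has a different prefix or an empty name.
    """
    if not prop.startswith(PREFIX):
        return None
    name = prop[len(PREFIX):]
    return name or None


if __name__ == "__main__":
    for prop in ("ServiceTypeRegistration:FileStoreServiceType", "Activation"):
        print(prop, "->", service_type_from_property(prop))
```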

The following example shows a healthy deployed service package:

PS C:\> Get-ServiceFabricDeployedServicePackageHealth -NodeName _Node_1 -ApplicationName fabric:/WordCount -ServiceManifestName WordCountServicePkg

ApplicationName            : fabric:/WordCount
ServiceManifestName        : WordCountServicePkg
ServicePackageActivationId : 
NodeName                   : _Node_1
AggregatedHealthState      : Ok
HealthEvents               : 
                             SourceId              : System.Hosting
                             Property              : Activation
                             HealthState           : Ok
                             SequenceNumber        : 131445249084026346
                             SentAt                : 7/14/2017 4:55:08 PM
                             ReceivedAt            : 7/14/2017 4:55:14 PM
                             TTL                   : Infinite
                             Description           : The ServicePackage was activated successfully.
                             RemoveWhenExpired     : False
                             IsExpired             : False
                             Transitions           : Error->Ok = 7/14/2017 4:55:14 PM, LastWarning = 1/1/0001 12:00:00 AM

                             SourceId              : System.Hosting
                             Property              : CodePackageActivation:Code:EntryPoint
                             HealthState           : Ok
                             SequenceNumber        : 131445249084306362
                             SentAt                : 7/14/2017 4:55:08 PM
                             ReceivedAt            : 7/14/2017 4:55:14 PM
                             TTL                   : Infinite
                             Description           : The CodePackage was activated successfully.
                             RemoveWhenExpired     : False
                             IsExpired             : False
                             Transitions           : Error->Ok = 7/14/2017 4:55:14 PM, LastWarning = 1/1/0001 12:00:00 AM

                             SourceId              : System.Hosting
                             Property              : ServiceTypeRegistration:WordCountServiceType
                             HealthState           : Ok
                             SequenceNumber        : 131445249088096842
                             SentAt                : 7/14/2017 4:55:08 PM
                             ReceivedAt            : 7/14/2017 4:55:14 PM
                             TTL                   : Infinite
                             Description           : The ServiceType was registered successfully.
                             RemoveWhenExpired     : False
                             IsExpired             : False
                             Transitions           : Error->Ok = 7/14/2017 4:55:14 PM, LastWarning = 1/1/0001 12:00:00 AM
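
If console output like the above has been captured as text (for example, in a log), the individual health events can be pulled out with a few lines of Python. This is a rough sketch under stated assumptions, not a supported tool: it assumes the "Key : Value" layout shown above, assumes every event starts with a SourceId line, and ignores the top-level fields that precede the first event. The function name and the shortened sample are illustrative only.

```python
from typing import Dict, List, Optional

# Shortened copy of the output above (two of the three events, fewer fields).
SAMPLE = """
AggregatedHealthState      : Ok
HealthEvents               :
                             SourceId              : System.Hosting
                             Property              : Activation
                             HealthState           : Ok
                             SentAt                : 7/14/2017 4:55:08 PM

                             SourceId              : System.Hosting
                             Property              : CodePackageActivation:Code:EntryPoint
                             HealthState           : Ok
                             SentAt                : 7/14/2017 4:55:08 PM
"""


def parse_health_events(output: str) -> List[Dict[str, str]]:
    """Collect one dict per health event from captured console text."""
    events: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for line in output.splitlines():
        # Split on the first colon only: values such as timestamps and
        # property names contain colons of their own.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "SourceId":
            # Assumption: a SourceId line marks the start of a new event.
            current = {}
            events.append(current)
        if current is not None:
            current[key] = value
    return events


if __name__ == "__main__":
    for event in parse_health_events(SAMPLE):
        print(event["Property"], event["HealthState"])
```

Filtering the resulting list on HealthState is then enough to find the events that need attention, such as a failed Activation or a ServiceTypeRegistration that timed out.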

Download

System.Hosting reports an error if the service package download fails.

  • SourceId: System.Hosting
  • Property: Download, including the rollout version.
  • Next steps: Investigate why the download failed on the node.

Upgrade validation

System.Hosting reports an error if validation during the upgrade fails or if the upgrade fails on the node.

  • SourceId: System.Hosting
  • Property: Uses the prefix FabricUpgradeValidation and contains the upgrade version.
  • Description: Points to the error encountered.

Undefined node capacity for resource governance metrics

System.Hosting reports a warning if node capacities aren't defined in the cluster manifest and automatic detection is turned off in the configuration. Service Fabric raises a health warning whenever a service package that uses resource governance registers on the node.

  • SourceId: System.Hosting
  • Property: ResourceGovernance.
  • Next steps: The preferred way to overcome this problem is to change the cluster manifest to enable automatic detection of available resources. Another way is to update the cluster manifest with correctly specified node capacities for these metrics.

Next steps