Add custom Service Fabric health reports

Azure Service Fabric introduces a health model designed to flag unhealthy cluster and application conditions on specific entities. The health model uses health reporters (system components and watchdogs). The goal is easy and fast diagnosis and repair. Service writers need to think upfront about health. Any condition that can impact health should be reported, especially if it can help flag problems close to the root cause. Health information can save time and effort on debugging and investigation. The usefulness is especially clear once the service is up and running at scale in the cloud (private or Azure).

The Service Fabric reporters monitor identified conditions of interest. They report on those conditions based on their local view. The health store aggregates health data sent by all reporters to determine whether entities are globally healthy. The model is intended to be rich, flexible, and easy to use. The quality of the health reports determines the accuracy of the health view of the cluster. False positives that wrongly show unhealthy issues can negatively impact upgrades or other services that use health data. Examples of such services are repair services and alerting mechanisms. Therefore, some thought is needed to provide reports that capture conditions of interest in the best possible way.

To design and implement health reporting, watchdogs and system components must:

  • Define the condition they are interested in, the way it is monitored, and the impact on the cluster or application functionality. Based on this information, decide on the health report property and health state.
  • Determine the entity that the report applies to.
  • Determine where the reporting is done: from within the service, or from an internal or external watchdog.
  • Define a source used to identify the reporter.
  • Choose a reporting strategy, either periodically or on transitions. Periodic reporting is recommended, as it requires simpler code and is less prone to errors.
  • Determine how long the report for unhealthy conditions should stay in the health store and how it should be cleared. Using this information, decide the report's time to live and remove-on-expiration behavior.

As mentioned, reporting can be done from:

  • The monitored Service Fabric service replica.
  • Internal watchdogs deployed as a Service Fabric service (for example, a Service Fabric stateless service that monitors conditions and issues reports). The watchdogs can be deployed on all nodes or can be affinitized to the monitored service.
  • Internal watchdogs that run on the Service Fabric nodes but are not implemented as Service Fabric services.
  • External watchdogs that probe the resource from outside the Service Fabric cluster (for example, a monitoring service like Gomez).


Out of the box, the cluster is populated with health reports sent by the system components. For more information, see Using system health reports for troubleshooting. User reports must be sent on health entities that have already been created by the system.

Once the health reporting design is clear, health reports can be sent easily. You can use FabricClient to report health if the cluster is not secure or if the fabric client has admin privileges. Reporting can be done through the API by using FabricClient.HealthManager.ReportHealth, through PowerShell, or through REST. Configuration settings control how reports are batched for improved performance.


The report health call is synchronous, and it represents only the validation work on the client side. The fact that the report is accepted by the health client or by the Partition or CodePackageActivationContext objects doesn't mean that it is applied in the store. The report is sent asynchronously and possibly batched with other reports. The processing on the server may still fail: for example, the sequence number could be stale, or the entity on which the report must be applied may have been deleted.

Health client

The health reports are sent to the health manager through a health client, which lives inside the fabric client. The health manager saves reports in the health store. The health client can be configured with the following settings:

  • HealthReportSendInterval: The delay between the time the report is added to the client and the time it is sent to the health manager. Used to batch reports into a single message, rather than sending one message for each report. The batching improves performance. Default: 30 seconds.
  • HealthReportRetrySendInterval: The interval at which the health client resends accumulated health reports to the health manager. Default: 30 seconds, minimum: 1 second.
  • HealthOperationTimeout: The timeout period for a report message sent to the health manager. If a message times out, the health client retries it until the health manager confirms that the report has been processed. Default: two minutes.


When the reports are batched, the fabric client must be kept alive for at least the HealthReportSendInterval to ensure that they are sent. If a message is lost or the health manager cannot apply the reports due to transient errors, the fabric client must be kept alive longer to give it a chance to retry.

The buffering on the client takes the uniqueness of the reports into consideration. For example, if a particular bad reporter is sending 100 reports per second on the same property of the same entity, each report replaces the previous one, so at most one such report exists in the client queue. If batching is configured, only one such report is sent to the health manager per send interval. This report is the last one added, which reflects the most current state of the entity. Specify configuration parameters when FabricClient is created by passing FabricClientSettings with the desired values for health-related entries.

The following example creates a fabric client and specifies that the reports should be sent as soon as they are added. On timeouts and errors that can be retried, retries happen every 40 seconds.

var clientSettings = new FabricClientSettings()
{
    HealthOperationTimeout = TimeSpan.FromSeconds(120),
    // A send interval of 0 disables batching: reports are sent as soon as they are added.
    HealthReportSendInterval = TimeSpan.FromSeconds(0),
    HealthReportRetrySendInterval = TimeSpan.FromSeconds(40),
};
var fabricClient = new FabricClient(clientSettings);

We recommend keeping the default fabric client settings, which set HealthReportSendInterval to 30 seconds. This setting ensures optimal performance due to batching. For critical reports that must be sent as soon as possible, use HealthReportSendOptions with Immediate set to true in the FabricClient.HealthManager.ReportHealth API. Immediate reports bypass the batching interval. Use this flag with care; we want to take advantage of the health client batching whenever possible. Immediate send is also useful when the fabric client is closing (for example, the process has detected an invalid state and needs to shut down to prevent side effects). It ensures a best-effort send of the accumulated reports. When one report is added with the Immediate flag, the health client sends it in a batch together with all the reports accumulated since the last send.
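As a minimal sketch of an immediate send, assuming the Service Fabric SDK (System.Fabric) is referenced and the client can reach the cluster. The application name, source ID, and property used here are hypothetical placeholders, not values from this article:

```csharp
using System;
using System.Fabric;
using System.Fabric.Health;

public static class CriticalReportSketch
{
    public static void ReportCritical(FabricClient client)
    {
        // Hypothetical source ID, property, and application name, for illustration only.
        var info = new HealthInformation("MyWatchdog", "StateValidation", HealthState.Error)
        {
            Description = "Invalid state detected; the process is shutting down."
        };
        var report = new ApplicationHealthReport(new Uri("fabric:/MyApp"), info);

        // Immediate = true bypasses HealthReportSendInterval for this report.
        var sendOptions = new HealthReportSendOptions { Immediate = true };
        client.HealthManager.ReportHealth(report, sendOptions);
    }
}
```

The call still only validates and queues the report on the client side, so the fabric client should stay alive long enough for the send to complete.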

The same parameters can be specified when a connection to a cluster is created through PowerShell. The following example starts a connection to a local cluster:

PS C:\> Connect-ServiceFabricCluster -HealthOperationTimeoutInSec 120 -HealthReportSendIntervalInSec 0 -HealthReportRetrySendIntervalInSec 40

ConnectionEndpoint   :
FabricClientSettings : {
                       ClientFriendlyName                   : PowerShell-1944858a-4c6d-465f-89c7-9021c12ac0bb
                       PartitionLocationCacheLimit          : 100000
                       PartitionLocationCacheBucketCount    : 1024
                       ServiceChangePollInterval            : 00:02:00
                       ConnectionInitializationTimeout      : 00:00:02
                       KeepAliveInterval                    : 00:00:20
                       HealthOperationTimeout               : 00:02:00
                       HealthReportSendInterval             : 00:00:00
                       HealthReportRetrySendInterval        : 00:00:40
                       NotificationGatewayConnectionTimeout : 00:00:00
                       NotificationCacheUpdateTimeout       : 00:00:00
                       }
GatewayInformation   : {
                       NodeAddress                          : localhost:19000
                       NodeId                               : 1880ec88a3187766a6da323399721f53
                       NodeInstanceId                       : 130729063464981219
                       NodeName                             : Node.1
                       }

As with the API, reports can be sent with the -Immediate switch so that they are sent immediately, regardless of the HealthReportSendInterval value.

For REST, the reports are sent to the Service Fabric gateway, which has an internal fabric client. By default, this client is configured to send reports batched every 30 seconds. You can change the batch interval with the cluster configuration setting HttpGatewayHealthReportSendInterval on HttpGateway. As mentioned, a better option is to send the reports with Immediate set to true.


To ensure that unauthorized services can't report health against the entities in the cluster, configure the server to accept requests only from secured clients. The FabricClient used for reporting must have security enabled to be able to communicate with the cluster (for example, with Kerberos or certificate authentication). Read more about cluster security.

Report from within low privilege services

If Service Fabric services do not have admin access to the cluster, you can report health on entities from the current context through Partition or CodePackageActivationContext.


Internally, the Partition and the CodePackageActivationContext hold a health client configured with default settings. As explained for the health client, reports are batched and sent on a timer. The objects should be kept alive to have a chance to send the report.

You can specify HealthReportSendOptions when sending reports through the Partition and CodePackageActivationContext health APIs. If you have critical reports that must be sent as soon as possible, use HealthReportSendOptions with Immediate set to true. Immediate reports bypass the batching interval of the internal health client. As mentioned before, use this flag with care; we want to take advantage of the health client batching whenever possible.
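A rough sketch of reporting on the current replica through the partition object, assuming a stateful service that has access to its IStatefulServicePartition and a project that references System.Fabric. The source ID and property names are hypothetical:

```csharp
using System.Fabric;
using System.Fabric.Health;

public static class PartitionReportSketch
{
    // "partition" is the object the runtime hands to the service replica.
    public static void ReportOnReplica(IStatefulServicePartition partition, HealthState state)
    {
        // Hypothetical source ID and property, for illustration only.
        var info = new HealthInformation("MyServiceWatchdog", "ResourceDependency", state);

        // No admin privileges are needed: the report goes through the health
        // client held internally by the partition object.
        partition.ReportReplicaHealth(info, new HealthReportSendOptions { Immediate = true });
    }
}
```

For non-critical reports, the overload that takes only the HealthInformation keeps the default batching behavior.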

Design health reporting

The first step in generating high-quality reports is identifying the conditions that can impact the health of the service. Any condition that can help flag problems in the service or cluster when they start, or even better, before they happen, can potentially save billions of dollars. The benefits include less downtime, fewer night hours spent investigating and repairing issues, and higher customer satisfaction.

Once the conditions are identified, watchdog writers need to figure out the best way to monitor them, balancing overhead against usefulness. For example, consider a service that does complex calculations that use some temporary files on a share. A watchdog could monitor the share to ensure that enough space is available. It could listen for notifications of file or directory changes. It could report a warning if a predefined threshold is reached, and report an error if the share is full. On a warning, a repair system could start cleaning up older files on the share. On an error, a repair system could move the service replica to another node. Note how the condition states are described in terms of health: the state of the condition can be considered healthy (OK) or unhealthy (warning or error).
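The share example can be sketched as a small mapping from free space to a health state. This is only an illustration: the 80% warning threshold is an assumed value, and the local ShareHealth enum is a stand-in for System.Fabric.Health.HealthState so that the snippet compiles without the SDK:

```csharp
using System;

// Stand-in for System.Fabric.Health.HealthState (assumption, keeps the sketch SDK-free).
public enum ShareHealth { Ok, Warning, Error }

public static class ShareWatchdogSketch
{
    // Hypothetical threshold: warn once 80% of the share is used.
    private const double WarningUsedFraction = 0.80;

    public static ShareHealth Evaluate(long totalBytes, long freeBytes)
    {
        if (totalBytes <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(totalBytes));
        }

        // A full share is an error: the service can no longer write temporary files.
        if (freeBytes <= 0)
        {
            return ShareHealth.Error;
        }

        double usedFraction = 1.0 - (double)freeBytes / totalBytes;
        return usedFraction >= WarningUsedFraction ? ShareHealth.Warning : ShareHealth.Ok;
    }
}
```

With these assumptions, Evaluate(100, 50) yields Ok, Evaluate(100, 10) yields Warning, and Evaluate(100, 0) yields Error.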

Once the monitoring details are set, a watchdog writer needs to figure out how to implement the watchdog. If the conditions can be determined from within the service, the watchdog can be part of the monitored service itself. For example, the service code can check the share usage, and then report every time it tries to write a file. The advantage of this approach is that reporting is simple. Care must be taken to prevent watchdog bugs from impacting the service functionality.

Reporting from within the monitored service is not always an option. A watchdog within the service may not be able to detect the conditions. It may not have the logic or data to make the determination. The overhead of monitoring the conditions may be high. The conditions also may not be specific to a service, but instead affect interactions between services. Another option is to have watchdogs in the cluster as separate processes. The watchdogs monitor the conditions and report, without affecting the main services in any way. For example, these watchdogs could be implemented as stateless services in the same application, deployed on all nodes or on the same nodes as the service.

Sometimes, a watchdog running in the cluster is not an option either. If the monitored condition is the availability or functionality of the service as users see it, it's best to have the watchdogs in the same place as the user clients. There, they can test the operations in the same way users call them. For example, you can have a watchdog that lives outside the cluster, issues requests to the service, and checks the latency and correctness of the result. (For a calculator service, for example, does 2+2 return 4 in a reasonable amount of time?)

Once the watchdog details have been finalized, you should decide on a source ID that uniquely identifies it. If multiple watchdogs of the same type are living in the cluster, they must report on different entities, or, if they report on the same entity, use a different source ID or property. This way, their reports can coexist. The property of the health report should capture the monitored condition. (For the example above, the property could be ShareSize.) If multiple reports apply to the same condition, the property should contain some dynamic information that allows reports to coexist. For example, if multiple shares need to be monitored, the property name can be ShareSize-sharename.


Do not use the health store to keep status information. Only health-related information should be reported as health, as this information impacts the health evaluation of an entity. The health store was not designed as a general-purpose store. It uses health evaluation logic to aggregate all data into the health state. Sending information unrelated to health (like reporting status with a health state of OK) doesn't impact the aggregated health state, but it can negatively affect the performance of the health store.

The next decision point is which entity to report on. Most of the time, the condition clearly identifies the entity. Choose the entity with the best possible granularity. If a condition impacts all replicas in a partition, report on the partition, not on the service. There are corner cases where more thought is needed, though. If the condition impacts an entity, such as a replica, but the desire is to have the condition flagged for longer than the replica's lifetime, then it should be reported on the partition. Otherwise, when the replica is deleted, the health store cleans up all its reports. Watchdog writers must think about the lifetimes of the entity and the report. It must be clear when a report should be cleaned up from the store (for example, when an error reported on an entity no longer applies).

Let's look at an example that puts together the points described above. Consider a Service Fabric application composed of a master stateful persistent service and secondary stateless services deployed on all nodes (one secondary service type for each type of task). The master has a processing queue that contains commands to be executed by secondaries. The secondaries execute the incoming requests and send back acknowledgement signals. One condition that could be monitored is the length of the master processing queue. If the master queue length reaches a threshold, a warning is reported. The warning indicates that the secondaries can't handle the load. If the queue reaches the maximum length and commands are dropped, an error is reported, as the service can't recover. The reports can be on the property QueueStatus. The watchdog lives inside the service, and the report is sent on the master primary replica. The time to live is two minutes, and the report is sent periodically every 30 seconds. If the primary goes down, the report is cleaned up automatically from the store. If the service replica is up, but it is deadlocked or having other issues, the report expires in the health store. In this case, the entity is evaluated at error.
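The queue example could translate into a HealthInformation like the one in the following sketch. It assumes the Service Fabric SDK is referenced; the source ID and the two length limits are hypothetical, since the example gives no concrete numbers:

```csharp
using System;
using System.Fabric.Health;

public static class QueueStatusSketch
{
    // Hypothetical limits, for illustration only.
    private const int WarningLength = 1000;
    private const int MaxLength = 5000;

    public static HealthInformation Build(int queueLength)
    {
        HealthState state =
            queueLength >= MaxLength ? HealthState.Error :      // commands are dropped
            queueLength >= WarningLength ? HealthState.Warning : // secondaries can't keep up
            HealthState.Ok;

        return new HealthInformation("MasterQueueWatchdog", "QueueStatus", state)
        {
            // Sent every 30 seconds with a two-minute time to live.
            TimeToLive = TimeSpan.FromMinutes(2),
            // Keep the expired report so that a silent (for example, deadlocked)
            // reporter causes the entity to be evaluated at error.
            RemoveWhenExpired = false,
            Description = "Queue length: " + queueLength
        };
    }
}
```

Inside the primary replica, a 30-second timer callback could pass the result to IStatefulServicePartition.ReportReplicaHealth.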

Another condition that can be monitored is task execution time. The master distributes tasks to the secondaries based on the task type. Depending on the design, the master could poll the secondaries for task status. It could also wait for secondaries to send back acknowledgement signals when they are done. In the second case, care must be taken to detect situations where secondaries die or messages are lost. One option is for the master to send a ping request to the same secondary, which sends back its status. If no status is received, the master considers it a failure and reschedules the task. This behavior assumes that the tasks are idempotent.

The monitored condition can be translated as a warning if the task is not done within a certain time (t1, for example 10 minutes). If the task is still not completed after a longer time (t2, for example 20 minutes), the monitored condition can be translated as an error. This reporting can be done in multiple ways:

  • The master primary replica reports on itself periodically. You can have one property for all pending tasks in the queue. If at least one task takes longer than expected, the report status on the property PendingTasks is a warning or error, as appropriate. If there are no pending tasks or all tasks have started execution, the report status is OK. The tasks are persistent. If the primary goes down, the newly promoted primary can continue to report properly.
  • Another watchdog process (in the cloud or external) checks the tasks (from outside, based on the desired task result) to see if they are completed. If the tasks do not meet the thresholds, a report is sent on the master service. A report is also sent on each task that includes the task identifier, like PendingTask+taskId. Reports should be sent only on unhealthy states. Set the time to live to a few minutes, and mark the reports to be removed when they expire to ensure cleanup.
  • The secondary that is executing a task reports when the task takes longer than expected to run. It reports on the service instance on the property PendingTasks. The report pinpoints the service instance that has issues, but it doesn't capture the situation where the instance dies, because the reports are cleaned up at that point. Alternatively, it could report on the secondary service. If the secondary completes the task, the secondary instance clears the report from the store. The report doesn't capture the situation where the acknowledgement message is lost and the task is not finished from the master's point of view.
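Whichever reporter is chosen, the mapping from elapsed time to health state is the same. A minimal sketch, using the example values t1 = 10 minutes and t2 = 20 minutes; the TaskHealth enum is a local stand-in for System.Fabric.Health.HealthState so that the snippet has no SDK dependency:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for System.Fabric.Health.HealthState (assumption, keeps the sketch SDK-free).
public enum TaskHealth { Ok = 0, Warning = 1, Error = 2 }

public static class PendingTaskSketch
{
    private static readonly TimeSpan T1 = TimeSpan.FromMinutes(10);
    private static readonly TimeSpan T2 = TimeSpan.FromMinutes(20);

    // Health of a single task based on how long it has been pending.
    public static TaskHealth Evaluate(TimeSpan elapsed)
    {
        if (elapsed >= T2) return TaskHealth.Error;
        if (elapsed >= T1) return TaskHealth.Warning;
        return TaskHealth.Ok;
    }

    // Worst state across all pending tasks, suitable for a single PendingTasks property.
    public static TaskHealth Aggregate(IEnumerable<TimeSpan> elapsedTimes)
    {
        TaskHealth worst = TaskHealth.Ok;
        foreach (TimeSpan elapsed in elapsedTimes)
        {
            TaskHealth current = Evaluate(elapsed);
            if (current > worst) worst = current;
        }
        return worst;
    }
}
```

An empty task list aggregates to Ok, which matches the first option above, where the report status is OK when there are no pending tasks.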

However the reporting is done in the cases described above, the reports are captured in application health when health is evaluated.

Report periodically vs. on transition

By using the health reporting model, watchdogs can send reports periodically or on transitions. Periodic reporting is the recommended approach for watchdogs, because the code is much simpler and less prone to errors. The watchdogs must strive to be as simple as possible to avoid bugs that trigger incorrect reports. Incorrect unhealthy reports impact health evaluations and scenarios based on health, including upgrades. Incorrect healthy reports hide issues in the cluster, which is not desired.

For periodic reporting, the watchdog can be implemented with a timer. On a timer callback, the watchdog can check the state and send a report based on the current state. There is no need to see which report was sent previously or make any optimizations in terms of messaging. The health client has batching logic to help with performance. While the health client is kept alive, it retries internally until the report is acknowledged by the health store or the watchdog generates a newer report with the same entity, property, and source.

Reporting on transitions requires careful handling of state. The watchdog monitors some conditions and reports only when the conditions change. The upside of this approach is that fewer reports are needed. The downside is that the logic of the watchdog is complex. The watchdog must maintain the conditions or the reports, so that they can be inspected to determine state changes. On failover, care must be taken with reports that were added but not yet sent to the health store. The sequence number must be ever-increasing. If not, the reports are rejected as stale. In the rare cases where data loss is incurred, synchronization may be needed between the state of the reporter and the state of the health store.
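The extra state that transition reporting needs can be illustrated with a small tracker. This is a sketch under stated assumptions, not the health client's own logic: the class and enum names are hypothetical, the tracker is not thread-safe, and it derives sequence numbers from the UTC clock, which only stays increasing across restarts if the clock does not move backward:

```csharp
using System;

// Stand-in for System.Fabric.Health.HealthState (assumption, keeps the sketch SDK-free).
public enum ObservedHealth { Ok, Warning, Error }

public sealed class TransitionReporterSketch
{
    private ObservedHealth? lastReported; // null until the first report
    private long lastSequenceNumber;

    // Returns true with a strictly increasing sequence number when the state
    // changed since the last report; returns false when nothing should be sent.
    public bool TryGetReport(ObservedHealth current, out long sequenceNumber)
    {
        if (lastReported == current)
        {
            sequenceNumber = 0;
            return false;
        }

        // Clock ticks give a large, growing base; the +1 guard keeps numbers
        // increasing even if two transitions land on the same tick.
        lastSequenceNumber = Math.Max(lastSequenceNumber + 1, DateTime.UtcNow.Ticks);
        lastReported = current;
        sequenceNumber = lastSequenceNumber;
        return true;
    }
}
```

The returned number could be assigned to HealthInformation.SequenceNumber before the report is sent. A periodic reporter needs none of this bookkeeping, which is the main argument for the periodic approach.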

Reporting on transitions makes sense for services reporting on themselves, through Partition or CodePackageActivationContext. When the local object (replica or deployed service package / deployed application) is removed, all its reports are also removed. This automatic cleanup relaxes the need for synchronization between the reporter and the health store. If the report is for the parent partition or parent application, care must be taken on failover to avoid stale reports in the health store. Logic must be added to maintain the correct state and clear the report from the store when it is no longer needed.

Implement health reporting

Once the entity and report details are clear, sending health reports can be done through the API, PowerShell, or REST.


To report through the API, you need to create a health report specific to the entity type you want to report on. Then give the report to a health client. Alternatively, create a HealthInformation object and pass it to the correct reporting methods on Partition or CodePackageActivationContext to report on current entities.

The following example shows periodic reporting from a watchdog within the cluster. The watchdog checks whether an external resource can be accessed from within a node. The resource is needed by a service manifest within the application. If the resource is unavailable, the other services within the application can still function properly. Therefore, the report is sent on the deployed service package entity every 30 seconds.

private static Uri ApplicationName = new Uri("fabric:/WordCount");
private static string ServiceManifestName = "WordCount.Service";
private static string NodeName = FabricRuntime.GetNodeContext().NodeName;
private static Timer ReportTimer = new Timer(new TimerCallback(SendReport), null, 30 * 1000, 30 * 1000);
private static FabricClient Client = new FabricClient(new FabricClientSettings() { HealthReportSendInterval = TimeSpan.FromSeconds(0) });

public static void SendReport(object obj)
{
    // Test whether the resource can be accessed from the node (helper not shown)
    HealthState healthState = TestConnectivityToExternalResource();

    // Send report on deployed service package, as the connectivity is needed by the specific service manifest
    // and can be different on different nodes
    var deployedServicePackageHealthReport = new DeployedServicePackageHealthReport(
        ApplicationName,
        ServiceManifestName,
        NodeName,
        new HealthInformation("ExternalSourceWatcher", "Connectivity", healthState));

    // TODO: handle exception. Code omitted for snippet brevity.
    // Possible exceptions: FabricException with error codes
    // FabricHealthStaleReport (non-retryable, the report is already queued on the health client),
    // FabricHealthMaxReportsReached (retryable; user should retry with exponential delay until the report is accepted).
    Client.HealthManager.ReportHealth(deployedServicePackageHealthReport);
}


In PowerShell, send health reports with Send-ServiceFabricEntityTypeHealthReport, where EntityType is the type of entity to report on (for example, Send-ServiceFabricNodeHealthReport).

The following example shows periodic reporting on CPU values on a node. The reports should be sent every 30 seconds, and they have a time to live of two minutes. If they expire, the reporter has issues, so the node is evaluated at error. When the CPU is above a threshold, the report has a health state of warning. When the CPU remains above the threshold for more than the configured time, it's reported as an error. Otherwise, the reporter sends a health state of OK.

PS C:\> Send-ServiceFabricNodeHealthReport -NodeName Node.1 -HealthState Warning -SourceId PowershellWatcher -HealthProperty CPU -Description "CPU is above 80% threshold" -TimeToLiveSec 120

PS C:\> Get-ServiceFabricNodeHealth -NodeName Node.1
NodeName              : Node.1
AggregatedHealthState : Warning
UnhealthyEvaluations  :
                        Unhealthy event: SourceId='PowershellWatcher', Property='CPU', HealthState='Warning', ConsiderWarningAsError=false.

HealthEvents          :
                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 5
                        SentAt                : 4/21/2015 8:01:17 AM
                        ReceivedAt            : 4/21/2015 8:02:12 AM
                        TTL                   : Infinite
                        Description           : Fabric node is up.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : ->Ok = 4/21/2015 8:02:12 AM

                        SourceId              : PowershellWatcher
                        Property              : CPU
                        HealthState           : Warning
                        SequenceNumber        : 130741236814913394
                        SentAt                : 4/21/2015 9:01:21 PM
                        ReceivedAt            : 4/21/2015 9:01:21 PM
                        TTL                   : 00:02:00
                        Description           : CPU is above 80% threshold
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : ->Warning = 4/21/2015 9:01:21 PM
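
The Ok/Warning/Error decision described for the CPU watchdog can be sketched as a small function plus a reporting loop. This is a minimal sketch, not the documented implementation: Get-CpuHealthState and Start-CpuWatchdog are hypothetical helper names, the 80% threshold and five-minute limit are assumed values, and the performance counter path assumes a Windows host. It also assumes Connect-ServiceFabricCluster has already been run in the session.

```powershell
# Hypothetical helper (not part of the Service Fabric module):
# maps one CPU sample to the health state the watchdog should report.
function Get-CpuHealthState {
    param(
        [double]   $CpuPercent,                                  # latest CPU sample, in percent
        [double]   $Threshold = 80,                              # assumed threshold
        $AboveSince = $null,                                     # when CPU first exceeded the threshold, or $null
        [timespan] $MaxTimeAbove = [timespan]::FromMinutes(5),   # assumed "configured time"
        [datetime] $Now = (Get-Date)
    )
    if ($CpuPercent -le $Threshold) { return 'Ok' }
    if ($null -ne $AboveSince -and ($Now - $AboveSince) -gt $MaxTimeAbove) { return 'Error' }
    return 'Warning'
}

# Hypothetical reporting loop: samples CPU and reports every 30 seconds with a
# two-minute time to live, so a stalled reporter lets the report expire.
function Start-CpuWatchdog {
    param(
        [string] $NodeName,
        [double] $Threshold = 80
    )
    $aboveSince = $null
    while ($true) {
        # Assumption: Windows host exposing this performance counter.
        $sample = Get-Counter '\Processor(_Total)\% Processor Time'
        $cpu = $sample.CounterSamples[0].CookedValue

        # Remember when the CPU first crossed the threshold; reset once it recovers.
        if ($cpu -gt $Threshold) {
            if ($null -eq $aboveSince) { $aboveSince = Get-Date }
        } else {
            $aboveSince = $null
        }

        $state = Get-CpuHealthState -CpuPercent $cpu -Threshold $Threshold -AboveSince $aboveSince
        $description = 'CPU is at {0:N0}% (threshold {1}%)' -f $cpu, $Threshold

        Send-ServiceFabricNodeHealthReport -NodeName $NodeName -HealthState $state `
            -SourceId PowershellWatcher -HealthProperty CPU `
            -Description $description -TimeToLiveSec 120

        Start-Sleep -Seconds 30
    }
}
```

Calling Start-CpuWatchdog -NodeName Node.1 would start the loop. Keeping the state decision in its own function means the thresholds can be checked without a cluster.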

The following example reports a transient warning on a replica. It first gets the partition ID and then the replica ID of the primary replica of the service it is interested in. It then sends a report from PowershellWatcher on the property ResourceDependency. The report is of interest for only two minutes, so it is sent with -RemoveWhenExpired and is removed from the store automatically when it expires.

PS C:\> $partitionId = (Get-ServiceFabricPartition -ServiceName fabric:/WordCount/WordCount.Service).PartitionId

PS C:\> $replicaId = (Get-ServiceFabricReplica -PartitionId $partitionId | where {$_.ReplicaRole -eq "Primary"}).ReplicaId

PS C:\> Send-ServiceFabricReplicaHealthReport -PartitionId $partitionId -ReplicaId $replicaId -HealthState Warning -SourceId PowershellWatcher -HealthProperty ResourceDependency -Description "The external resource that the primary is using has been rebooted at 4/21/2015 9:01:21 PM. Expect processing delays for a few minutes." -TimeToLiveSec 120 -RemoveWhenExpired

PS C:\> Get-ServiceFabricReplicaHealth  -PartitionId $partitionId -ReplicaOrInstanceId $replicaId

PartitionId           : 8f82daff-eb68-4fd9-b631-7a37629e08c0
ReplicaId             : 130740415594605869
AggregatedHealthState : Warning
UnhealthyEvaluations  :
                        Unhealthy event: SourceId='PowershellWatcher', Property='ResourceDependency', HealthState='Warning', ConsiderWarningAsError=false.

HealthEvents          :
                        SourceId              : System.RA
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 130740768777734943
                        SentAt                : 4/21/2015 8:01:17 AM
                        ReceivedAt            : 4/21/2015 8:02:12 AM
                        TTL                   : Infinite
                        Description           : Replica has been created.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : ->Ok = 4/21/2015 8:02:12 AM

                        SourceId              : PowershellWatcher
                        Property              : ResourceDependency
                        HealthState           : Warning
                        SequenceNumber        : 130741243777723555
                        SentAt                : 4/21/2015 9:12:57 PM
                        ReceivedAt            : 4/21/2015 9:12:57 PM
                        TTL                   : 00:02:00
                        Description           : The external resource that the primary is using has been rebooted at 4/21/2015 9:01:21 PM. Expect processing delays for a few minutes.
                        RemoveWhenExpired     : True
                        IsExpired             : False
                        Transitions           : ->Warning = 4/21/2015 9:12:32 PM


Send health reports using REST with POST requests that go to the desired entity and carry the health report description in the body. For example, see how to send REST cluster health reports or service health reports. All entities are supported.
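
As an illustration of the REST approach, the following sketch builds and posts a node health report. The helper names are hypothetical. The endpoint http://localhost:19080 assumes an unsecured local cluster, and the URI shape, api-version value, and body field names reflect my understanding of the Service Fabric REST API; verify them against the REST reference before relying on them. No time to live is set, so the report would not expire.

```powershell
# Hypothetical helper: builds the URI and JSON body for a node health report.
function New-NodeHealthReportRequest {
    param(
        [string] $ClusterEndpoint = 'http://localhost:19080',   # assumed unsecured local cluster
        [string] $NodeName,
        [string] $SourceId,
        [string] $Property,
        [ValidateSet('Ok', 'Warning', 'Error')]
        [string] $HealthState,
        [string] $Description = ''
    )
    $escapedNode = [uri]::EscapeDataString($NodeName)
    # Single quotes keep the literal "$" in the path from being treated as a variable.
    $uri = '{0}/Nodes/{1}/$/ReportHealth?api-version=6.0' -f $ClusterEndpoint, $escapedNode

    # Assumed body field names for the health report description.
    $body = @{
        SourceId    = $SourceId
        Property    = $Property
        HealthState = $HealthState
        Description = $Description
    } | ConvertTo-Json

    return @{ Uri = $uri; Body = $body }
}

# Hypothetical helper: posts the prepared request to the cluster.
function Send-NodeHealthReportViaRest {
    param([hashtable] $Request)
    Invoke-RestMethod -Method Post -Uri $Request.Uri -Body $Request.Body -ContentType 'application/json'
}
```

A request built with New-NodeHealthReportRequest -NodeName Node.1 -SourceId PowershellWatcher -Property CPU -HealthState Warning can be inspected first and then passed to Send-NodeHealthReportViaRest. Secured clusters would additionally need a client certificate or token, which this sketch leaves out.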

Next steps

Based on the health data, service writers and cluster/application administrators can think of ways to consume the information. For example, they can set up alerts based on health status to catch severe issues before they provoke outages. Administrators can also set up repair systems to fix issues automatically.

Introduction to Service Fabric health monitoring

View Service Fabric health reports

How to report and check service health

Use system health reports for troubleshooting

Monitor and diagnose services locally

Service Fabric application upgrade