Introduction to Service Fabric health monitoring

Azure Service Fabric introduces a health model that provides rich, flexible, and extensible health evaluation and reporting. The model allows near-real-time monitoring of the state of the cluster and the services running in it. You can easily obtain health information and correct potential issues before they cascade and cause massive outages. In the typical model, services send reports based on their local views, and that information is aggregated to provide an overall cluster-level view.

Service Fabric components use this rich health model to report their current state. You can use the same mechanism to report health from your applications. If you invest in high-quality health reporting that captures your custom conditions, you can detect and fix issues for your running application much more easily.

Note

We started the health subsystem to address a need for monitored upgrades. Service Fabric provides monitored application and cluster upgrades that ensure full availability, no downtime, and minimal to no user intervention. To achieve these goals, the upgrade checks health based on configured upgrade policies. An upgrade can proceed only when health respects the desired thresholds. Otherwise, the upgrade is either automatically rolled back or paused to give administrators a chance to fix the issues. To learn more about application upgrades, see this article.

Health store

The health store keeps health-related information about entities in the cluster for easy retrieval and evaluation. It is implemented as a Service Fabric persisted stateful service to ensure high availability and scalability. The health store is part of the fabric:/System application, and it is available when the cluster is up and running.

Health entities and hierarchy

The health entities are organized in a logical hierarchy that captures interactions and dependencies among different entities. The health store automatically builds health entities and hierarchy based on reports received from Service Fabric components.

The health entities mirror the Service Fabric entities. (For example, a health application entity matches an application instance deployed in the cluster, while a health node entity matches a Service Fabric cluster node.) The health hierarchy captures the interactions of the system entities, and it is the basis for advanced health evaluation. You can learn about key Service Fabric concepts in the Service Fabric technical overview. For more on applications, see the Service Fabric application model.

The health entities and hierarchy allow the cluster and applications to be effectively reported, debugged, and monitored. The health model provides an accurate, granular representation of the health of the many moving pieces in the cluster.

The health entities are organized in a hierarchy based on parent-child relationships.

The health entities are:

  • Cluster. Represents the health of a Service Fabric cluster. Cluster health reports describe conditions that affect the entire cluster. These conditions affect multiple entities in the cluster or the cluster itself. Based on the condition, the reporter can't narrow the issue down to one or more unhealthy children. Examples include a cluster split-brain caused by network partitioning or communication issues.
  • Node. Represents the health of a Service Fabric node. Node health reports describe conditions that affect the node functionality. They typically affect all the deployed entities running on it. Examples include a node that is out of disk space (or other machine-wide resources, such as memory or connections) and a node that is down. The node entity is identified by the node name (string).
  • Application. Represents the health of an application instance running in the cluster. Application health reports describe conditions that affect the overall health of the application. They can't be narrowed down to individual children (services or deployed applications). Examples include the end-to-end interaction among different services in the application. The application entity is identified by the application name (URI).
  • Service. Represents the health of a service running in the cluster. Service health reports describe conditions that affect the overall health of the service. The reporter can't narrow down the issue to an unhealthy partition or replica. Examples include a service configuration (such as a port or an external file share) that is causing issues for all partitions. The service entity is identified by the service name (URI).
  • Partition. Represents the health of a service partition. Partition health reports describe conditions that affect the entire replica set. Examples include when the number of replicas is below the target count and when a partition is in quorum loss. The partition entity is identified by the partition ID (GUID).
  • Replica. Represents the health of a stateful service replica or a stateless service instance. The replica is the smallest unit that watchdogs and system components can report on for an application. For stateful services, examples include a primary replica that can't replicate operations to secondaries and slow replication. A stateless instance can also report when it is running out of resources or has connectivity issues. The replica entity is identified by the partition ID (GUID) and the replica or instance ID (long).
  • DeployedApplication. Represents the health of an application running on a node. Deployed application health reports describe conditions specific to the application on the node that can't be narrowed down to service packages deployed on the same node. Examples include an application package that can't be downloaded on that node and issues setting up application security principals on the node. The deployed application is identified by application name (URI) and node name (string).
  • DeployedServicePackage. Represents the health of a service package running on a node in the cluster. It describes conditions specific to a service package that do not affect the other service packages on the same node for the same application. Examples include a code package in the service package that cannot be started and a configuration package that cannot be read. The deployed service package is identified by application name (URI), node name (string), service manifest name (string), and service package activation ID (string).

The granularity of the health model makes it easy to detect and correct issues. For example, if a service is not responding, it is feasible to report that the application instance is unhealthy. Reporting at that level is not ideal, however, because the issue might not be affecting all the services within that application. The report should be applied to the unhealthy service or to a specific child partition, if more information points to that partition. The data automatically surfaces through the hierarchy, and an unhealthy partition is made visible at the service and application levels. This aggregation helps to pinpoint and resolve the root cause of the issue more quickly.

The health hierarchy is composed of parent-child relationships. A cluster is composed of nodes and applications. Applications have services and deployed applications. Deployed applications have deployed service packages. Services have partitions, and each partition has one or more replicas. There is a special relationship between nodes and deployed entities. An unhealthy node, as reported by its authority system component (the Failover Manager service), affects the deployed applications, service packages, and replicas deployed on it.

The health hierarchy represents the latest state of the system based on the latest health reports, which is almost real-time information. Internal and external watchdogs can report on the same entities based on application-specific logic or custom monitored conditions. User reports coexist with the system reports.

Plan how to report and respond to health during the design of a large cloud service. This up-front investment makes the service easier to debug, monitor, and operate.

Health states

Service Fabric uses three health states to describe whether an entity is healthy or not: OK, warning, and error. Any report sent to the health store must specify one of these states. The health evaluation result is one of these states.

The possible health states are:

  • OK. The entity is healthy. There are no known issues reported on it or its children (when applicable).
  • Warning. The entity has some issues, but it can still function correctly. For example, there are delays, but they do not cause any functional issues yet. In some cases, the warning condition may fix itself without external intervention. In these cases, health reports raise awareness and provide visibility into what is going on. In other cases, the warning condition may degrade into a severe problem without user intervention.
  • Error. The entity is unhealthy. Action should be taken to fix the state of the entity, because it can't function properly.
  • Unknown. The entity doesn't exist in the health store. This result can be obtained from distributed queries that merge results from multiple system components. For example, the get node list query goes to FailoverManager, ClusterManager, and HealthManager; the get application list query goes to ClusterManager and HealthManager. If another system component returns an entity that is not present in the health store, the merged result has the unknown health state. An entity is not in the store because its health reports have not yet been processed or because the entity has been cleaned up after deletion.
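
To make the Unknown case concrete, here is a minimal Python sketch of a merged query. It is an illustration only: `merge_node_list` and the dictionary shapes are hypothetical and are not part of any Service Fabric API.

```python
def merge_node_list(system_node_names, health_store_states):
    """Combine node names returned by other system components with the
    health states known to the health store.

    Nodes that the health store has no entry for get the "Unknown" state.
    """
    return {
        name: health_store_states.get(name, "Unknown")
        for name in system_node_names
    }


# Hypothetical data: the health store has not processed reports for Node03 yet.
merged = merge_node_list(
    ["Node01", "Node02", "Node03"],
    {"Node01": "Ok", "Node02": "Warning"},
)
print(merged)
```

In this sketch, Node03 is known to the other components but not yet to the health store, so it is the only node reported as Unknown.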

Health policies

The health store applies health policies to determine whether an entity is healthy based on its reports and its children.

Note

Health policies can be specified in the cluster manifest (for cluster and node health evaluation) or in the application manifest (for application evaluation and any of its children). Health evaluation requests can also pass in custom health evaluation policies, which are used only for that evaluation.

By default, Service Fabric applies strict rules (everything must be healthy) for the parent-child hierarchical relationship. If even one of the children has one unhealthy event, the parent is considered unhealthy.

Cluster health policy

The cluster health policy is used to evaluate the cluster health state and node health states. The policy can be defined in the cluster manifest. If it is not present, the default policy (zero tolerated failures) is used. The cluster health policy contains:

  • ConsiderWarningAsError. Specifies whether to treat warning health reports as errors during health evaluation. Default: false.
  • MaxPercentUnhealthyApplications. Specifies the maximum tolerated percentage of applications that can be unhealthy before the cluster is considered in error.
  • MaxPercentUnhealthyNodes. Specifies the maximum tolerated percentage of nodes that can be unhealthy before the cluster is considered in error. In large clusters, some nodes are always down or out for repairs, so this percentage should be configured to tolerate that.
  • ApplicationTypeHealthPolicyMap. The application type health policy map can be used during cluster health evaluation to describe special application types. By default, all applications are put into a pool and evaluated with MaxPercentUnhealthyApplications. If some application types should be treated differently, they can be taken out of the global pool. Instead, they are evaluated against the percentages associated with their application type name in the map. For example, in a cluster there are thousands of applications of different types, and a few control application instances of a special application type. The control applications should never be in error. You can set the global MaxPercentUnhealthyApplications to 20% to tolerate some failures, but for the application type "ControlApplicationType" set MaxPercentUnhealthyApplications to 0. This way, if some of the many applications are unhealthy, but below the global unhealthy percentage, the cluster is evaluated as Warning. A warning health state does not impact cluster upgrade or other monitoring triggered by the Error health state. But even one control application in error makes the cluster unhealthy, which triggers a rollback or pauses the cluster upgrade, depending on the upgrade configuration. For the application types defined in the map, all application instances are taken out of the global pool of applications. They are evaluated based on the total number of applications of the application type, using the specific MaxPercentUnhealthyApplications from the map. All the rest of the applications remain in the global pool and are evaluated with MaxPercentUnhealthyApplications.

The following example is an excerpt from a cluster manifest. To define entries in the application type map, prefix the parameter name with "ApplicationTypeMaxPercentUnhealthyApplications-", followed by the application type name.

<FabricSettings>
  <Section Name="HealthManager/ClusterHealthPolicy">
    <Parameter Name="ConsiderWarningAsError" Value="False" />
    <Parameter Name="MaxPercentUnhealthyApplications" Value="20" />
    <Parameter Name="MaxPercentUnhealthyNodes" Value="20" />
    <Parameter Name="ApplicationTypeMaxPercentUnhealthyApplications-ControlApplicationType" Value="0" />
  </Section>
</FabricSettings>
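
The pooling rule described for ApplicationTypeHealthPolicyMap can be sketched in Python. This is not Service Fabric code: the function name `evaluate_applications`, the tuple layout, and the direct percentage comparison (without any rounding) are assumptions made for this illustration, and warnings on individual applications are ignored.

```python
from collections import defaultdict


def evaluate_applications(apps, global_max_percent, type_map):
    """Evaluate application health for a cluster.

    apps: list of (application_type_name, is_unhealthy) tuples.
    type_map: {application_type_name: max percent unhealthy for that type}.
    Returns "Ok", "Warning", or "Error".
    """
    pools = defaultdict(list)
    for app_type, is_unhealthy in apps:
        # Types listed in the map get their own pool; all others share
        # the global pool (keyed by None here).
        key = app_type if app_type in type_map else None
        pools[key].append(is_unhealthy)

    any_unhealthy = False
    for key, flags in pools.items():
        limit = global_max_percent if key is None else type_map[key]
        unhealthy = sum(flags)
        any_unhealthy = any_unhealthy or unhealthy > 0
        # Integer form of: unhealthy / total > limit / 100.
        if unhealthy * 100 > limit * len(flags):
            return "Error"
    return "Warning" if any_unhealthy else "Ok"


# Values mirror the manifest excerpt: 20% globally, 0% for control applications.
policy_map = {"ControlApplicationType": 0}
regular = [("RegularType", False)] * 9 + [("RegularType", True)]
control_ok = [("ControlApplicationType", False)] * 2
control_bad = [("ControlApplicationType", False), ("ControlApplicationType", True)]

print(evaluate_applications(regular + control_ok, 20, policy_map))
print(evaluate_applications(regular + control_bad, 20, policy_map))
```

The first call has one unhealthy regular application out of ten, which stays under the global limit. The second call adds one unhealthy control application, which exceeds the 0 percent limit for its own pool.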

Application health policy

The application health policy describes how the evaluation of events and child-states aggregation is done for applications and their children. It can be defined in the application manifest, ApplicationManifest.xml, in the application package. If no policies are specified, Service Fabric assumes that the entity is unhealthy if it has a health report or a child at the warning or error health state. The configurable policies are:

  • ConsiderWarningAsError. Specifies whether to treat warning health reports as errors during health evaluation. Default: false.
  • MaxPercentUnhealthyDeployedApplications. Specifies the maximum tolerated percentage of deployed applications that can be unhealthy before the application is considered in error. This percentage is calculated by dividing the number of unhealthy deployed applications by the number of nodes that the applications are currently deployed on in the cluster. The computation rounds up to tolerate one failure on small numbers of nodes. Default percentage: zero.
  • DefaultServiceTypeHealthPolicy. Specifies the default service type health policy, which replaces the default health policy for all service types in the application.
  • ServiceTypeHealthPolicyMap. Provides a map of service health policies per service type. These policies replace the default service type health policies for each specified service type. For example, if an application has a stateless gateway service type and a stateful engine service type, you can configure the health policies for their evaluation differently. When you specify a policy per service type, you gain more granular control of the health of the service.
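
The rounding rule for MaxPercentUnhealthyDeployedApplications can be sketched in Python as follows. This is an illustration under an assumption: the text says only that the computation rounds up, so the sketch interprets that as rounding up the number of tolerated failures. The function names are made up for this example.

```python
def max_tolerated_unhealthy(total, max_percent):
    """Number of unhealthy children tolerated, rounded up (assumed rule)."""
    # Integer ceiling of total * max_percent / 100, avoiding float error.
    return -(-total * max_percent // 100)


def deployed_applications_in_error(unhealthy, total_nodes, max_percent):
    """True when more deployed applications are unhealthy than tolerated."""
    return unhealthy > max_tolerated_unhealthy(total_nodes, max_percent)


print(max_tolerated_unhealthy(3, 20))
print(deployed_applications_in_error(1, 3, 20))
print(deployed_applications_in_error(2, 3, 20))
```

With three nodes and a 20 percent limit, the tolerated count rounds up from 0.6 to 1, so a single unhealthy deployed application does not put the application in error, while two do.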

Service type health policy

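The service type health policy specifies how to evaluate and aggregate the services and the children of services. The policy contains:

  • MaxPercentUnhealthyPartitionsPerService. Specifies the maximum tolerated percentage of unhealthy partitions before a service is considered unhealthy. Default percentage: zero.
  • MaxPercentUnhealthyReplicasPerPartition. Specifies the maximum tolerated percentage of unhealthy replicas before a partition is considered unhealthy. Default percentage: zero.
  • MaxPercentUnhealthyServices. Specifies the maximum tolerated percentage of unhealthy services before the application is considered unhealthy. Default percentage: zero.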

The following example is an excerpt from an application manifest:

    <Policies>
        <HealthPolicy ConsiderWarningAsError="true" MaxPercentUnhealthyDeployedApplications="20">
            <DefaultServiceTypeHealthPolicy
                   MaxPercentUnhealthyServices="0"
                   MaxPercentUnhealthyPartitionsPerService="10"
                   MaxPercentUnhealthyReplicasPerPartition="0"/>
            <ServiceTypeHealthPolicy ServiceTypeName="FrontEndServiceType"
                   MaxPercentUnhealthyServices="0"
                   MaxPercentUnhealthyPartitionsPerService="20"
                   MaxPercentUnhealthyReplicasPerPartition="0"/>
            <ServiceTypeHealthPolicy ServiceTypeName="BackEndServiceType"
                   MaxPercentUnhealthyServices="20"
                   MaxPercentUnhealthyPartitionsPerService="0"
                   MaxPercentUnhealthyReplicasPerPartition="0">
            </ServiceTypeHealthPolicy>
        </HealthPolicy>
    </Policies>

Health evaluation

Users and automated services can evaluate health for any entity at any time. To evaluate an entity's health, the health store aggregates all health reports on the entity and evaluates all its children (when applicable). The health aggregation algorithm uses health policies that specify how to evaluate health reports and how to aggregate child health states (when applicable).

Health report aggregation

One entity can have multiple health reports sent by different reporters (system components or watchdogs) on different properties. The aggregation uses the associated health policies, in particular the ConsiderWarningAsError member of the application or cluster health policy. ConsiderWarningAsError specifies how to evaluate warnings.

The aggregated health state is determined by the worst health report on the entity. If there is at least one error health report, the aggregated health state is error.

A health entity that has one or more error health reports is evaluated as Error. The same is true for an expired health report, regardless of its health state.

If there are no error reports and one or more warnings, the aggregated health state is either warning or error, depending on the ConsiderWarningAsError policy flag.

Health report aggregation with warning report and ConsiderWarningAsError set to false (default).
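
The report aggregation rules in this section can be summarized in a short Python sketch. It is a simplified model, not the health store's implementation: the `HealthEvent` class and `aggregate_reports` function are hypothetical names, and treating an entity with no reports as OK is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HealthEvent:
    state: str                # "Ok", "Warning", or "Error"
    is_expired: bool = False  # expired report that is still kept in the store


def aggregate_reports(events, consider_warning_as_error=False):
    """Return the aggregated health state for the reports on one entity."""
    has_warning = False
    for event in events:
        # An expired report counts as an error regardless of its own state.
        if event.is_expired or event.state == "Error":
            return "Error"
        if event.state == "Warning":
            has_warning = True
    if has_warning:
        # The policy flag decides whether warnings escalate to error.
        return "Error" if consider_warning_as_error else "Warning"
    return "Ok"


reports = [HealthEvent("Ok"), HealthEvent("Warning")]
print(aggregate_reports(reports))
print(aggregate_reports(reports, consider_warning_as_error=True))
```

The two calls show the effect of the ConsiderWarningAsError flag on the same set of reports.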

Child health aggregation

The aggregated health state of an entity reflects the child health states (when applicable). The algorithm for aggregating child health states uses the health policies applicable based on the entity type.

Child aggregation based on health policies.

After the health store has evaluated all the children, it aggregates their health states based on the configured maximum percentage of unhealthy children. This percentage is taken from the policy based on the entity and child type.

  • If all children have OK states, the child aggregated health state is OK.
  • If children have both OK and warning states, the child aggregated health state is warning.
  • If there are children with error states that do not respect the maximum allowed percentage of unhealthy children, the aggregated parent health state is error.
  • If the children with error states respect the maximum allowed percentage of unhealthy children, the aggregated parent health state is warning.
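
The four rules above translate into a small function. The Python sketch below is illustrative only: `aggregate_children` is a hypothetical name, the percentage is compared directly without rounding (the text does not state a rounding rule here), and a parent with no children is assumed to be OK.

```python
def aggregate_children(child_states, max_percent_unhealthy=0):
    """Aggregate already-evaluated child states ("Ok", "Warning", "Error")."""
    if not child_states:
        return "Ok"
    errors = sum(1 for state in child_states if state == "Error")
    if errors:
        # Integer form of: errors / total > max_percent_unhealthy / 100.
        if errors * 100 > max_percent_unhealthy * len(child_states):
            return "Error"
        # Errors are present but within the tolerated percentage.
        return "Warning"
    if any(state == "Warning" for state in child_states):
        return "Warning"
    return "Ok"


children = ["Ok"] * 9 + ["Error"]
print(aggregate_children(children, max_percent_unhealthy=10))
print(aggregate_children(children, max_percent_unhealthy=5))
```

With one child in error out of ten, a 10 percent limit is respected and a 5 percent limit is not, so the two calls land on different parent states.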

Health reporting

System components, System Fabric applications, and internal/external watchdogs can report against Service Fabric entities. The reporters make local determinations of the health of the monitored entities, based on the conditions they are monitoring. They don't need to look at any global state or aggregate data. The desired behavior is to have simple reporters, and not complex organisms that need to look at many things to infer what information to send.

To send health data to the health store, a reporter needs to identify the affected entity and create a health report. To send the report, use the FabricClient.HealthClient.ReportHealth API, report health APIs exposed on the Partition or CodePackageActivationContext objects, PowerShell cmdlets, or REST.

Health reports

The health reports for each of the entities in the cluster contain the following information:

  • SourceId. A string that uniquely identifies the reporter of the health event.

  • Entity identifier. Identifies the entity where the report is applied. It differs based on the entity type:

    • Cluster. None.
    • Node. Node name (string).
    • Application. Application name (URI). Represents the name of the application instance deployed in the cluster.
    • Service. Service name (URI). Represents the name of the service instance deployed in the cluster.
    • Partition. Partition ID (GUID). Represents the partition unique identifier.
    • Replica. The stateful service replica ID or the stateless service instance ID (INT64).
    • DeployedApplication. Application name (URI) and node name (string).
    • DeployedServicePackage. Application name (URI), node name (string), and service manifest name (string).
  • Property. A string (not a fixed enumeration) that allows the reporter to categorize the health event for a specific property of the entity. For example, reporter A can report the health of the Node01 "Storage" property and reporter B can report the health of the Node01 "Connectivity" property. In the health store, these reports are treated as separate health events for the Node01 entity.

  • Description. A string that allows a reporter to provide detailed information about the health event. SourceId, Property, and HealthState should fully describe the report. The description adds human-readable information about the report. The text makes it easier for administrators and users to understand the health report.

  • HealthState. An enumeration that describes the health state of the report. The accepted values are OK, Warning, and Error.

  • TimeToLive. A timespan that indicates how long the health report is valid. Coupled with RemoveWhenExpired, it lets the health store know how to evaluate expired events. By default, the value is infinite, and the report is valid forever.

  • RemoveWhenExpired. A boolean. If set to true, the expired health report is automatically removed from the health store, and the report doesn't impact entity health evaluation. Use this setting when the report is valid for a specified period of time only, and the reporter doesn't need to explicitly clear it out. It's also used to delete reports from the health store (for example, when a watchdog is changed and stops sending reports with the previous source and property). The watchdog can send a report with a brief TimeToLive along with RemoveWhenExpired to clear up any previous state from the health store. If the value is set to false, the expired report is treated as an error on the health evaluation. The false value signals to the health store that the source should report periodically on this property. If it doesn't, then there must be something wrong with the watchdog. The watchdog's health is captured by considering the event as an error.

  • SequenceNumber. An ever-increasing positive integer that represents the order of the reports. It is used by the health store to detect stale reports that are received late because of network delays or other issues. A report is rejected if the sequence number is less than or equal to the most recently applied number for the same entity, source, and property. If it is not specified, the sequence number is generated automatically. It is necessary to put in the sequence number only when reporting on state transitions. In this situation, the source needs to remember which reports it sent and keep the information for recovery on failover.

Four pieces of information (SourceId, entity identifier, Property, and HealthState) are required for every health report. The SourceId string is not allowed to start with the prefix "System.", which is reserved for system reports. For the same entity, there is only one report for the same source and property. Multiple reports for the same source and property override each other, either on the health client side (if they are batched) or on the health store side. The replacement is based on sequence numbers; newer reports (with higher sequence numbers) replace older reports.
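
The override and staleness rules can be modeled with a dictionary keyed by entity, source, and property. The Python sketch below is a toy model for user reports only, not the real health store or client API: `HealthStoreSketch` and its methods are hypothetical names.

```python
class HealthStoreSketch:
    """Toy model of how reports replace each other by sequence number."""

    VALID_STATES = ("Ok", "Warning", "Error")

    def __init__(self):
        # (entity_id, source_id, property) -> (sequence_number, health_state)
        self._reports = {}

    def submit(self, entity_id, source_id, prop, health_state, sequence_number):
        """Apply a user report; return False if it is rejected as stale."""
        if source_id.startswith("System."):
            raise ValueError("the 'System.' prefix is reserved for system reports")
        if health_state not in self.VALID_STATES:
            raise ValueError("health state must be Ok, Warning, or Error")
        key = (entity_id, source_id, prop)
        current = self._reports.get(key)
        # Reject when the number is not greater than the last applied one.
        if current is not None and sequence_number <= current[0]:
            return False
        self._reports[key] = (sequence_number, health_state)
        return True

    def reports_for(self, entity_id):
        """Return {(source_id, property): health_state} for one entity."""
        return {
            (source, prop): state
            for (entity, source, prop), (_, state) in self._reports.items()
            if entity == entity_id
        }


store = HealthStoreSketch()
store.submit("Node01", "WatchdogA", "Storage", "Warning", 2)
store.submit("Node01", "WatchdogA", "Storage", "Ok", 1)        # stale, rejected
store.submit("Node01", "WatchdogB", "Connectivity", "Ok", 1)   # separate event
print(store.reports_for("Node01"))
```

The late report with sequence number 1 does not overwrite the newer Storage report, while the Connectivity report from a different source is kept as a separate event on the same entity.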

Health events

Internally, the health store keeps health events, which contain all the information from the reports, and additional metadata. The metadata includes the time the report was given to the health client and the time it was modified on the server side. The health events are returned by health queries.

The added metadata contains:

  • SourceUtcTimestampSourceUtcTimestamp. 报告提供给运行状况客户端的时间(协调世界时)。The time the report was given to the health client (Coordinated Universal Time).
  • LastModifiedUtcTimestampLastModifiedUtcTimestamp. 上次在服务器端修改报告的时间(协调世界时)。The time the report was last modified on the server side (Coordinated Universal Time).
  • IsExpiredIsExpired. 用于指示运行状况存储执行查询时报告是否过期的标志。A flag to indicate whether the report was expired when the query was executed by the health store. 仅当 RemoveWhenExpired 为 false 时,事件才可能过期。An event can be expired only if RemoveWhenExpired is false. 否则,查询不会返回事件,并会将其从存储中删除。Otherwise, the event is not returned by query and is removed from the store.
  • LastOkTransitionAtLastWarningTransitionAtLastErrorTransitionAtLastOkTransitionAt, LastWarningTransitionAt, LastErrorTransitionAt. 上一次“正常”/“警告”/“错误”转换的时间。The last time for OK/warning/error transitions. 这些字段提供事件健康状况转换的历史记录。These fields give the history of the health state transitions for the event.
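The interaction between IsExpired and RemoveWhenExpired can be illustrated with a short sketch. The StoredEvent type and query_event function are hypothetical, and measuring the time to live from the last server-side modification is an assumption made for illustration; the text above does not specify the reference point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StoredEvent:
    """Hypothetical stored event; field names are illustrative only."""
    source_id: str
    prop: str
    health_state: str
    last_modified_utc: datetime
    time_to_live: Optional[timedelta]  # None stands for an infinite TTL
    remove_when_expired: bool


def query_event(event: StoredEvent, now: datetime) -> Optional[dict]:
    """Return the view a query would see, or None if the event was removed."""
    # Assumption: the TTL is counted from the last modification time.
    expired = (
        event.time_to_live is not None
        and now - event.last_modified_utc > event.time_to_live
    )
    if expired and event.remove_when_expired:
        # Expired events flagged for removal are not returned by queries.
        return None
    return {
        "source_id": event.source_id,
        "property": event.prop,
        "health_state": event.health_state,
        "is_expired": expired,
    }
```

Under these assumptions, an event can come back with is_expired set to True only when remove_when_expired is False, which matches the rule stated for IsExpired.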

The state transition fields can be used for smarter alerts or "historical" health event information. They enable scenarios such as:

  • Alert when a property has been at warning/error for more than X minutes. Checking the condition over a period of time avoids alerts on temporary conditions. For example, an alert if the health state has been warning for more than five minutes can be expressed as (HealthState == Warning and Now - LastWarningTransitionAt > 5 minutes).
  • Alert only on conditions that have changed in the last X minutes. If a report was already at error before the specified time, it can be ignored because it was already signaled previously.
  • If a property is toggling between warning and error, determine how long it has been unhealthy (that is, not OK). For example, an alert if the property hasn't been healthy for more than five minutes can be expressed as (HealthState != Ok and Now - LastOkTransitionAt > 5 minutes).
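The three alert scenarios above can be written as simple predicates. This is a minimal sketch over a hypothetical EventView type whose fields mirror the transition metadata. The function names are illustrative, and the second predicate assumes that the error transition time marks when the current error state began.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EventView:
    """Hypothetical view of a health event returned by a query."""
    health_state: str  # "Ok", "Warning", or "Error"
    last_ok_transition_at: datetime
    last_warning_transition_at: datetime
    last_error_transition_at: datetime


def warning_for_longer_than(event: EventView, now: datetime, threshold: timedelta) -> bool:
    # HealthState == Warning and Now - LastWarningTransitionAt > threshold
    return (
        event.health_state == "Warning"
        and now - event.last_warning_transition_at > threshold
    )


def error_changed_within(event: EventView, now: datetime, window: timedelta) -> bool:
    # Alert only if the event entered the error state inside the window;
    # older errors are assumed to have been signaled already.
    return (
        event.health_state == "Error"
        and now - event.last_error_transition_at <= window
    )


def unhealthy_for_longer_than(event: EventView, now: datetime, threshold: timedelta) -> bool:
    # HealthState != Ok and Now - LastOkTransitionAt > threshold
    return (
        event.health_state != "Ok"
        and now - event.last_ok_transition_at > threshold
    )
```

A monitoring loop could evaluate these predicates against each event returned by a health query and raise an alert only when one of them is true, which filters out short-lived conditions.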

Example: Report and evaluate application health

The following example sends a health report through PowerShell on the application fabric:/WordCount from the source MyWatchdog. The health report contains information about the health property "Availability" in an error health state, with infinite TimeToLive. The example then queries the application health, which returns an aggregated health state of Error and includes the reported health event in the list of health events.

PS C:\> Send-ServiceFabricApplicationHealthReport -ApplicationName fabric:/WordCount -SourceId "MyWatchdog" -HealthProperty "Availability" -HealthState Error

PS C:\> Get-ServiceFabricApplicationHealth fabric:/WordCount -ExcludeHealthStatistics

ApplicationName                 : fabric:/WordCount
AggregatedHealthState           : Error
UnhealthyEvaluations            :
                                  Error event: SourceId='MyWatchdog', Property='Availability'.

ServiceHealthStates             :
                                  ServiceName           : fabric:/WordCount/WordCountService
                                  AggregatedHealthState : Error

                                  ServiceName           : fabric:/WordCount/WordCountWebService
                                  AggregatedHealthState : Ok

DeployedApplicationHealthStates :
                                  ApplicationName       : fabric:/WordCount
                                  NodeName              : _Node_0
                                  AggregatedHealthState : Ok

                                  ApplicationName       : fabric:/WordCount
                                  NodeName              : _Node_2
                                  AggregatedHealthState : Ok

                                  ApplicationName       : fabric:/WordCount
                                  NodeName              : _Node_3
                                  AggregatedHealthState : Ok

                                  ApplicationName       : fabric:/WordCount
                                  NodeName              : _Node_4
                                  AggregatedHealthState : Ok

                                  ApplicationName       : fabric:/WordCount
                                  NodeName              : _Node_1
                                  AggregatedHealthState : Ok

HealthEvents                    :
                                  SourceId              : System.CM
                                  Property              : State
                                  HealthState           : Ok
                                  SequenceNumber        : 360
                                  SentAt                : 3/22/2016 7:56:53 PM
                                  ReceivedAt            : 3/22/2016 7:56:53 PM
                                  TTL                   : Infinite
                                  Description           : Application has been created.
                                  RemoveWhenExpired     : False
                                  IsExpired             : False
                                  Transitions           : Error->Ok = 3/22/2016 7:56:53 PM, LastWarning = 1/1/0001 12:00:00 AM

                                  SourceId              : MyWatchdog
                                  Property              : Availability
                                  HealthState           : Error
                                  SequenceNumber        : 131032204762818013
                                  SentAt                : 3/23/2016 3:27:56 PM
                                  ReceivedAt            : 3/23/2016 3:27:56 PM
                                  TTL                   : Infinite
                                  Description           :
                                  RemoveWhenExpired     : False
                                  IsExpired             : False
                                  Transitions           : Ok->Error = 3/23/2016 3:27:56 PM, LastWarning = 1/1/0001 12:00:00 AM

Health model usage

The health model allows cloud services and the underlying Service Fabric platform to scale, because monitoring and health determinations are distributed among the different monitors within the cluster. Other systems have a single, centralized service at the cluster level that parses all the potentially useful information emitted by services. This approach hinders their scalability. It also prevents them from collecting specific information that would help identify issues and potential issues as close to the root cause as possible.

The health model is used heavily for monitoring and diagnosis, for evaluating cluster and application health, and for monitored upgrades. Other services use health data to perform automatic repairs, build cluster health history, and issue alerts on certain conditions.

Next steps

View Service Fabric health reports

Use system health reports for troubleshooting

How to report and check service health

Add custom Service Fabric health reports

Monitor and diagnose services locally

Service Fabric application upgrade