Reliable Services lifecycle overview

When you're thinking about the lifecycles of Azure Service Fabric Reliable Services, the basics of the lifecycle are the most important. In general, the lifecycle includes the following:

  • During startup:
    • Services are constructed.
    • The services have an opportunity to construct and return zero or more listeners.
    • Any returned listeners are opened, allowing communication with the service.
    • The service's RunAsync method is called, allowing the service to do long-running tasks or background work.
  • During shutdown:
    • The cancellation token passed to RunAsync is canceled, and the listeners are closed.
    • After the listeners close, the service object itself is destructed.

There are details around the exact ordering of these events. The order of events can change slightly depending on whether the Reliable Service is stateless or stateful. In addition, for stateful services, we must deal with the Primary swap scenario. During this sequence, the role of Primary is transferred to another replica (or comes back) without the service shutting down. Finally, we must think about error or failure conditions.

Stateless service startup

The lifecycle of a stateless service is straightforward. Here's the order of events:

  1. The service is constructed.
  2. Then, in parallel, two things happen:
    • StatelessService.CreateServiceInstanceListeners() is invoked and any returned listeners are opened. ICommunicationListener.OpenAsync() is called on each listener.
    • The service's StatelessService.RunAsync() method is called.
  3. If present, the service's StatelessService.OnOpenAsync() method is called. This call is an uncommon override, but it is available. Extended service initialization tasks can be started at this time.

Keep in mind that there is no ordering between the calls to create and open the listeners and the call to RunAsync. The listeners can open before RunAsync is started. Similarly, RunAsync can be invoked before the communication listeners are open or even constructed. If any synchronization is required, it is left to the implementer. Here are some common solutions:

  • Sometimes listeners can't function until some other information is created or work is done. For stateless services, that work can usually be done in other locations, such as the following:
    • In the service's constructor.
    • During the CreateServiceInstanceListeners() call.
    • As a part of the construction of the listener itself.
  • Sometimes the code in RunAsync must not start until the listeners are open. In this case, additional coordination is necessary. One common solution is a flag within the listeners that indicates when they have finished opening. This flag is then checked in RunAsync before continuing to actual work.
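The flag-based coordination in the last bullet can be sketched as follows. This is a minimal Python asyncio model, not Service Fabric code: `SketchListener`, `open_async`, `run_async`, and `start_service` are hypothetical names chosen for illustration, and the "flag" is modeled as an `asyncio.Event`.

```python
import asyncio


class SketchListener:
    """Hypothetical stand-in for a communication listener (not the Service Fabric API)."""

    def __init__(self) -> None:
        # The "flag" from the text: set once the listener has finished opening.
        self.opened = asyncio.Event()

    async def open_async(self) -> str:
        await asyncio.sleep(0.01)  # simulate binding to an endpoint
        self.opened.set()
        return "sketch://endpoint"  # made-up address, for illustration only


async def run_async(listener: SketchListener, log: list) -> None:
    # RunAsync may start before the listener is open, so wait on the flag first.
    await listener.opened.wait()
    log.append("run_async: listener is open, starting real work")


async def start_service() -> list:
    log: list = []
    listener = SketchListener()
    # The runtime gives no ordering guarantee, so both are started concurrently.
    await asyncio.gather(run_async(listener, log), listener.open_async())
    return log


if __name__ == "__main__":
    print(asyncio.run(start_service()))
```

Waiting on an event rather than polling a boolean keeps the background task idle until the listener signals it. In a real service the same idea would be expressed with whatever signaling primitive the service's language provides.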

Stateless service shutdown

For shutting down a stateless service, the same pattern is followed, just in reverse:

  1. In parallel:
    • Any open listeners are closed. ICommunicationListener.CloseAsync() is called on each listener.
    • The cancellation token passed to RunAsync() is canceled. A check of the cancellation token's IsCancellationRequested property returns true, and if called, the token's ThrowIfCancellationRequested method throws an OperationCanceledException.
  2. After CloseAsync() finishes on each listener and RunAsync() also finishes, the service's StatelessService.OnCloseAsync() method is called, if present. OnCloseAsync is called when the stateless service instance is going to be gracefully shut down. This can occur when the service's code is being upgraded, the service instance is being moved due to load balancing, or a transient fault is detected. It is uncommon to override StatelessService.OnCloseAsync(), but it can be used to safely close resources, stop background processing, finish saving external state, or close down existing connections.
  3. After StatelessService.OnCloseAsync() finishes, the service object is destructed.
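The shutdown ordering above can be modeled in a few lines. The sketch below is an illustrative Python asyncio simulation under stated assumptions: `SketchCancellationToken`, `OperationCanceled`, `close_listener`, and `shutdown_demo` are hypothetical stand-ins, not the .NET types they loosely mirror.

```python
import asyncio


class OperationCanceled(Exception):
    """Stand-in for OperationCanceledException in this sketch."""


class SketchCancellationToken:
    """Hypothetical, simplified model of a cancellation token."""

    def __init__(self) -> None:
        self.is_cancellation_requested = False

    def cancel(self) -> None:
        self.is_cancellation_requested = True

    def throw_if_cancellation_requested(self) -> None:
        if self.is_cancellation_requested:
            raise OperationCanceled()


async def run_async(token: SketchCancellationToken, events: list) -> None:
    try:
        while True:
            # Check the token between small units of work so cancellation is prompt.
            token.throw_if_cancellation_requested()
            await asyncio.sleep(0.001)
    except OperationCanceled:
        # Caught here only so the sketch can record when the loop stopped.
        events.append("run_async finished")


async def close_listener(events: list) -> None:
    await asyncio.sleep(0.001)  # simulate draining connections
    events.append("listener closed")


async def shutdown_demo() -> list:
    events: list = []
    token = SketchCancellationToken()
    worker = asyncio.create_task(run_async(token, events))
    await asyncio.sleep(0.005)  # the service runs for a moment
    # Step 1: cancel the token and close the listeners in parallel.
    token.cancel()
    await asyncio.gather(close_listener(events), worker)
    # Step 2: the close hook runs only after both have finished.
    events.append("on_close_async")
    return events


if __name__ == "__main__":
    print(asyncio.run(shutdown_demo()))
```

The order of the first two recorded events is not fixed, which matches the "in parallel" wording; only the close hook is guaranteed to come last.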

Stateful service startup

Stateful services have a similar pattern to stateless services, with a few changes. For starting up a stateful service, the order of events is as follows:

  1. The service is constructed.

  2. StatefulServiceBase.OnOpenAsync() is called. This call is not commonly overridden in the service.

  3. The following things happen in parallel:

    • StatefulServiceBase.CreateServiceReplicaListeners() is invoked.
      • If the replica is a Primary, all returned listeners are opened. ICommunicationListener.OpenAsync() is called on each listener.
      • If the replica is a Secondary, only those listeners marked as ListenOnSecondary = true are opened. Having listeners that are open on secondaries is less common.
    • If the replica is currently a Primary, the service's StatefulServiceBase.RunAsync() method is called.
  4. After all the replica listeners' OpenAsync() calls finish and RunAsync() is called, StatefulServiceBase.OnChangeRoleAsync() is called. This call is not commonly overridden in the service.
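The rule in step 3 for which listeners get opened can be written as a small selection function. This is a sketch in Python rather than Service Fabric code: `SketchReplicaListener` and `listeners_to_open` are hypothetical names, and `listen_on_secondary` simply models the ListenOnSecondary setting mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchReplicaListener:
    """Hypothetical model of a replica listener description."""

    name: str
    listen_on_secondary: bool = False  # models ListenOnSecondary = true


def listeners_to_open(is_primary: bool, listeners: list) -> list:
    """Return the listeners that would be opened for a replica in the given role."""
    if is_primary:
        # A Primary opens every listener that was returned.
        return list(listeners)
    # A Secondary opens only the listeners explicitly marked for secondaries.
    return [item for item in listeners if item.listen_on_secondary]


if __name__ == "__main__":
    declared = [
        SketchReplicaListener("read-write"),
        SketchReplicaListener("read-only", listen_on_secondary=True),
    ]
    print([item.name for item in listeners_to_open(False, declared)])
```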

    Similar to stateless services, there's no coordination between the order in which the listeners are created and opened and when RunAsync is called. If you need coordination, the solutions are much the same. There is one additional case for stateful services. Say that the calls that arrive at the communication listeners require information kept inside some Reliable Collections.

    Note

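    Because communication listeners can open before the Reliable Collections are readable or writable, and before RunAsync can start, some additional coordination is necessary. The simplest and most common solution is for the communication listeners to return an error code that tells the client to retry the request.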
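The "return an error and let the client retry" approach from the note can be sketched as follows. Everything here is an assumption made for illustration: `SketchReplica`, `handle_request`, `call_with_retry`, the `state_ready` flag, and the use of status 503 as the retryable error code are not taken from Service Fabric.

```python
from dataclasses import dataclass, field

RETRY_LATER = 503  # assumed status code meaning "not ready yet, try again"


@dataclass
class SketchReplica:
    """Hypothetical replica whose listener can open before its state is usable."""

    state_ready: bool = False  # flips to True once the collections are readable
    store: dict = field(default_factory=dict)

    def handle_request(self, key: str) -> tuple:
        if not self.state_ready:
            # Listener is open but state is not: ask the client to retry.
            return (RETRY_LATER, None)
        return (200, self.store.get(key))


def call_with_retry(replica: SketchReplica, key: str, max_attempts: int, on_retry=None) -> tuple:
    """Client-side loop: retry while the replica reports it is not ready."""
    status, body = RETRY_LATER, None
    for attempt in range(1, max_attempts + 1):
        status, body = replica.handle_request(key)
        if status != RETRY_LATER:
            return (status, body, attempt)
        if on_retry is not None:
            on_retry(attempt)  # a real client would back off here
    return (status, body, max_attempts)


if __name__ == "__main__":
    replica = SketchReplica(store={"a": 1})

    def become_ready(attempt: int) -> None:
        if attempt == 2:
            replica.state_ready = True

    print(call_with_retry(replica, "a", max_attempts=5, on_retry=become_ready))
```

A real client would normally wait between attempts, for example with exponential backoff, rather than retrying immediately.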

Stateful service shutdown

Like stateless services, the lifecycle events during shutdown are the same as during startup, but reversed. When a stateful service is being shut down, the following events occur:

  1. In parallel:

    • Any open listeners are closed. ICommunicationListener.CloseAsync() is called on each listener.
    • The cancellation token passed to RunAsync() is canceled. A check of the cancellation token's IsCancellationRequested property returns true, and if called, the token's ThrowIfCancellationRequested method throws an OperationCanceledException.
  2. After CloseAsync() finishes on each listener and RunAsync() also finishes, the service's StatefulServiceBase.OnChangeRoleAsync() is called. This call is not commonly overridden in the service.

    Note

    Waiting for RunAsync to finish is necessary only if this replica is a Primary replica.

  3. After the StatefulServiceBase.OnChangeRoleAsync() method finishes, the StatefulServiceBase.OnCloseAsync() method is called. This call is an uncommon override, but it is available.

  4. After StatefulServiceBase.OnCloseAsync() finishes, the service object is destructed.

Stateful service Primary swaps

While a stateful service is running, only the Primary replicas of that stateful service have their communication listeners opened and their RunAsync method called. Secondary replicas are constructed, but see no further calls. While a stateful service is running, the replica that's currently the Primary can change as a result of fault or cluster balancing optimization. What does this mean in terms of the lifecycle events that a replica can see? The behavior the stateful replica sees depends on whether it is the replica being demoted or promoted during the swap.

For the Primary that's demoted

For the Primary replica that's demoted, Service Fabric needs this replica to stop processing messages and quit any background work it is doing. As a result, this step looks like a service shutdown. One difference is that the service isn't destructed or closed, because it remains as a Secondary. The following APIs are called:

  1. In parallel:
    • Any open listeners are closed. ICommunicationListener.CloseAsync() is called on each listener.
    • The cancellation token passed to RunAsync() is canceled. A check of the cancellation token's IsCancellationRequested property returns true, and if called, the token's ThrowIfCancellationRequested method throws an OperationCanceledException.
  2. After CloseAsync() finishes on each listener and RunAsync() also finishes, the service's StatefulServiceBase.OnChangeRoleAsync() is called. This call is not commonly overridden in the service.

For the Secondary that's promoted

Similarly, Service Fabric needs the Secondary replica that's promoted to start listening for messages on the wire and start any background tasks it needs to complete. As a result, this process looks like service creation, except that the replica itself already exists. The following APIs are called:

  1. In parallel:
    • StatefulServiceBase.CreateServiceReplicaListeners() is invoked and any returned listeners are opened. ICommunicationListener.OpenAsync() is called on each listener.
    • The service's StatefulServiceBase.RunAsync() method is called.
  2. After all the replica listeners' OpenAsync() calls finish and RunAsync() is called, StatefulServiceBase.OnChangeRoleAsync() is called. This call is not commonly overridden in the service.
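The two call sequences in this section, demotion and promotion, can be summarized as a small trace model. This is only a bookkeeping sketch in Python: `Role` and `SwapTrace` are hypothetical, and the step labels are plain strings that echo the API names in the lists, not calls into any library.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "Primary"
    SECONDARY = "Secondary"


class SwapTrace:
    """Records the order of lifecycle calls a replica would see (illustrative only)."""

    def __init__(self, role: Role) -> None:
        self.role = role
        self.steps: list = []

    def demote(self) -> None:
        if self.role is not Role.PRIMARY:
            raise ValueError("only a Primary can be demoted")
        # Step 1: these two happen in parallel, so they share one step.
        self.steps.append(("CloseAsync on listeners", "cancel RunAsync token"))
        # Step 2: runs only after the listeners close and RunAsync returns.
        self.steps.append(("OnChangeRoleAsync",))
        self.role = Role.SECONDARY

    def promote(self) -> None:
        if self.role is not Role.SECONDARY:
            raise ValueError("only a Secondary can be promoted")
        # Step 1: listeners are created and opened while RunAsync is started.
        self.steps.append(("CreateServiceReplicaListeners + OpenAsync", "RunAsync"))
        # Step 2: the role change notification follows.
        self.steps.append(("OnChangeRoleAsync",))
        self.role = Role.PRIMARY


if __name__ == "__main__":
    trace = SwapTrace(Role.PRIMARY)
    trace.demote()
    trace.promote()
    for step in trace.steps:
        print(" | ".join(step))
```

Note that the service object is never constructed or destructed in either direction, which is the main difference from the startup and shutdown sequences.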

Common issues during stateful service shutdown and Primary demotion

Service Fabric changes the Primary of a stateful service for a variety of reasons. The most common are cluster rebalancing and application upgrade. During these operations (as well as during normal service shutdown, such as when the service is deleted), it is important that the service respect the CancellationToken.

Services that do not handle cancellation cleanly can experience several issues. Operations such as upgrades and rebalancing become slow because Service Fabric waits for the services to stop gracefully. This can ultimately lead to failed upgrades that time out and roll back. Failure to honor the cancellation token can also cause imbalanced clusters. Clusters become unbalanced because nodes get hot, but the services can't be rebalanced because it takes too long to move them elsewhere.

Because the services are stateful, it is also likely that they use the Reliable Collections. In Service Fabric, when a Primary is demoted, one of the first things that happens is that write access to the underlying state is revoked. This leads to a second set of issues that can affect the service lifecycle. The collections return exceptions based on timing and on whether the replica is being moved or shut down. These exceptions should be handled correctly. Exceptions thrown by Service Fabric fall into permanent (FabricException) and transient (FabricTransientException) categories. Permanent exceptions should be logged and rethrown, while transient exceptions can be retried based on some retry logic.
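The permanent-versus-transient handling described above might look like the following. This is a hedged Python sketch, not Service Fabric code: `SketchFabricError` and `SketchFabricTransientError` are hypothetical stand-ins for FabricException and FabricTransientException, and `with_retry`, its attempt limit, and its fixed delay are illustrative choices.

```python
import logging
import time


class SketchFabricError(Exception):
    """Stand-in for a permanent Service Fabric exception (hypothetical)."""


class SketchFabricTransientError(SketchFabricError):
    """Stand-in for a transient exception that is safe to retry (hypothetical)."""


def with_retry(operation, max_attempts: int = 3, delay_seconds: float = 0.0):
    """Run operation, retrying transient failures and rethrowing permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SketchFabricTransientError:
            # Transient: retry unless the attempt budget is exhausted.
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
        except SketchFabricError as exc:
            # Permanent: log and rethrow without retrying.
            logging.error("permanent failure: %s", exc)
            raise


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_write():
        calls["count"] += 1
        if calls["count"] < 3:
            raise SketchFabricTransientError("state temporarily unavailable")
        return "committed"

    print(with_retry(flaky_write, max_attempts=5))
```

The transient handler is listed first because, in this sketch, the transient type derives from the permanent one. A production retry loop would typically also check the cancellation token between attempts so that retries do not delay shutdown or demotion.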

Handling the exceptions that come from use of the Reliable Collections in conjunction with service lifecycle events is an important part of testing and validating a Reliable Service. We recommend that you always run your service under load while performing upgrades and chaos testing before deploying to production. These basic steps help ensure that your service is correctly implemented and handles lifecycle events correctly.

Notes on the service lifecycle

  • Both the RunAsync() method and the CreateServiceReplicaListeners/CreateServiceInstanceListeners calls are optional. A service can have one of them, both, or neither. For example, if the service does all its work in response to user calls, there is no need for it to implement RunAsync(). Only the communication listeners and their associated code are necessary. Similarly, creating and returning communication listeners is optional, as the service can have only background work to do, and so only needs to implement RunAsync().
  • It is valid for a service to complete RunAsync() successfully and return from it. Completing is not a failure condition. Completing RunAsync() indicates that the background work of the service has finished. For stateful reliable services, RunAsync() is called again if the replica is demoted from Primary to Secondary and then promoted back to Primary.
  • If a service exits from RunAsync() by throwing some unexpected exception, this constitutes a failure. The service object is shut down and a health error is reported.
  • Although there is no time limit on returning from these methods, you immediately lose the ability to write to Reliable Collections, and therefore cannot complete any real work. We recommend that you return as quickly as possible upon receiving the cancellation request. If your service does not respond to these API calls in a reasonable amount of time, Service Fabric can forcibly terminate your service. Usually this only happens during application upgrades or when a service is being deleted. This timeout is 15 minutes by default.
  • Failures in the OnCloseAsync() path result in OnAbort() being called, which is a last-chance, best-effort opportunity for the service to clean up and release any resources that it has claimed. This is generally called when a permanent fault is detected on the node, or when Service Fabric cannot reliably manage the service instance's lifecycle due to internal failures.
  • OnChangeRoleAsync() is called when the stateful service replica is changing role, for example to Primary or Secondary. Primary replicas are given write status (are allowed to create and write to Reliable Collections). Secondary replicas are given read status (can only read from existing Reliable Collections). Most work in a stateful service is performed at the Primary replica. Secondary replicas can perform read-only validation, report generation, data mining, or other read-only jobs.

Next steps