Reliable Services lifecycle

Reliable Services is one of the programming models available in Azure Service Fabric. When learning about the lifecycle of Reliable Services, it's most important to understand the basic lifecycle events. The exact ordering of events depends on configuration details.

In general, the Reliable Services lifecycle includes the following events:

  • During startup:
    • Services are constructed.
    • Services have an opportunity to construct and return zero or more listeners.
    • Any returned listeners are opened, for communication with the service.
    • The service's runAsync method is called, so the service can do long-running or background work.
  • During shutdown:
    • The cancellation token that was passed to runAsync is canceled, and the listeners are closed.
    • The service object itself is destructed.

The order of events in Reliable Services might change slightly depending on whether the reliable service is stateless or stateful.

Also, for stateful services, you must address the primary swap scenario. During this sequence, the role of primary is transferred to another replica (or comes back) without the service shutting down.

Finally, you have to think about error or failure conditions.

Stateless service startup

The lifecycle of a stateless service is fairly straightforward. Here's the order of events:

  1. The service is constructed.
  2. These events occur in parallel:
    • StatelessService.createServiceInstanceListeners() is invoked, and any returned listeners are opened. CommunicationListener.openAsync() is called on each listener.
    • The service's runAsync method (StatelessService.runAsync()) is called.
  3. If present, the service's own onOpenAsync method is called. Specifically, StatelessService.onOpenAsync() is called. This is an uncommon override, but it is available.

It's important to note that there is no ordering between the call to create and open the listeners and the call to runAsync. The listeners might open before runAsync is started. Similarly, runAsync might be invoked before the communication listeners are open, or before they have even been constructed. If any synchronization is required, it must be done by the implementer. Here are some common solutions:

  • Sometimes listeners can't function until other information is created or other work is done. For stateless services, that work usually can be done in the service's constructor. It can also be done during the createServiceInstanceListeners() call, or as part of the construction of the listener itself.
  • Sometimes the code in runAsync shouldn't start until the listeners are open. In this case, additional coordination is necessary. A common solution is to add a flag in the listeners. The flag indicates when the listeners have finished opening. The runAsync method checks this flag before continuing with the actual work.
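
The flag approach in the second bullet can be sketched with plain Java concurrency primitives. This is a minimal illustration, not Service Fabric code: ListenerReadyGate, onListenerOpened, and the returned message are hypothetical names, and a CompletableFuture stands in for the flag.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a service; it does not use the Service Fabric SDK.
class ListenerReadyGate {
    // The "flag": completed once the listener has finished opening.
    private final CompletableFuture<Void> listenersOpened = new CompletableFuture<>();

    // Stand-in for the point where a listener's openAsync() has finished.
    void onListenerOpened() {
        listenersOpened.complete(null);
    }

    // Stand-in for runAsync(): the real work starts only after the flag is set.
    CompletableFuture<String> runAsync() {
        return listenersOpened.thenApply(ignored -> "work started after listeners opened");
    }

    public static void main(String[] args) throws Exception {
        ListenerReadyGate service = new ListenerReadyGate();
        // runAsync may be invoked before any listener is open.
        CompletableFuture<String> run = service.runAsync();
        System.out.println("runAsync finished before open? " + run.isDone());
        service.onListenerOpened();
        System.out.println(run.get(1, TimeUnit.SECONDS));
    }
}
```

Using a future rather than a boolean avoids polling: the work is chained onto the flag and starts as soon as the listener reports that it is open, regardless of which call arrives first.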

Stateless service shutdown

When shutting down a stateless service, the same pattern is followed, but in reverse:

  1. These events occur in parallel:
    • Any open listeners are closed. CommunicationListener.closeAsync() is called on each listener.
    • The cancellation token that was passed to runAsync() is canceled. From this point on, a check of the cancellation token's isCancelled property returns true, and the token's throwIfCancellationRequested method, if called, throws a CancellationException.
  2. When closeAsync() finishes on each listener and runAsync() also finishes, the service's StatelessService.onCloseAsync() method is called, if it's present. Again, this is not a common override, but it can be used to safely close resources, stop background processing, finish saving external state, or close down existing connections.
  3. After StatelessService.onCloseAsync() finishes, the service object is destructed.
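
Because shutdown waits for runAsync() to finish, the loop inside it needs to notice cancellation promptly. The sketch below shows one way to structure such a loop. It assumes a hypothetical Token class as a stand-in for the cancellation token that Service Fabric passes in; only the isCancelled check mirrors the text above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker; it does not use the Service Fabric SDK.
class CancellableWorker {
    // Hypothetical stand-in for the token passed to runAsync().
    static final class Token {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        void cancel() { cancelled.set(true); }
        boolean isCancelled() { return cancelled.get(); }
    }

    final AtomicInteger unitsDone = new AtomicInteger();

    // Does the background work in short units and checks the token between units.
    CompletableFuture<Void> runAsync(Token token) {
        return CompletableFuture.runAsync(() -> {
            while (!token.isCancelled()) {
                unitsDone.incrementAndGet(); // one short unit of work
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            // Falling out of the loop completes the future normally.
        });
    }

    public static void main(String[] args) throws Exception {
        CancellableWorker worker = new CancellableWorker();
        Token token = new Token();
        CompletableFuture<Void> run = worker.runAsync(token);
        Thread.sleep(50);
        token.cancel(); // what happens to the token during shutdown
        run.join();     // returns shortly after cancellation
        System.out.println("units done before stop: " + worker.unitsDone.get());
    }
}
```

Keeping each unit of work short bounds the delay between cancellation and the return from runAsync().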

Stateful service startup

Stateful services have a pattern that is similar to stateless services, with a few changes. Here's the order of events for starting a stateful service:

  1. The service is constructed.
  2. StatefulServiceBase.onOpenAsync() is called. This call is not commonly overridden in the service.
  3. These events occur in parallel:
    • StatefulServiceBase.createServiceReplicaListeners() is invoked.
      • If the service is a primary service, all returned listeners are opened. CommunicationListener.openAsync() is called on each listener.
      • If the service is a secondary service, only listeners marked as listenOnSecondary = true are opened. Having listeners that are open on secondaries is less common.
    • If the service is currently a primary, the service's StatefulServiceBase.runAsync() method is called.
  4. After all the replica listeners' openAsync() calls finish and runAsync() is called, StatefulServiceBase.onChangeRoleAsync() is called. This call is not commonly overridden in the service.

Similar to stateless services, in stateful services there's no coordination between the order in which the listeners are created and opened and when runAsync is called. If you need coordination, the solutions are much the same. But there's one additional case for stateful services. Say that the calls that arrive at the communication listeners require information kept inside some Reliable Collections. Because the communication listeners might open before the Reliable Collections are readable or writeable, and before runAsync starts, some additional coordination is necessary. The simplest and most common solution is for the communication listeners to return an error code. The client uses the error code to know to retry the request.
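
The error-code approach might look like the following sketch. Everything in it is an assumption made for illustration: RetryUntilReadyHandler, markStateReady, and the 200/503 codes are hypothetical, and the point at which a real service decides that its state is accessible is not shown.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical request handler; it does not use the Service Fabric SDK.
class RetryUntilReadyHandler {
    static final int OK = 200;
    static final int RETRY_LATER = 503; // illustrative choice of a retryable code

    // Set once the service has confirmed that its state can be accessed.
    private final AtomicBoolean stateReady = new AtomicBoolean(false);

    void markStateReady() {
        stateReady.set(true);
    }

    // Stand-in for the code that a listener runs for each incoming call.
    int handleRequest(String request) {
        if (!stateReady.get()) {
            return RETRY_LATER; // ask the client to retry instead of touching state
        }
        return OK; // real code would read or write its state here
    }

    // Client side: retry while the service asks for it, up to maxAttempts.
    static int callWithRetry(RetryUntilReadyHandler handler, String request, int maxAttempts) {
        int code = RETRY_LATER;
        for (int attempt = 0; attempt < maxAttempts && code == RETRY_LATER; attempt++) {
            code = handler.handleRequest(request);
        }
        return code;
    }

    public static void main(String[] args) {
        RetryUntilReadyHandler handler = new RetryUntilReadyHandler();
        System.out.println(callWithRetry(handler, "get-item", 3)); // prints 503
        handler.markStateReady();
        System.out.println(callWithRetry(handler, "get-item", 3)); // prints 200
    }
}
```

A real client would normally wait between attempts rather than retry immediately.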

Stateful service shutdown

Like stateless services, the lifecycle events during shutdown are the same as during startup, but reversed. When a stateful service is being shut down, the following events occur:

  1. These events occur in parallel:

    • Any open listeners are closed. CommunicationListener.closeAsync() is called on each listener.
    • The cancellation token that was passed to runAsync() is canceled. A call to the cancellation token's isCancelled() method returns true, and if called, the token's throwIfCancellationRequested() method throws an OperationCanceledException.
  2. After closeAsync() finishes on each listener and runAsync() also finishes, the service's StatefulServiceBase.onChangeRoleAsync() is called. This call is not commonly overridden in the service.

    Note

    Waiting for runAsync to finish is necessary only if this replica is a primary replica.

  3. After the StatefulServiceBase.onChangeRoleAsync() method finishes, the StatefulServiceBase.onCloseAsync() method is called. This call is an uncommon override, but it is available.

  4. After StatefulServiceBase.onCloseAsync() finishes, the service object is destructed.
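
As a rough illustration of the ordering in the steps above, the following sketch simulates the shutdown of a primary replica with plain CompletableFuture objects. It is not how Service Fabric is implemented: ShutdownOrderSketch is hypothetical, and the strings are labels for the steps rather than real calls.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simulation; it does not use the Service Fabric SDK.
class ShutdownOrderSketch {
    // Returns the recorded event order for a simulated primary replica shutdown.
    static List<String> simulateShutdown() {
        List<String> events = new CopyOnWriteArrayList<>();

        // Step 1: two branches run in parallel, with no ordering between them.
        CompletableFuture<Void> listenersClosed =
                CompletableFuture.runAsync(() -> events.add("listeners closed"));
        CompletableFuture<Void> runAsyncFinished =
                CompletableFuture.runAsync(() -> events.add("runAsync finished after cancellation"));

        // Steps 2 to 4: each runs only after everything before it is done.
        return CompletableFuture.allOf(listenersClosed, runAsyncFinished)
                .thenRun(() -> events.add("onChangeRoleAsync"))
                .thenRun(() -> events.add("onCloseAsync"))
                .thenRun(() -> events.add("service object destructed"))
                .thenApply(ignored -> events)
                .join();
    }

    public static void main(String[] args) {
        // The first two entries can appear in either order; the last three are fixed.
        simulateShutdown().forEach(System.out::println);
    }
}
```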

Stateful service primary swaps

While a stateful service is running, communication listeners are opened and the runAsync method is called only for the primary replica of that stateful service. Secondary replicas are constructed, but see no further calls. While a stateful service is running, the replica that's currently the primary can change. The lifecycle events that a stateful replica sees depend on whether it is the replica being demoted or promoted during the swap.

For the demoted primary

Service Fabric needs the primary replica that's demoted to stop processing messages and stop any background work. This step is similar to when the service is shut down. One difference is that the service isn't destructed or closed, because it remains as a secondary. The following events occur:

  1. These events occur in parallel:
    • Any open listeners are closed. CommunicationListener.closeAsync() is called on each listener.
    • The cancellation token that was passed to runAsync() is canceled. A check of the cancellation token's isCancelled() method returns true. If called, the token's throwIfCancellationRequested() method throws an OperationCanceledException.
  2. After closeAsync() finishes on each listener and runAsync() also finishes, the service's StatefulServiceBase.onChangeRoleAsync() is called. This call is not commonly overridden in the service.

For the promoted secondary

Similarly, Service Fabric needs the secondary replica that's promoted to start listening for messages on the wire, and to start any background tasks that it needs to complete. This process is similar to when the service is created. The difference is that the replica itself already exists. The following events occur:

  1. These events occur in parallel:
    • StatefulServiceBase.createServiceReplicaListeners() is invoked and any returned listeners are opened. CommunicationListener.openAsync() is called on each listener.
    • The service's StatefulServiceBase.runAsync() method is called.
  2. After all the replica listeners' openAsync() calls finish and runAsync() is called, StatefulServiceBase.onChangeRoleAsync() is called. This call is not commonly overridden in the service.

Common issues during stateful service shutdown and primary demotion

Service Fabric changes the primary of a stateful service for multiple reasons. The most common reasons are cluster rebalancing and application upgrade. During these operations, it's important that the service respects the cancellationToken. This also applies during normal service shutdown, such as when the service is deleted.

Services that don't handle cancellation cleanly can experience several issues. Upgrades and rebalancing become slow because Service Fabric waits for the services to stop gracefully. This can ultimately lead to failed upgrades that time out and roll back. Failure to honor the cancellation token can also cause imbalanced clusters. Clusters become unbalanced because nodes get hot, but the services can't be rebalanced because it takes too long to move them elsewhere.

Because the services are stateful, it's also likely that the services use Reliable Collections. In Service Fabric, when a primary is demoted, one of the first things that happens is that write access to the underlying state is revoked. This leads to a second set of issues that might affect the service lifecycle. The collections return exceptions based on the timing and whether the replica is being moved or shut down. It's important to handle these exceptions correctly.

Exceptions thrown by Service Fabric are either permanent (FabricException) or transient (FabricTransientException). Permanent exceptions should be logged and thrown. Transient exceptions can be retried based on retry logic.
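
The retry logic for transient exceptions could take a shape like the one below. To keep the sketch self-contained, it defines its own PermanentFailure and TransientFailure classes as hypothetical stand-ins for FabricException and FabricTransientException; the withRetry helper, the attempt limit, and the linear backoff are likewise assumptions, not SDK features.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; it does not use the Service Fabric SDK.
class TransientRetry {
    // Stand-ins for the permanent and transient exception types.
    static class PermanentFailure extends RuntimeException {
        PermanentFailure(String message) { super(message); }
    }
    static class TransientFailure extends RuntimeException {
        TransientFailure(String message) { super(message); }
    }

    // Retries only transient failures; anything else propagates immediately.
    static <T> T withRetry(Supplier<T> operation, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (TransientFailure e) {
                last = e;
                if (attempt < maxAttempts) {
                    pause(backoffMillis * attempt); // simple linear backoff
                }
            }
        }
        throw last; // attempts exhausted: surface the last transient failure
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientFailure("not ready");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "ok after 3 calls"
    }
}
```

In a service, the retry loop would also need to stop when the cancellation token is canceled, so that retries don't delay shutdown or demotion.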

An important part of testing and validating Reliable Services is handling the exceptions that come from using Reliable Collections in conjunction with service lifecycle events. We recommend that you always run your service under load. You should also perform upgrades and chaos testing before deploying to production. These basic steps help ensure that your service is implemented correctly, and that it handles lifecycle events correctly.

Notes on service lifecycle

  • Both the runAsync() method and the createServiceInstanceListeners/createServiceReplicaListeners calls are optional. A service might have one, both, or neither. For example, if the service does all its work in response to user calls, there's no need for it to implement runAsync(). Only the communication listeners and their associated code are necessary. Similarly, creating and returning communication listeners is optional. The service might have only background work to do, so it only needs to implement runAsync().
  • It's valid for a service to complete runAsync() successfully and return from it. This isn't considered a failure condition. It represents the background work of the service finishing. For stateful Reliable Services, runAsync() would be called again if the service is demoted from primary, and then promoted back to primary.
  • If a service exits from runAsync() by throwing some unexpected exception, this is a failure. The service object is shut down, and a health error is reported.
  • Although there's no time limit on returning from these methods after cancellation is requested, you immediately lose the ability to write. Therefore, you can't complete any real work. We recommend that you return as quickly as possible upon receiving the cancellation request. If your service doesn't respond to these API calls in a reasonable amount of time, Service Fabric might forcibly terminate your service. Usually, this happens only during application upgrades or when a service is being deleted. This timeout is 15 minutes by default.
  • Failures in the onCloseAsync() path result in onAbort() being called. This call is a last-chance, best-effort opportunity for the service to clean up and release any resources that it has claimed. It is generally called when a permanent fault is detected on the node, or when Service Fabric cannot reliably manage the service instance's lifecycle due to internal failures.
  • onChangeRoleAsync() is called when the stateful service replica is changing role, for example to primary or secondary. Primary replicas are given write status (are allowed to create and write to Reliable Collections). Secondary replicas are given read status (can only read from existing Reliable Collections). Most work in a stateful service is performed at the primary replica. Secondary replicas can perform read-only validation, report generation, data mining, or other read-only jobs.
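
To make the last note concrete, a service can record its current role when the role changes and use it to decide which kind of work to run. The sketch below is an assumption-laden illustration: RoleAwareReplica, its Role enum, and the job labels are hypothetical, and Service Fabric enforces read and write status on its own regardless of any such bookkeeping.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical replica model; it does not use the Service Fabric SDK.
class RoleAwareReplica {
    // Stand-in for the replica roles discussed in the notes above.
    enum Role { NONE, PRIMARY, SECONDARY }

    private volatile Role role = Role.NONE;

    // Stand-in for an onChangeRoleAsync override: remember the new role.
    CompletableFuture<Void> onChangeRoleAsync(Role newRole) {
        role = newRole;
        return CompletableFuture.completedFuture(null);
    }

    // Only a primary has write status.
    boolean canWrite() {
        return role == Role.PRIMARY;
    }

    // Picks the kind of work that fits the current role.
    String chooseJob() {
        switch (role) {
            case PRIMARY:
                return "read-write work";
            case SECONDARY:
                return "read-only job";
            default:
                return "no work";
        }
    }

    public static void main(String[] args) {
        RoleAwareReplica replica = new RoleAwareReplica();
        replica.onChangeRoleAsync(Role.PRIMARY).join();
        System.out.println(replica.chooseJob()); // prints "read-write work"
        replica.onChangeRoleAsync(Role.SECONDARY).join();
        System.out.println(replica.chooseJob()); // prints "read-only job"
    }
}
```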

Next steps