Service Fabric testability scenarios: Service communication

Microservices and service-oriented architectural styles surface naturally in Azure Service Fabric. In these types of distributed architectures, componentized microservice applications are typically composed of multiple services that need to talk to each other. Even in the simplest cases, you generally have at least a stateless web service and a stateful data storage service that need to communicate.

Service-to-service communication is a critical integration point of an application, because each service exposes a remote API to other services. Working with a set of API boundaries that involve I/O requires care, along with a good amount of testing and validation.

There are several considerations to make when these service boundaries are wired together in a distributed system:

  • Transport protocol. Will you use HTTP for increased interoperability, or a custom binary protocol for maximum throughput?
  • Error handling. How will permanent and transient errors be handled? What will happen when a service moves to a different node?
  • Timeouts and latency. In multitiered applications, how will each service layer handle latency through the stack and to the user?
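
One way to approach the error-handling and timeout questions above is to retry transient failures with backoff, fail fast on permanent ones, and bound the whole call by a deadline. The following Python sketch is illustrative only: `TransientError`, `PermanentError`, and `call_with_retry` are hypothetical names, not part of any Service Fabric API.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (for example, a dropped connection)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (for example, a malformed request)."""


def call_with_retry(operation, deadline_seconds=5.0, base_delay=0.1,
                    sleep=time.sleep, clock=time.monotonic):
    """Run operation(), retrying transient failures until the deadline would be exceeded."""
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except PermanentError:
            raise  # retrying would not help, so surface the error immediately
        except TransientError:
            attempt += 1
            delay = base_delay * (2 ** (attempt - 1))  # exponential backoff
            if clock() - start + delay > deadline_seconds:
                raise TimeoutError("deadline exceeded while retrying")
            sleep(delay)
```

In a multitiered application, each layer would typically pass a smaller deadline to the layer below it, so that the total latency seen by the user stays bounded.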

Whether you use one of the built-in service communication components provided by Service Fabric or you build your own, testing the interactions between your services is critical to ensuring resiliency in your application.

Prepare for services to move

Service instances may move around over time. This is especially true when they are configured with load metrics for custom resource balancing. Service Fabric moves your service instances to maximize their availability even during upgrades, failovers, scale-out, and other situations that occur over the lifetime of a distributed system.

As services move around in the cluster, your clients and other services should be prepared to handle two scenarios when they talk to a service:

  • The service instance or partition replica has moved since the last time you talked to it. This is a normal part of a service lifecycle, and you should expect it to happen during the lifetime of your application.
  • The service instance or partition replica is in the process of moving. Although failover of a service from one node to another occurs very quickly in Service Fabric, there may be a delay in availability if the communication component of your service is slow to start.

Handling these scenarios gracefully is important for a smooth-running system. To do so, keep in mind that:

  • Every service that can be connected to has an address that it listens on (for example, HTTP or WebSockets). When a service instance or partition moves, its address endpoint changes. (It moves to a different node with a different IP address.) If you're using the built-in communication components, they will handle re-resolving service addresses for you.
  • There may be a temporary increase in service latency as the service instance starts up its listener again. This depends on how quickly the service opens the listener after the service instance is moved.
  • Any existing connections need to be closed and reopened after the service opens on a new node. A graceful node shutdown or restart allows time for existing connections to be shut down gracefully.
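
The built-in communication components re-resolve addresses for you. If you build your own, the client needs similar behavior. The following Python sketch shows one possible shape, assuming a cached endpoint that is discarded and resolved again when a connection fails. `ResolvingClient`, `resolve`, and `send` are hypothetical names used for illustration and do not correspond to a Service Fabric API.

```python
class ResolvingClient:
    """Caches a service endpoint and re-resolves it when a call fails to connect."""

    def __init__(self, resolve, send, max_attempts=3):
        self._resolve = resolve            # hypothetical: service name -> current endpoint address
        self._send = send                  # hypothetical: (endpoint, request) -> response
        self._max_attempts = max_attempts  # expected to be at least 1
        self._endpoints = {}               # cached endpoint per service name

    def call(self, service_name, request):
        last_error = None
        for _ in range(self._max_attempts):
            endpoint = self._endpoints.get(service_name)
            if endpoint is None:
                endpoint = self._resolve(service_name)
                self._endpoints[service_name] = endpoint
            try:
                return self._send(endpoint, request)
            except ConnectionError as error:
                # The replica may have moved: drop the stale address and resolve again.
                last_error = error
                self._endpoints.pop(service_name, None)
        raise last_error
```

A production client would usually also wait briefly between attempts, because the listener on the new node may not be open yet.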

Test it: Move service instances

By using Service Fabric's testability tools, you can author a test scenario to test these situations in different ways:

  1. Move a stateful service's primary replica.

    The primary replica of a stateful service partition can be moved for any number of reasons. Use this action to target the primary replica of a specific partition and see how your services react to the move in a controlled manner.

    PS > Move-ServiceFabricPrimaryReplica -PartitionId 6faa4ffa-521a-44e9-8351-dfca0f7e0466 -ServiceName fabric:/MyApplication/MyService
  2. Stop a node.

    When a node is stopped, Service Fabric moves all of the service instances or partitions that were on that node to other available nodes in the cluster. Use this to test a situation where a node is lost from your cluster and all of the service instances and replicas on that node have to move.

    You can stop a node by using the PowerShell Stop-ServiceFabricNode cmdlet:

    PS > Stop-ServiceFabricNode -NodeName Node_1

Maintain service availability

As a platform, Service Fabric is designed to provide high availability of your services. But in extreme cases, underlying infrastructure problems can still cause unavailability. It is important to test for these scenarios, too.

Stateful services use a quorum-based system to replicate state for high availability. This means that a quorum of replicas needs to be available to perform write operations. In rare cases, such as a widespread hardware failure, a quorum of replicas may not be available. In these cases, you will not be able to perform write operations, but you will still be able to perform read operations.
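
The arithmetic behind quorum loss can be sketched in a few lines of Python. This sketch assumes a simple majority quorum (more than half of the replica set) and that reads remain possible while at least one replica is up. Both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def quorum_size(replica_count):
    """Smallest majority of a replica set (assumed majority rule)."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count // 2 + 1


def partition_mode(available_replicas, replica_count):
    """Classify what a partition can serve given how many replicas are up."""
    if available_replicas >= quorum_size(replica_count):
        return "read-write"
    if available_replicas > 0:
        return "read-only"  # quorum lost, but a remaining replica can still answer reads
    return "unavailable"
```

Under these assumptions, a partition with five replicas needs three of them to accept writes, so losing three replicas leaves it read-only.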

Test it: Write operation unavailability

By using the testability tools in Service Fabric, you can inject a fault that induces quorum loss as a test. Although such a scenario is rare, it is important that clients and services that depend on a stateful service are prepared to handle situations where they cannot make write requests to it. It is also important that the stateful service itself is aware of this possibility and can gracefully communicate it to callers.
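
As a rough illustration of communicating write unavailability gracefully, a service can translate a failed write into an explicit, retryable response instead of letting the request hang. In this Python sketch, `WriteUnavailableError`, `handle_save`, and the response fields are hypothetical; the 503 status with a retry hint follows the common HTTP convention and is an assumption about how the service is exposed.

```python
class WriteUnavailableError(Exception):
    """Hypothetical: raised by the storage layer when a write cannot reach quorum."""


def handle_save(store_write, key, value):
    """Translate a failed write into a response the caller can act on."""
    try:
        store_write(key, value)
    except WriteUnavailableError:
        # Tell the caller the condition is temporary and when to try again.
        return {
            "status": 503,
            "retry_after_seconds": 5,
            "error": "writes temporarily unavailable",
        }
    return {"status": 200}
```

Running a test client against such a handler while quorum loss is induced shows whether callers back off and recover once writes become available again.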

You can induce quorum loss by using the PowerShell Invoke-ServiceFabricPartitionQuorumLoss cmdlet:

PS > Invoke-ServiceFabricPartitionQuorumLoss -ServiceName fabric:/MyApplication/MyService -QuorumLossMode QuorumReplicas -QuorumLossDurationInSeconds 20

In this example, QuorumLossMode is set to QuorumReplicas to indicate that we want to induce quorum loss without taking down all replicas. This way, read operations are still possible. To test a scenario where an entire partition is unavailable, set this switch to AllReplicas.

Next steps

Learn more about testability actions

Learn more about testability scenarios