Introduction to the Fault Analysis Service

The Fault Analysis Service is designed for testing services that are built on Azure Service Fabric. With the Fault Analysis Service, you can induce meaningful faults and run complete test scenarios against your applications. These faults and scenarios exercise and validate the numerous states and transitions that a service experiences throughout its lifetime, all in a controlled, safe, and consistent manner.

Actions are the individual faults that target a service in order to test it. A service developer can use them as building blocks to write complicated scenarios. For example:

  • Restart a node to simulate any number of situations where a machine or VM is rebooted.
  • Move a replica of your stateful service to simulate load balancing, failover, or application upgrade.
  • Invoke quorum loss on a stateful service to create a situation where write operations can't proceed because there aren't enough "back-up" or "secondary" replicas to accept new data.
  • Invoke data loss on a stateful service to create a situation where all in-memory state is completely wiped out.
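
The quorum loss action in the list above can be illustrated with a small toy model. This is a sketch only, not Service Fabric code: it assumes that a write quorum is a simple majority of the replica set (primary included), and the function names `write_quorum` and `can_accept_writes` are hypothetical names introduced for this illustration.

```python
def write_quorum(replica_count: int) -> int:
    """Assumed quorum rule: a simple majority of the replica set."""
    return replica_count // 2 + 1


def can_accept_writes(replica_count: int, replicas_up: int) -> bool:
    """Writes proceed only while at least a quorum of replicas is up."""
    return replicas_up >= write_quorum(replica_count)


if __name__ == "__main__":
    # Five replicas: one primary and four secondaries.
    # Taking replicas down one by one shows when writes stop.
    for up in range(5, 0, -1):
        allowed = can_accept_writes(5, up)
        print(f"{up}/5 replicas up -> writes allowed: {allowed}")
```

Under this assumption, inducing quorum loss amounts to taking down enough replicas that the remaining ones no longer form a majority, which is the condition a service's write path has to tolerate.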

Scenarios are complex operations composed of one or more actions. The Fault Analysis Service provides two built-in complete scenarios:

  • Chaos Scenario
  • Failover Scenario

Testing as a service

The Fault Analysis Service is a Service Fabric system service that is automatically started with a Service Fabric cluster. This service acts as the host for fault injection, test scenario execution, and health analysis.

Figure: Fault Analysis Service

When a fault action or test scenario is initiated, a command is sent to the Fault Analysis Service to run it. The Fault Analysis Service is stateful so that it can reliably run faults and scenarios and validate results. For example, a long-running test scenario can be reliably executed by the Fault Analysis Service. And because tests are executed inside the cluster, the service can examine the state of the cluster and your services to provide more in-depth information about failures.

Testing distributed systems

Service Fabric makes the job of writing and managing distributed scalable applications significantly easier. The Fault Analysis Service makes testing a distributed application similarly easier. There are three main issues that need to be solved while testing:

  1. Simulating/generating failures that might occur in real-world scenarios: One of the important aspects of Service Fabric is that it enables distributed applications to recover from various failures. However, to test that the application is able to recover from these failures, we need a mechanism to simulate/generate these real-world failures in a controlled test environment.
  2. The ability to generate correlated failures: Basic failures in the system, such as network failures and machine failures, are easy to produce individually. Generating a significant number of the scenarios that can happen in the real world as a result of the interactions of these individual failures is non-trivial.
  3. Unified experience across various levels of development and deployment: There are many fault injection systems that can induce various types of failures. However, the experience in all of them is poor when moving from one-box developer scenarios, to running the same tests in large test environments, to using them for tests in production.

While there are many mechanisms to solve these problems, what is missing is a system that does so with the required guarantees, all the way from a one-box developer environment to tests in production clusters. The Fault Analysis Service helps application developers concentrate on testing their business logic. It provides all the capabilities needed to test the interaction of the service with the underlying distributed system.

Simulating/generating real-world failure scenarios

To test the robustness of a distributed system against failures, we need a mechanism to generate failures. In theory, generating a failure such as a node going down seems easy, but it quickly runs into the same set of consistency problems that Service Fabric is trying to solve. As an example, if we want to shut down a node, the required workflow is the following:

  1. From the client, issue a shutdown node request.

  2. Send the request to the right node.

    a. If the node is not found, the request should fail.

    b. If the node is found, the request should return only once the node is shut down.

To verify the failure from a test perspective, the test needs to know that when this failure is induced, the failure actually happens. The guarantee that Service Fabric provides is that either the node will go down or it was already down when the command reached the node. In either case, the test should be able to correctly reason about the state and succeed or fail correctly in its validation. A system implemented outside of Service Fabric to induce the same set of failures could hit many network, hardware, and software issues, which would prevent it from providing the preceding guarantees. When such issues occur, Service Fabric reconfigures the cluster state to work around them, so the Fault Analysis Service is still able to give the right set of guarantees.
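
The shutdown workflow and its guarantee can be sketched with an in-memory stand-in. This is a toy model under stated assumptions, not the Fault Analysis Service API: `ToyCluster`, `NodeNotFoundError`, and `shutdown_node` are hypothetical names, and the "cluster" is just a dictionary. The point it illustrates is that the call either fails because the node is unknown or returns only when the node is down, whether the call shut it down or it was down already.

```python
class NodeNotFoundError(Exception):
    """Raised when the target node is not part of the cluster."""


class ToyCluster:
    """In-memory stand-in for a cluster; not a Service Fabric API."""

    def __init__(self, node_names):
        # Every node starts in the "up" state.
        self._up = {name: True for name in node_names}

    def is_up(self, name):
        return self._up[name]

    def shutdown_node(self, name):
        """Fail if the node is unknown; otherwise return once it is down."""
        if name not in self._up:
            raise NodeNotFoundError(name)
        was_up = self._up[name]
        self._up[name] = False
        # Either outcome leaves the node down, so the caller can
        # validate the post-condition without knowing which one occurred.
        return "shut down" if was_up else "already down"


if __name__ == "__main__":
    cluster = ToyCluster(["node-1", "node-2"])
    print(cluster.shutdown_node("node-1"))
    print(cluster.shutdown_node("node-1"))
    assert not cluster.is_up("node-1")
```

A test built on such a contract only has to check one post-condition (the node is down) after a successful return, which is what makes the validation step reliable.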

Generating required events and scenarios

While simulating a real-world failure consistently is tough to start with, generating correlated failures is even tougher. For example, data loss happens in a stateful persisted service when the following things happen:

  1. Only a write quorum of the replicas is caught up on replication. All the secondary replicas lag behind the primary.
  2. The write quorum goes down because the replicas go down (due to a code package or node going down).
  3. The write quorum cannot come back up because the data for the replicas is lost (due to disk corruption or machine reimaging).

These correlated failures do happen in the real world, but not as frequently as individual failures. The ability to test for these scenarios before they happen in production is critical. Even more important is the ability to simulate these scenarios with production workloads in controlled circumstances (in the middle of the day with all engineers on deck). That is much better than having it happen for the first time in production at 2:00 A.M.
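
The three data loss conditions listed above can be walked through with a toy model. This is an illustrative sketch, not how Service Fabric tracks replicas: the `Replica` class, its fields, and `data_loss_occurred` are hypothetical, and the model assumes a committed write is lost only when every replica that had applied it has also lost its stored data.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    last_applied: int         # highest sequence number this replica has applied
    up: bool = True           # False while the replica or its node is down
    data_intact: bool = True  # False after disk corruption or reimaging


def data_loss_occurred(replicas, committed_seq):
    """True when no replica holding the committed write still has its data."""
    holders = [r for r in replicas if r.last_applied >= committed_seq]
    return not any(r.data_intact for r in holders)


if __name__ == "__main__":
    # Step 1: only a write quorum (3 of 5) is caught up to sequence 10.
    replicas = [
        Replica("primary", 10), Replica("s1", 10), Replica("s2", 10),
        Replica("s3", 8), Replica("s4", 8),
    ]
    quorum = replicas[:3]

    # Step 2: the write quorum goes down; the data is still on disk.
    for r in quorum:
        r.up = False
    print("after quorum down:", data_loss_occurred(replicas, 10))

    # Step 3: the quorum's data is lost, so it cannot come back up.
    for r in quorum:
        r.data_intact = False
    print("after data lost:", data_loss_occurred(replicas, 10))
```

The model makes the correlation explicit: neither the lagging secondaries nor the quorum going down causes data loss on its own; it takes all three conditions together.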

Unified experience across different environments

Traditionally, the practice has been to create three different sets of experiences: one for the development environment, one for tests, and one for production. The model was:

  1. In the development environment, produce state transitions that allow unit tests of individual methods.
  2. In the test environment, produce failures to allow end-to-end tests that exercise various failure scenarios.
  3. Keep the production environment pristine to prevent any non-natural failures and to ensure that there is extremely quick human response to failure.

In Service Fabric, through the Fault Analysis Service, we propose to turn this around and use the same methodology from the developer environment to production. There are two ways to achieve this:

  1. To induce controlled failures, use the Fault Analysis Service APIs from a one-box environment all the way to production clusters.
  2. To give the cluster a fever that causes automatic induction of failures, use the Fault Analysis Service to generate automatic failures. Controlling the rate of failures through configuration enables the same service to be tested differently in different environments.

With Service Fabric, although the scale of failures differs between environments, the actual mechanisms are identical. This allows for a much quicker code-to-deployment pipeline and the ability to test services under real-world loads.
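
The idea of one fault mechanism with an environment-specific rate can be sketched as follows. This is a minimal illustration, not Fault Analysis Service configuration: the environment names, the rates in `FAULT_RATES`, and the `plan_faults` function are all hypothetical values and names chosen for the example.

```python
import random

# Hypothetical per-environment probability of inducing a fault on each tick.
FAULT_RATES = {
    "one-box": 0.5,
    "test": 0.2,
    "production": 0.01,
}


def plan_faults(environment, ticks, seed=0):
    """Return one boolean per tick: True means 'induce a fault now'.

    The decision logic is identical everywhere; only the configured
    rate differs between environments.
    """
    rng = random.Random(seed)  # seeded so that a plan is reproducible
    rate = FAULT_RATES[environment]
    return [rng.random() < rate for _ in range(ticks)]


if __name__ == "__main__":
    for env in FAULT_RATES:
        plan = plan_faults(env, ticks=1000)
        print(f"{env}: {sum(plan)} faults planned in {len(plan)} ticks")
```

Because the same code path produces the plan in every environment, a service that survives the aggressive one-box rate is being exercised by the same mechanism that runs, far less often, in production.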

Using the Fault Analysis Service

C#

Fault Analysis Service features are in the System.Fabric namespace in the Microsoft.ServiceFabric NuGet package. To use the Fault Analysis Service features, include the NuGet package as a reference in your project.

PowerShell

To use PowerShell, you must install the Service Fabric SDK. After the SDK is installed, the ServiceFabric PowerShell module is automatically loaded for you to use.

Next steps

To create truly cloud-scale services, it is critical to ensure, both before and after deployment, that services can withstand real-world failures. In the services world today, the ability to innovate quickly and move code to production quickly is very important. The Fault Analysis Service helps service developers do precisely that.

Begin testing your applications and services using the built-in test scenarios, or author your own test scenarios using the fault actions provided by the Fault Analysis Service.