Testability scenarios

Large distributed systems such as cloud infrastructures are inherently unreliable. Azure Service Fabric gives developers the ability to write services that run on top of unreliable infrastructure. To write high-quality services, developers need to be able to induce such unreliability to test the stability of their services.

The Fault Analysis Service gives developers the ability to induce fault actions to test services in the presence of failures. However, targeted simulated faults get you only so far. To take the testing further, you can use the test scenarios in Service Fabric: a chaos test and a failover test. These scenarios simulate continuous interleaved faults, both graceful and ungraceful, throughout the cluster over extended periods of time. Once a test is configured with the rate and kind of faults, it can be started through either C# APIs or PowerShell to generate faults in the cluster and your service.

Warning

ChaosTestScenario is being replaced by a more resilient, service-based Chaos. For more details, refer to the new article Controlled Chaos.

Chaos test

The chaos scenario generates faults across the entire Service Fabric cluster. The scenario compresses faults generally seen over months or years into a few hours. The combination of interleaved faults and a high fault rate finds corner cases that would otherwise be missed. This leads to a significant improvement in the code quality of the service.

Faults simulated in the chaos test

  • Restart a node
  • Restart a deployed code package
  • Remove a replica
  • Restart a replica
  • Move a primary replica (optional)
  • Move a secondary replica (optional)

The chaos test runs multiple iterations of faults and cluster validations for the specified period of time. The time allowed for the cluster to stabilize and for validation to succeed is also configurable. The scenario fails as soon as a single cluster validation fails.

For example, consider a test set to run for one hour with a maximum of three concurrent faults. The test induces three faults and then validates the cluster health. It repeats this step until the cluster becomes unhealthy or one hour passes. If the cluster becomes unhealthy in any iteration, that is, it does not stabilize within the configured time, the test fails with an exception. This exception indicates that something has gone wrong and needs further investigation.
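The iteration described above can be sketched as a small, self-contained model. This is only an illustration under stated assumptions: it does not call any Service Fabric API, the names ChaosLoopSketch, induceFault, validateCluster, and elapsed are hypothetical, and it always induces exactly the maximum number of faults per iteration, whereas the real engine treats that value as an upper bound.

```csharp
using System;

// Hypothetical model of the chaos test loop; not part of System.Fabric.
public static class ChaosLoopSketch
{
    // Repeats "induce faults, then validate" until timeToRun has elapsed.
    // Returns the number of iterations whose validation succeeded and
    // throws on the first failed validation, mirroring "fail with an exception".
    public static int Run(
        TimeSpan timeToRun,
        int maxConcurrentFaults,
        Action induceFault,
        Func<bool> validateCluster,
        Func<TimeSpan> elapsed)
    {
        int completed = 0;
        while (elapsed() < timeToRun)
        {
            // Simplification: always induce the maximum number of faults.
            for (int i = 0; i < maxConcurrentFaults; i++)
            {
                induceFault();
            }

            if (!validateCluster())
            {
                throw new InvalidOperationException(
                    $"Cluster did not stabilize in iteration {completed + 1}.");
            }

            completed++;
        }

        return completed;
    }

    public static void Main()
    {
        // Simulated clock: each induced fault advances time by 5 minutes.
        TimeSpan clock = TimeSpan.Zero;
        int faults = 0;

        int iterations = Run(
            timeToRun: TimeSpan.FromMinutes(60),
            maxConcurrentFaults: 3,
            induceFault: () => { faults++; clock += TimeSpan.FromMinutes(5); },
            validateCluster: () => true,
            elapsed: () => clock);

        // Prints "iterations=4, faults=12".
        Console.WriteLine($"iterations={iterations}, faults={faults}");
    }
}
```

Passing a validateCluster delegate that returns false makes Run throw on that iteration, which corresponds to the scenario failing on the first unsuccessful cluster validation.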

In its current form, the fault generation engine in the chaos test induces only safe faults. This means that in the absence of external faults, quorum loss or data loss will never occur.
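One way to reason about what "safe" means here is a majority-quorum check. The sketch below is an assumption for illustration only: the article does not describe how the engine selects faults, and the names SafeFaultSketch, QuorumSize, and IsSafe are hypothetical.

```csharp
using System;

// Hypothetical illustration of a "safe fault" check; not the engine's actual logic.
public static class SafeFaultSketch
{
    // Assumption: quorum is a simple majority of the replica set.
    public static int QuorumSize(int replicaSetSize) => replicaSetSize / 2 + 1;

    // True if taking down toFault more replicas still leaves a quorum running.
    public static bool IsSafe(int replicaSetSize, int alreadyDown, int toFault)
    {
        int remaining = replicaSetSize - alreadyDown - toFault;
        return remaining >= QuorumSize(replicaSetSize);
    }

    public static void Main()
    {
        Console.WriteLine(IsSafe(5, 0, 2)); // True: 3 of 5 replicas remain
        Console.WriteLine(IsSafe(5, 1, 2)); // False: only 2 of 5 replicas remain
        Console.WriteLine(IsSafe(3, 0, 1)); // True: 2 of 3 replicas remain
    }
}
```

Under this model, an external fault that has already taken a replica down (alreadyDown greater than zero) is exactly what can turn an otherwise safe fault into quorum loss, which is why the guarantee is stated only for the absence of external faults.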

Important configuration options

  • TimeToRun: Total time that the test runs before finishing with success. The test can finish earlier if a validation failure occurs.
  • MaxClusterStabilizationTimeout: Maximum amount of time to wait for the cluster to become healthy before failing the test. The checks performed are whether cluster health is OK, service health is OK, the target replica set size is achieved for the service partition, and no InBuild replicas exist.
  • MaxConcurrentFaults: Maximum number of concurrent faults induced in each iteration. The higher the number, the more aggressive the test, which results in more complex failovers and transition combinations. The test guarantees that in the absence of external faults there will be no quorum loss or data loss, irrespective of how high this value is.
  • EnableMoveReplicaFaults: Enables or disables the faults that cause the primary or secondary replicas to move. These faults are disabled by default.
  • WaitTimeBetweenIterations: Amount of time to wait between iterations, that is, after a round of faults and the corresponding validation.
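The stabilization checks listed under MaxClusterStabilizationTimeout can be modeled as a single predicate. The following is a minimal sketch with hypothetical types (Health, PartitionSnapshot, StabilizationCheckSketch) that stand in for real cluster state. It assumes that only an Ok health state counts as healthy, which the article does not spell out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for cluster state; these are not Service Fabric types.
public enum Health { Ok, Warning, Error }

public sealed class PartitionSnapshot
{
    public int TargetReplicaSetSize { get; set; }
    public int ReadyReplicas { get; set; }
    public int InBuildReplicas { get; set; }
}

public static class StabilizationCheckSketch
{
    // Combines the four checks: cluster health, service health,
    // target replica set size reached, and no InBuild replicas.
    public static bool IsStable(
        Health clusterHealth,
        Health serviceHealth,
        IEnumerable<PartitionSnapshot> partitions)
    {
        return clusterHealth == Health.Ok
            && serviceHealth == Health.Ok
            && partitions.All(p =>
                p.ReadyReplicas >= p.TargetReplicaSetSize && p.InBuildReplicas == 0);
    }

    public static void Main()
    {
        var healthy = new[]
        {
            new PartitionSnapshot { TargetReplicaSetSize = 3, ReadyReplicas = 3, InBuildReplicas = 0 }
        };
        var rebuilding = new[]
        {
            new PartitionSnapshot { TargetReplicaSetSize = 3, ReadyReplicas = 2, InBuildReplicas = 1 }
        };

        Console.WriteLine(IsStable(Health.Ok, Health.Ok, healthy));      // True
        Console.WriteLine(IsStable(Health.Ok, Health.Ok, rebuilding));   // False
        Console.WriteLine(IsStable(Health.Warning, Health.Ok, healthy)); // False
    }
}
```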

How to run the chaos test

C# sample

using System;
using System.Fabric;
using System.Fabric.Testability.Scenario;
using System.Threading;
using System.Threading.Tasks;

class Test
{
    public static int Main(string[] args)
    {
        string clusterConnection = "localhost:19000";

        Console.WriteLine("Starting Chaos Test Scenario...");
        try
        {
            RunChaosTestScenarioAsync(clusterConnection).Wait();
        }
        catch (AggregateException ae)
        {
            Console.WriteLine("Chaos Test Scenario did not complete: ");
            foreach (Exception ex in ae.InnerExceptions)
            {
                if (ex is FabricException)
                {
                    Console.WriteLine("HResult: {0} Message: {1}", ex.HResult, ex.Message);
                }
            }
            return -1;
        }

        Console.WriteLine("Chaos Test Scenario completed.");
        return 0;
    }

    static async Task RunChaosTestScenarioAsync(string clusterConnection)
    {
        TimeSpan maxClusterStabilizationTimeout = TimeSpan.FromSeconds(180);
        uint maxConcurrentFaults = 3;
        bool enableMoveReplicaFaults = true;

        // Create FabricClient with connection and security information here.
        FabricClient fabricClient = new FabricClient(clusterConnection);

        // The chaos test scenario runs for 60 minutes or until it fails.
        TimeSpan timeToRun = TimeSpan.FromMinutes(60);
        ChaosTestScenarioParameters scenarioParameters = new ChaosTestScenarioParameters(
          maxClusterStabilizationTimeout,
          maxConcurrentFaults,
          enableMoveReplicaFaults,
          timeToRun);

        // Other related parameters:
        // Pause between two iterations for a random duration bound by this value.
        // scenarioParameters.WaitTimeBetweenIterations = TimeSpan.FromSeconds(30);
        // Pause between concurrent actions for a random duration bound by this value.
        // scenarioParameters.WaitTimeBetweenFaults = TimeSpan.FromSeconds(10);

        // Create the scenario class and execute it asynchronously.
        ChaosTestScenario chaosScenario = new ChaosTestScenario(fabricClient, scenarioParameters);

        try
        {
            await chaosScenario.ExecuteAsync(CancellationToken.None);
        }
        catch (AggregateException ae)
        {
            throw ae.InnerException;
        }
    }
}

PowerShell

The Service Fabric PowerShell module includes two ways to begin a chaos scenario. Invoke-ServiceFabricChaosTestScenario is client-based: if the client machine is shut down midway through the test, no further faults are introduced. Alternatively, a set of commands keeps the test running even if the client machine shuts down. Start-ServiceFabricChaos uses a stateful, reliable system service called FaultAnalysisService, which ensures that faults continue to be introduced until TimeToRun elapses. Stop-ServiceFabricChaos can be used to manually stop the scenario, and Get-ServiceFabricChaosReport obtains a report. For more information, see the Azure Service Fabric PowerShell reference and Inducing controlled chaos in Service Fabric clusters.

$connection = "localhost:19000"
$timeToRun = 60
$maxStabilizationTimeSecs = 180
$concurrentFaults = 3
$waitTimeBetweenIterationsSec = 60

Connect-ServiceFabricCluster $connection

Invoke-ServiceFabricChaosTestScenario -TimeToRunMinute $timeToRun -MaxClusterStabilizationTimeoutSec $maxStabilizationTimeSecs -MaxConcurrentFaults $concurrentFaults -EnableMoveReplicaFaults -WaitTimeBetweenIterationsSec $waitTimeBetweenIterationsSec

Failover test

The failover test scenario is a version of the chaos test scenario that targets a specific service partition. It tests the effect of failover on a specific service partition while leaving the other services unaffected. Once it's configured with the target partition information and other parameters, it runs as a client-side tool that uses either C# APIs or PowerShell to generate faults for a service partition. The scenario iterates through a sequence of simulated faults and service validation while your business logic runs on the side to provide a workload. A failure in service validation indicates an issue that needs further investigation.

Faults simulated in the failover test

  • Restart a deployed code package where the partition is hosted
  • Remove a primary/secondary replica or stateless instance
  • Restart a primary/secondary replica (if a persisted service)
  • Move a primary replica
  • Move a secondary replica
  • Restart the partition

The failover test induces a chosen fault and then runs validation on the service to ensure its stability. The failover test induces only one fault at a time, as opposed to the possibly multiple concurrent faults in the chaos test. If the service partition does not stabilize within the configured timeout after each fault, the test fails. The test induces only safe faults. This means that in the absence of external failures, quorum loss or data loss will not occur.
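The "stabilize within the configured timeout" step can be pictured as polling a stability check until it succeeds or the timeout elapses. The sketch below is a generic polling helper written under that assumption. StabilizationWaitSketch and WaitForStableAsync are hypothetical names, and the article does not say how the real scenario implements its wait.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical polling helper; not part of System.Fabric.
public static class StabilizationWaitSketch
{
    // Polls isStable until it returns true (result: true)
    // or until the timeout elapses (result: false).
    public static async Task<bool> WaitForStableAsync(
        Func<bool> isStable,
        TimeSpan timeout,
        TimeSpan pollInterval,
        CancellationToken cancellationToken)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        while (true)
        {
            if (isStable())
            {
                return true;
            }

            if (stopwatch.Elapsed >= timeout)
            {
                return false;
            }

            await Task.Delay(pollInterval, cancellationToken);
        }
    }

    public static async Task Main()
    {
        // Simulated partition that reports stable on the third poll.
        int polls = 0;
        bool stabilized = await WaitForStableAsync(
            () => ++polls >= 3,
            TimeSpan.FromSeconds(2),
            TimeSpan.FromMilliseconds(10),
            CancellationToken.None);
        Console.WriteLine(stabilized); // True

        // Simulated partition that never stabilizes: the wait times out.
        bool neverStable = await WaitForStableAsync(
            () => false,
            TimeSpan.FromMilliseconds(50),
            TimeSpan.FromMilliseconds(10),
            CancellationToken.None);
        Console.WriteLine(neverStable); // False
    }
}
```

In a failover-style loop, a false result after any single fault would end the run as a failure, while a true result would be followed by the configured wait and then the next fault.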

Important configuration options

  • PartitionSelector: Selector object that specifies the partition to target.
  • TimeToRun: Total time that the test runs before finishing.
  • MaxServiceStabilizationTimeout: Maximum amount of time to wait for the service to become healthy before failing the test. The checks performed are whether service health is OK, the target replica set size is achieved for all partitions, and no InBuild replicas exist.
  • WaitTimeBetweenFaults: Amount of time to wait between every fault and validation cycle.

How to run the failover test

C#

using System;
using System.Fabric;
using System.Fabric.Testability.Scenario;
using System.Threading;
using System.Threading.Tasks;

class Test
{
    public static int Main(string[] args)
    {
        string clusterConnection = "localhost:19000";
        Uri serviceName = new Uri("fabric:/samples/PersistentToDoListApp/PersistentToDoListService");

        Console.WriteLine("Starting Failover Test Scenario...");
        try
        {
            RunFailoverTestScenarioAsync(clusterConnection, serviceName).Wait();
        }
        catch (AggregateException ae)
        {
            Console.WriteLine("Failover Test Scenario did not complete: ");
            foreach (Exception ex in ae.InnerExceptions)
            {
                if (ex is FabricException)
                {
                    Console.WriteLine("HResult: {0} Message: {1}", ex.HResult, ex.Message);
                }
            }
            return -1;
        }

        Console.WriteLine("Failover Test Scenario completed.");
        return 0;
    }

    static async Task RunFailoverTestScenarioAsync(string clusterConnection, Uri serviceName)
    {
        TimeSpan maxServiceStabilizationTimeout = TimeSpan.FromSeconds(180);
        PartitionSelector randomPartitionSelector = PartitionSelector.RandomOf(serviceName);

        // Create FabricClient with connection and security information here.
        FabricClient fabricClient = new FabricClient(clusterConnection);

        // The failover test scenario runs for 60 minutes or until it fails.
        TimeSpan timeToRun = TimeSpan.FromMinutes(60);
        FailoverTestScenarioParameters scenarioParameters = new FailoverTestScenarioParameters(
          randomPartitionSelector,
          timeToRun,
          maxServiceStabilizationTimeout);

        // Other related parameters:
        // Pause between two iterations for a random duration bound by this value.
        // scenarioParameters.WaitTimeBetweenIterations = TimeSpan.FromSeconds(30);
        // Pause between concurrent actions for a random duration bound by this value.
        // scenarioParameters.WaitTimeBetweenFaults = TimeSpan.FromSeconds(10);

        // Create the scenario class and execute it asynchronously.
        FailoverTestScenario failoverScenario = new FailoverTestScenario(fabricClient, scenarioParameters);

        try
        {
            await failoverScenario.ExecuteAsync(CancellationToken.None);
        }
        catch (AggregateException ae)
        {
            throw ae.InnerException;
        }
    }
}

PowerShell

$connection = "localhost:19000"
$timeToRun = 60
$maxStabilizationTimeSecs = 180
$waitTimeBetweenFaultsSec = 10
$serviceName = "fabric:/SampleApp/SampleService"

Connect-ServiceFabricCluster $connection

Invoke-ServiceFabricFailoverTestScenario -TimeToRunMinute $timeToRun -MaxServiceStabilizationTimeoutSec $maxStabilizationTimeSecs -WaitTimeBetweenFaultsSec $waitTimeBetweenFaultsSec -ServiceName $serviceName -PartitionKindSingleton