Induce controlled Chaos in Service Fabric clusters

Large-scale distributed systems like cloud infrastructures are inherently unreliable. Azure Service Fabric enables developers to write reliable distributed services on top of an unreliable infrastructure. To write robust distributed services on top of an unreliable infrastructure, developers need to be able to test the stability of their services while the underlying infrastructure is going through complicated state transitions due to faults.

The Fault Injection and Cluster Analysis Service (also known as the Fault Analysis Service) gives developers the ability to induce faults to test their services. These targeted simulated faults, like restarting a partition, can help exercise the most common state transitions. However, targeted simulated faults are biased by definition and thus may miss bugs that show up only in hard-to-predict, long, and complicated sequences of state transitions. For unbiased testing, you can use Chaos.

Chaos simulates periodic, interleaved faults (both graceful and ungraceful) throughout the cluster over extended periods of time. A graceful fault consists of a set of Service Fabric API calls. For example, the restart replica fault is graceful because it is a close followed by an open on a replica. Remove replica, move primary replica, and move secondary replica are the other graceful faults exercised by Chaos. Ungraceful faults are process exits, like restart node and restart code package.

Once you have configured Chaos with the rate and the kind of faults, you can start Chaos through C#, PowerShell, or the REST API to begin generating faults in the cluster and in your services. You can configure Chaos to run for a specified time period (for example, for one hour), after which Chaos stops automatically, or you can call the StopChaos API (C#, PowerShell, or REST) to stop it at any time.
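As a minimal sketch of stopping a run early from C#, the snippet below calls StopChaosAsync on the client's TestManager. It assumes the Service Fabric SDK assemblies are referenced and that a local development cluster is listening on localhost:19000 (the same endpoint used in the full sample in this article); adjust both for your environment.

```csharp
using System;
using System.Fabric;
using System.Threading.Tasks;

class StopChaosExample
{
    static async Task Main()
    {
        // Assumption: a local development cluster on the default client endpoint.
        using (var client = new FabricClient("localhost:19000"))
        {
            // Ask Chaos to stop; this can be called at any time during a run.
            await client.TestManager.StopChaosAsync();
            Console.WriteLine("Stop request sent to Chaos.");
        }
    }
}
```

After the stop takes effect, the Chaos report should contain a StoppedEvent, which is what the polling loops in the full samples look for.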

Note

In its current form, Chaos induces only safe faults, which implies that, in the absence of external faults, quorum loss or data loss never occurs.

While Chaos is running, it produces different events that capture the state of the run at that moment. For example, an ExecutingFaultsEvent contains all the faults that Chaos has decided to execute in that iteration. A ValidationFailedEvent contains the details of a validation failure (health or stability issues) that was found during the validation of the cluster. You can invoke the GetChaosReport API (C#, PowerShell, or REST) to get the report of Chaos runs. These events are persisted in a reliable dictionary, which has a truncation policy dictated by two configurations: MaxStoredChaosEventCount (default value is 25000) and StoredActionCleanupIntervalInSeconds (default value is 3600). Every StoredActionCleanupIntervalInSeconds, Chaos checks the dictionary and purges all but the most recent MaxStoredChaosEventCount events.
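The truncation rule can be pictured with a small, self-contained model. This is only an illustration of "keep the most recent MaxStoredChaosEventCount events", not Service Fabric's implementation; the ChaosEventStoreModel class and all of its members are hypothetical names introduced for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the persisted Chaos event store.
public class ChaosEventStoreModel
{
    // Events ordered by their UTC timestamp, oldest first.
    private readonly SortedDictionary<DateTime, string> events =
        new SortedDictionary<DateTime, string>();

    public int MaxStoredChaosEventCount { get; }

    public ChaosEventStoreModel(int maxStoredChaosEventCount)
    {
        MaxStoredChaosEventCount = maxStoredChaosEventCount;
    }

    public int Count => events.Count;

    public DateTime? OldestTimeStampUtc =>
        events.Count == 0 ? (DateTime?)null : events.Keys.First();

    public void Add(DateTime timeStampUtc, string description)
    {
        events[timeStampUtc] = description;
    }

    // Models one cleanup pass: purge everything except the newest
    // MaxStoredChaosEventCount events. Returns the number purged.
    public int Cleanup()
    {
        int excess = events.Count - MaxStoredChaosEventCount;
        if (excess <= 0)
        {
            return 0;
        }

        List<DateTime> oldestKeys = events.Keys.Take(excess).ToList();
        foreach (DateTime key in oldestKeys)
        {
            events.Remove(key);
        }

        return excess;
    }
}
```

In this model, a store capped at 3 events that receives 5 events purges the 2 oldest on the next Cleanup call. The practical consequence for a real cluster is that reports for long runs should be read before older events age out.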

Faults induced in Chaos

Chaos generates faults across the entire Service Fabric cluster and compresses faults that are seen in months or years into a few hours. The combination of interleaved faults with a high fault rate finds corner cases that may otherwise be missed. This exercise of Chaos leads to a significant improvement in the code quality of the service.

Chaos induces faults from the following categories:

  • Restart a node
  • Restart a deployed code package
  • Remove a replica
  • Restart a replica
  • Move a primary replica (configurable)
  • Move a secondary replica (configurable)

Chaos runs in multiple iterations. Each iteration consists of faults and cluster validation for the specified period. You can configure the time spent for the cluster to stabilize and for validation to succeed. If a failure is found in cluster validation, Chaos generates and persists a ValidationFailedEvent with the UTC timestamp and the failure details. For example, consider an instance of Chaos that is set to run for an hour with a maximum of three concurrent faults. Chaos induces three faults, and then validates the cluster health. It repeats this step until it is explicitly stopped through the StopChaosAsync API or one hour passes. If the cluster becomes unhealthy in any iteration (that is, it does not stabilize or does not become healthy within the passed-in MaxClusterStabilizationTimeout), Chaos generates a ValidationFailedEvent. This event indicates that something has gone wrong and might need further investigation.

To find out which faults Chaos induced, you can use the GetChaosReport API (PowerShell, C#, or REST). The API gets the next segment of the Chaos report based on the passed-in continuation token or the passed-in time range. You can either specify the ContinuationToken to get the next segment of the Chaos report, or specify the time range through StartTimeUtc and EndTimeUtc, but you cannot specify both the ContinuationToken and the time range in the same call. When there are more than 100 Chaos events, the Chaos report is returned in segments, where each segment contains no more than 100 Chaos events.
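The segmenting rule can be sketched as a helper that reads a whole time range and keeps only the validation failures. ChaosReportReader and GetValidationFailuresAsync are hypothetical names; the TestManager calls are the same ones used in the full sample in this article. The sketch assumes that an empty ContinuationToken means there are no further segments, and it omits the retry handling for transient and timeout exceptions that the full sample shows.

```csharp
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Chaos.DataStructures;
using System.Linq;
using System.Threading.Tasks;

static class ChaosReportReader
{
    // Reads every segment of the report for [startTimeUtc, endTimeUtc]
    // and returns only the ValidationFailedEvent entries.
    public static async Task<List<ValidationFailedEvent>> GetValidationFailuresAsync(
        FabricClient client, DateTime startTimeUtc, DateTime endTimeUtc)
    {
        var failures = new List<ValidationFailedEvent>();
        var filter = new ChaosReportFilter(startTimeUtc, endTimeUtc);
        string continuationToken = null;

        do
        {
            // First call: time range only. Later calls: continuation token only.
            // The two cannot be combined in a single call.
            ChaosReport report = string.IsNullOrEmpty(continuationToken)
                ? await client.TestManager.GetChaosReportAsync(filter)
                : await client.TestManager.GetChaosReportAsync(continuationToken);

            failures.AddRange(report.History.OfType<ValidationFailedEvent>());
            continuationToken = report.ContinuationToken;
        }
        while (!string.IsNullOrEmpty(continuationToken)); // Assumption: empty token = last segment.

        return failures;
    }
}
```

A caller might pass the UTC time at which Chaos was started and DateTime.UtcNow, then print each returned event to see which iterations failed validation.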

Important configuration options

  • TimeToRun: Total time that Chaos runs before it finishes with success. You can stop Chaos before it has run for the TimeToRun period through the StopChaos API.

  • MaxClusterStabilizationTimeout: The maximum amount of time to wait for the cluster to become healthy before producing a ValidationFailedEvent. This wait is to reduce the load on the cluster while it is recovering. The checks performed are:

    • If the cluster health is OK
    • If the service health is OK
    • If the target replica set size is achieved for the service partition
    • That no InBuild replicas exist
  • MaxConcurrentFaults: The maximum number of concurrent faults that are induced in each iteration. The higher the number, the more aggressive Chaos is, and the more complex the failovers and state transition combinations that the cluster goes through.

Note

Regardless of how high a value MaxConcurrentFaults has, Chaos guarantees that, in the absence of external faults, there is no quorum loss or data loss.

  • EnableMoveReplicaFaults: Enables or disables the faults that cause the primary or secondary replicas to move. These faults are enabled by default.
  • WaitTimeBetweenIterations: The amount of time to wait between iterations. That is, the amount of time Chaos pauses after having executed a round of faults and having finished the corresponding validation of the health of the cluster. The higher the value, the lower the average fault injection rate.
  • WaitTimeBetweenFaults: The amount of time to wait between two consecutive faults in a single iteration. The higher the value, the lower the concurrency of (or the overlap between) faults.
  • ClusterHealthPolicy: The cluster health policy is used to validate the health of the cluster in between Chaos iterations. If the cluster health is in error, or if an unexpected exception happens during fault execution, Chaos waits for 30 minutes before the next health check to give the cluster some time to recuperate.
  • Context: A collection of (string, string) type key-value pairs. The map can be used to record information about the Chaos run. There cannot be more than 100 such pairs, and each string (key or value) can be at most 4095 characters long. This map is set by the starter of the Chaos run to optionally store context about the specific run.
  • ChaosTargetFilter: This filter can be used to target Chaos faults only to certain node types or only to certain application instances. If ChaosTargetFilter is not used, Chaos faults all cluster entities. If ChaosTargetFilter is used, Chaos faults only the entities that meet the ChaosTargetFilter specification. NodeTypeInclusionList and ApplicationInclusionList allow union semantics only. In other words, it is not possible to specify an intersection of NodeTypeInclusionList and ApplicationInclusionList. For example, it is not possible to specify "fault this application only when it is on that node type." Once an entity is included in either NodeTypeInclusionList or ApplicationInclusionList, that entity cannot be excluded using ChaosTargetFilter. Even if applicationX does not appear in ApplicationInclusionList, applicationX can be faulted in some Chaos iteration because it happens to be on a node of nodeTypeY that is included in NodeTypeInclusionList. If both NodeTypeInclusionList and ApplicationInclusionList are null or empty, an ArgumentException is thrown.
    • NodeTypeInclusionList: A list of node types to include in Chaos faults. All types of faults (restart node, restart code package, remove replica, restart replica, move primary, and move secondary) are enabled for the nodes of these node types. If a node type (say NodeTypeX) does not appear in the NodeTypeInclusionList, then node-level faults (like NodeRestart) are never enabled for the nodes of NodeTypeX, but code package and replica faults can still be enabled for NodeTypeX if an application in the ApplicationInclusionList happens to reside on a node of NodeTypeX. At most 100 node type names can be included in this list; to increase this number, a config upgrade is required for the MaxNumberOfNodeTypesInChaosTargetFilter configuration.
    • ApplicationInclusionList: A list of application URIs to include in Chaos faults. All replicas belonging to services of these applications are amenable to replica faults (restart replica, remove replica, move primary, and move secondary) by Chaos. Chaos may restart a code package only if the code package hosts replicas of these applications only. If an application does not appear in this list, it can still be faulted in some Chaos iteration if the application ends up on a node of a node type that is included in NodeTypeInclusionList. However, if applicationX is tied to nodeTypeY through placement constraints, applicationX is absent from ApplicationInclusionList, and nodeTypeY is absent from NodeTypeInclusionList, then applicationX is never faulted. At most 1000 application names can be included in this list; to increase this number, a config upgrade is required for the MaxNumberOfApplicationsInChaosTargetFilter configuration.
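The union semantics of ChaosTargetFilter can be easier to reason about as a predicate. The following is a hypothetical model, not part of the Service Fabric API: TargetFilterModel and its methods are names invented for this sketch, name matching is assumed to be case-sensitive, and the extra rule for code package restarts (the package must host replicas of listed applications only) is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the inclusion rules; not a Service Fabric type.
public class TargetFilterModel
{
    public HashSet<string> NodeTypeInclusionList { get; } = new HashSet<string>();
    public HashSet<string> ApplicationInclusionList { get; } = new HashSet<string>();

    // Mirrors the documented rule that both lists cannot be empty.
    private void EnsureNotEmpty()
    {
        if (NodeTypeInclusionList.Count == 0 && ApplicationInclusionList.Count == 0)
        {
            throw new ArgumentException("At least one inclusion list must be non-empty.");
        }
    }

    // Node-level faults (such as restart node) depend only on the node type.
    public bool CanRestartNode(string nodeType)
    {
        EnsureNotEmpty();
        return NodeTypeInclusionList.Contains(nodeType);
    }

    // Replica faults apply if EITHER the hosting node's type OR the
    // application is listed (union, never intersection).
    public bool CanFaultReplica(string application, string nodeType)
    {
        EnsureNotEmpty();
        return NodeTypeInclusionList.Contains(nodeType)
            || ApplicationInclusionList.Contains(application);
    }
}
```

With "FrontEndType" and "fabric:/TestApp2" in the two lists, a replica of an unlisted application on a FrontEndType node can still be faulted, while a replica of an unlisted application on an unlisted node type cannot.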

How to run Chaos

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Fabric;
using System.Fabric.Chaos.DataStructures;
using System.Fabric.Health; // ClusterHealthPolicy
using System.Linq;          // LastOrDefault
using System.Threading.Tasks;

class Program
{
    private class ChaosEventComparer : IEqualityComparer<ChaosEvent>
    {
        public bool Equals(ChaosEvent x, ChaosEvent y)
        {
            return x.TimeStampUtc.Equals(y.TimeStampUtc);
        }
        public int GetHashCode(ChaosEvent obj)
        {
            return obj.TimeStampUtc.GetHashCode();
        }
    }

    static void Main(string[] args)
    {
        var clusterConnectionString = "localhost:19000";
        using (var client = new FabricClient(clusterConnectionString))
        {
            var startTimeUtc = DateTime.UtcNow;

            // The maximum amount of time to wait for all cluster entities to become stable and healthy. 
            // Chaos executes in iterations and at the start of each iteration it validates the health of cluster
            // entities. 
            // During validation if a cluster entity is not stable and healthy within
            // MaxClusterStabilizationTimeoutInSeconds, Chaos generates a validation failed event.
            var maxClusterStabilizationTimeout = TimeSpan.FromSeconds(30.0);

            var timeToRun = TimeSpan.FromMinutes(60.0);

            // MaxConcurrentFaults is the maximum number of concurrent faults induced per iteration. 
            // Chaos executes in iterations and two consecutive iterations are separated by a validation phase.
            // The higher the concurrency, the more aggressive the injection of faults -- inducing more complex
            // series of states to uncover bugs.
            // The recommendation is to start with a value of 2 or 3 and to exercise caution while moving up.
            var maxConcurrentFaults = 3;

            // Describes a map, which is a collection of (string, string) type key-value pairs. The map can be
            // used to record information about the Chaos run. There cannot be more than 100 such pairs and
            // each string (key or value) can be at most 4095 characters long.
            // This map is set by the starter of the Chaos run to optionally store the context about the specific run.
            var startContext = new Dictionary<string, string>{{"ReasonForStart", "Testing"}};

            // Time-separation (in seconds) between two consecutive iterations of Chaos. The larger the value, the
            // lower the fault injection rate.
            var waitTimeBetweenIterations = TimeSpan.FromSeconds(10);

            // Wait time (in seconds) between consecutive faults within a single iteration.
            // The larger the value, the lower the overlapping between faults and the simpler the sequence of
            // state transitions that the cluster goes through. 
            // The recommendation is to start with a value between 1 and 5 and exercise caution while moving up.
            var waitTimeBetweenFaults = TimeSpan.Zero;

            // Passed-in cluster health policy is used to validate health of the cluster in between Chaos iterations. 
            var clusterHealthPolicy = new ClusterHealthPolicy
            {
                ConsiderWarningAsError = false,
                MaxPercentUnhealthyApplications = 100,
                MaxPercentUnhealthyNodes = 100
            };

            // All types of faults, restart node, restart code package, restart replica, move primary replica,
            // and move secondary replica will happen for nodes of type 'FrontEndType'
            var nodetypeInclusionList = new List<string> { "FrontEndType"};

            // In addition to the faults included by nodetypeInclusionList,
            // restart code package, restart replica, move primary replica, move secondary replica faults will
            // happen for 'fabric:/TestApp2' even if a replica or code package from 'fabric:/TestApp2' is residing
            // on a node whose type is not included in nodetypeInclusionList.
            var applicationInclusionList = new List<string> { "fabric:/TestApp2" };

            // List of cluster entities to target for Chaos faults.
            var chaosTargetFilter = new ChaosTargetFilter
            {
                NodeTypeInclusionList = nodetypeInclusionList,
                ApplicationInclusionList = applicationInclusionList
            };

            var parameters = new ChaosParameters(
                maxClusterStabilizationTimeout,
                maxConcurrentFaults,
                true, /* EnableMoveReplicaFault */
                timeToRun,
                startContext,
                waitTimeBetweenIterations,
                waitTimeBetweenFaults,
                clusterHealthPolicy) {ChaosTargetFilter = chaosTargetFilter};

            try
            {
                client.TestManager.StartChaosAsync(parameters).GetAwaiter().GetResult();
            }
            catch (FabricChaosAlreadyRunningException)
            {
                Console.WriteLine("An instance of Chaos is already running in the cluster.");
            }

            var filter = new ChaosReportFilter(startTimeUtc, DateTime.MaxValue);

            var eventSet = new HashSet<ChaosEvent>(new ChaosEventComparer());

            string continuationToken = null;

            while (true)
            {
                ChaosReport report;
                try
                {
                    report = string.IsNullOrEmpty(continuationToken)
                        ? client.TestManager.GetChaosReportAsync(filter).GetAwaiter().GetResult()
                        : client.TestManager.GetChaosReportAsync(continuationToken).GetAwaiter().GetResult();
                }
                catch (Exception e)
                {
                    if (e is FabricTransientException)
                    {
                        Console.WriteLine("A transient exception happened: '{0}'", e);
                    }
                    else if(e is TimeoutException)
                    {
                        Console.WriteLine("A timeout exception happened: '{0}'", e);
                    }
                    else
                    {
                        throw;
                    }

                    Task.Delay(TimeSpan.FromSeconds(1.0)).GetAwaiter().GetResult();
                    continue;
                }

                continuationToken = report.ContinuationToken;

                foreach (var chaosEvent in report.History)
                {
                    if (eventSet.Add(chaosEvent))
                    {
                        Console.WriteLine(chaosEvent);
                    }
                }

                // When Chaos stops, a StoppedEvent is created.
                // If a StoppedEvent is found, exit the loop.
                var lastEvent = report.History.LastOrDefault();

                if (lastEvent is StoppedEvent)
                {
                    break;
                }

                Task.Delay(TimeSpan.FromSeconds(1.0)).GetAwaiter().GetResult();
            }
        }
    }
}
$clusterConnectionString = "localhost:19000"
$timeToRunMinute = 60

# The maximum amount of time to wait for all cluster entities to become stable and healthy.
# Chaos executes in iterations and at the start of each iteration it validates the health of cluster entities.
# During validation if a cluster entity is not stable and healthy within MaxClusterStabilizationTimeoutInSeconds,
# Chaos generates a validation failed event.
$maxClusterStabilizationTimeSecs = 30

# MaxConcurrentFaults is the maximum number of concurrent faults induced per iteration.
# Chaos executes in iterations and two consecutive iterations are separated by a validation phase.
# The higher the concurrency, the more aggressive the injection of faults -- inducing more complex series of
# states to uncover bugs.
# The recommendation is to start with a value of 2 or 3 and to exercise caution while moving up.
$maxConcurrentFaults = 3

# Time-separation (in seconds) between two consecutive iterations of Chaos. The larger the value, the lower the
# fault injection rate.
$waitTimeBetweenIterationsSec = 10

# Wait time (in seconds) between consecutive faults within a single iteration.
# The larger the value, the lower the overlapping between faults and the simpler the sequence of state
# transitions that the cluster goes through.
# The recommendation is to start with a value between 1 and 5 and exercise caution while moving up.
$waitTimeBetweenFaultsSec = 0

# Passed-in cluster health policy is used to validate health of the cluster in between Chaos iterations. 
$clusterHealthPolicy = new-object -TypeName System.Fabric.Health.ClusterHealthPolicy
$clusterHealthPolicy.MaxPercentUnhealthyNodes = 100
$clusterHealthPolicy.MaxPercentUnhealthyApplications = 100
$clusterHealthPolicy.ConsiderWarningAsError = $False

# Describes a map, which is a collection of (string, string) type key-value pairs. The map can be used to record
# information about the Chaos run.
# There cannot be more than 100 such pairs and each string (key or value) can be at most 4095 characters long.
# This map is set by the starter of the Chaos run to optionally store the context about the specific run.
$context = @{"ReasonForStart" = "Testing"}

#List of cluster entities to target for Chaos faults.
$chaosTargetFilter = new-object -TypeName System.Fabric.Chaos.DataStructures.ChaosTargetFilter
$chaosTargetFilter.NodeTypeInclusionList = new-object -TypeName "System.Collections.Generic.List[String]"

# All types of faults, restart node, restart code package, restart replica, move primary replica, and move
# secondary replica will happen for nodes of type 'FrontEndType'
$chaosTargetFilter.NodeTypeInclusionList.AddRange( [string[]]@("FrontEndType") )
$chaosTargetFilter.ApplicationInclusionList = new-object -TypeName "System.Collections.Generic.List[String]"

# In addition to the faults included by NodeTypeInclusionList,
# restart code package, restart replica, move primary replica, move secondary replica faults will happen for
# 'fabric:/TestApp2' even if a replica or code package from 'fabric:/TestApp2' is residing on a node whose
# type is not included in NodeTypeInclusionList.
$chaosTargetFilter.ApplicationInclusionList.Add("fabric:/TestApp2")

Connect-ServiceFabricCluster $clusterConnectionString

$events = @{}
$now = [System.DateTime]::UtcNow

Start-ServiceFabricChaos -TimeToRunMinute $timeToRunMinute -MaxConcurrentFaults $maxConcurrentFaults -MaxClusterStabilizationTimeoutSec $maxClusterStabilizationTimeSecs -EnableMoveReplicaFaults -WaitTimeBetweenIterationsSec $waitTimeBetweenIterationsSec -WaitTimeBetweenFaultsSec $waitTimeBetweenFaultsSec -ClusterHealthPolicy $clusterHealthPolicy -ChaosTargetFilter $chaosTargetFilter -Context $context

while($true)
{
    $stopped = $false
    $report = Get-ServiceFabricChaosReport -StartTimeUtc $now -EndTimeUtc ([System.DateTime]::MaxValue)

    foreach ($e in $report.History) {

        if(-Not ($events.Contains($e.TimeStampUtc.Ticks)))
        {
            $events.Add($e.TimeStampUtc.Ticks, $e)
            if($e -is [System.Fabric.Chaos.DataStructures.ValidationFailedEvent])
            {
                Write-Host -BackgroundColor White -ForegroundColor Red $e
            }
            else
            {
                Write-Host $e
                # When Chaos stops, a StoppedEvent is created.
                # If a StoppedEvent is found, exit the loop.
                if($e -is [System.Fabric.Chaos.DataStructures.StoppedEvent])
                {
                    return
                }
            }
        }
    }

    Start-Sleep -Seconds 1
}