Testability actions

To simulate an unreliable infrastructure, Azure Service Fabric gives you, the developer, ways to simulate various real-world failures and state transitions. These are exposed as testability actions: low-level APIs that each cause a specific fault injection, state transition, or validation. By combining these actions, you can write comprehensive test scenarios for your services.

Service Fabric provides some common test scenarios composed of these actions. We highly recommend that you use these built-in scenarios, which are carefully chosen to test common state transitions and failure cases. However, you can also use the actions directly to create custom test scenarios, either to cover cases that the built-in scenarios do not cover yet or to tailor a scenario to your application.

C# implementations of the actions are in the System.Fabric.dll assembly. The Service Fabric PowerShell module is in the Microsoft.ServiceFabric.Powershell.dll assembly. The ServiceFabric PowerShell module is installed as part of the runtime installation for ease of use.

Graceful vs. ungraceful fault actions

Testability actions are classified into two major buckets:

  • Ungraceful faults: These faults simulate failures like machine restarts and process crashes. In such cases, the execution context of the process stops abruptly, so no cleanup of the state can run before the application starts up again.
  • Graceful faults: These faults simulate graceful actions like replica moves and drops triggered by load balancing. In such cases, the service gets a notification of the close and can clean up its state before exiting.

For better quality validation, run the service and business workload while inducing various graceful and ungraceful faults. Ungraceful faults exercise scenarios where the service process abruptly exits in the middle of some workflow. This tests the recovery path once Service Fabric restores the service replica, and it helps verify data consistency and whether the service state is maintained correctly after failures. Graceful faults test that the service reacts correctly to replicas being moved around by Service Fabric. This tests the handling of cancellation in the RunAsync method: the service needs to check whether the cancellation token has been set, correctly save its state, and exit the RunAsync method.
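
The cancellation handling described above can be sketched without any Service Fabric dependency. In the following minimal example, the RunAsync method is a hypothetical stand-in for a service's RunAsync, and the CancellationTokenSource in Main plays the role of Service Fabric signaling a graceful close; the "save state" step is reduced to a log line. It is an illustration of the pattern under those assumptions, not the service's actual implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CancellationSketch
{
    // Hypothetical stand-in for a service's RunAsync loop (not a Service Fabric API).
    static async Task RunAsync(CancellationToken cancellationToken)
    {
        long processed = 0;
        try
        {
            while (true)
            {
                // Check the token on every iteration so a graceful close is honored promptly.
                cancellationToken.ThrowIfCancellationRequested();

                processed++; // one unit of (simulated) work

                // Pass the token to awaited calls so they stop waiting when it is set.
                await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);
            }
        }
        finally
        {
            // A real service would persist its state here; this sketch only logs it.
            Console.WriteLine("Saving state after {0} items.", processed);
        }
    }

    static async Task Main()
    {
        // Simulate the runtime requesting a close after a short time.
        using (var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(350)))
        {
            try
            {
                await RunAsync(cts.Token);
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("RunAsync exited on cancellation.");
            }
        }
    }
}
```

A loop that ignores the token, or that swallows the OperationCanceledException and keeps running, is the kind of bug that graceful fault actions are meant to expose.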

Testability actions list

| Action | Description | Managed API | PowerShell cmdlet | Graceful/ungraceful faults |
| --- | --- | --- | --- | --- |
| CleanTestState | Removes all the test state from the cluster in case of a bad shutdown of the test driver. | CleanTestStateAsync | Remove-ServiceFabricTestState | Not applicable |
| InvokeDataLoss | Induces data loss into a service partition. | InvokeDataLossAsync | Invoke-ServiceFabricPartitionDataLoss | Graceful |
| InvokeQuorumLoss | Puts a given stateful service partition into quorum loss. | InvokeQuorumLossAsync | Invoke-ServiceFabricQuorumLoss | Graceful |
| MovePrimary | Moves the specified primary replica of a stateful service to the specified cluster node. | MovePrimaryAsync | Move-ServiceFabricPrimaryReplica | Graceful |
| MoveSecondary | Moves the current secondary replica of a stateful service to a different cluster node. | MoveSecondaryAsync | Move-ServiceFabricSecondaryReplica | Graceful |
| RemoveReplica | Simulates a replica failure by removing a replica from a cluster. This closes the replica and transitions it to role 'None', removing all of its state from the cluster. | RemoveReplicaAsync | Remove-ServiceFabricReplica | Graceful |
| RestartDeployedCodePackage | Simulates a code package process failure by restarting a code package deployed on a node in a cluster. This aborts the code package process, which restarts all the user service replicas hosted in that process. | RestartDeployedCodePackageAsync | Restart-ServiceFabricDeployedCodePackage | Ungraceful |
| RestartNode | Simulates a Service Fabric cluster node failure by restarting a node. | RestartNodeAsync | Restart-ServiceFabricNode | Ungraceful |
| RestartPartition | Simulates a datacenter blackout or cluster blackout scenario by restarting some or all replicas of a partition. | RestartPartitionAsync | Restart-ServiceFabricPartition | Graceful |
| RestartReplica | Simulates a replica failure by restarting a persisted replica in a cluster, closing the replica and then reopening it. | RestartReplicaAsync | Restart-ServiceFabricReplica | Graceful |
| StartNode | Starts a node in a cluster that is already stopped. | StartNodeAsync | Start-ServiceFabricNode | Not applicable |
| StopNode | Simulates a node failure by stopping a node in a cluster. The node stays down until StartNode is called. | StopNodeAsync | Stop-ServiceFabricNode | Ungraceful |
| ValidateApplication | Validates the availability and health of all Service Fabric services within an application, usually after inducing some fault into the system. | ValidateApplicationAsync | Test-ServiceFabricApplication | Not applicable |
| ValidateService | Validates the availability and health of a Service Fabric service, usually after inducing some fault into the system. | ValidateServiceAsync | Test-ServiceFabricService | Not applicable |
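
As a sketch of how a fault action and a validation action from this list can be combined into a small custom scenario, the following example restarts the primary replica of a random partition and then validates the service. The cluster endpoint, the service name, and the 180-second stabilization timeout are assumed values. The example also assumes that the selector types and CompletionMode resolve from the System.Fabric namespace, and that ValidateServiceAsync is reached through the FabricClient.TestManager property with a (Uri, TimeSpan) overload; check both against the SDK version you build against.

```csharp
// Requires references to the Service Fabric assemblies and a reachable cluster.
// Assumption: add "using System.Fabric.Testability;" if your SDK version needs it.
using System;
using System.Fabric;
using System.Threading.Tasks;

class RestartReplicaScenario
{
    static async Task Main()
    {
        // Assumed values: adjust to your cluster and service.
        string clusterConnection = "localhost:19000";
        Uri serviceName = new Uri("fabric:/samples/PersistentToDoListApp/PersistentToDoListService");
        TimeSpan maxStabilization = TimeSpan.FromSeconds(180);

        using (FabricClient fabricClient = new FabricClient(clusterConnection))
        {
            // Pick the primary replica of a random partition of the service.
            PartitionSelector partition = PartitionSelector.RandomOf(serviceName);
            ReplicaSelector primary = ReplicaSelector.PrimaryOf(partition);

            // Fault action: restart the replica and wait until the restart is verified.
            await fabricClient.FaultManager.RestartReplicaAsync(primary, CompletionMode.Verify);

            // Validation action: check that the service is available and healthy afterwards.
            await fabricClient.TestManager.ValidateServiceAsync(serviceName, maxStabilization);
        }
    }
}
```

Running the validation after each fault, rather than once at the end, makes it easier to tell which fault the service failed to recover from.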

Running a testability action using PowerShell

This tutorial shows you how to run a testability action by using PowerShell, against either a local (one-box) cluster or an Azure cluster. The Service Fabric PowerShell module (Microsoft.ServiceFabric.Powershell.dll) is installed automatically when you install the Azure Service Fabric MSI, and it is loaded automatically when you open a PowerShell prompt.

Tutorial segments:

  • Run an action against a one-box cluster
  • Run an action against an Azure cluster

Run an action against a one-box cluster

To run a testability action against a local cluster, open a PowerShell prompt in administrator mode and connect to the cluster. Consider the Restart-ServiceFabricNode action:

Restart-ServiceFabricNode -NodeName Node1 -CompletionMode DoNotVerify

Here the action Restart-ServiceFabricNode is run on a node named "Node1". The completion mode DoNotVerify specifies that the cmdlet should not verify whether the restart-node action actually succeeded. Specifying the completion mode as Verify causes it to verify that the restart actually succeeded. Instead of specifying the node directly by its name, you can specify it via a partition key and the kind of replica, as follows:

Restart-ServiceFabricNode -ReplicaKindPrimary -PartitionKindNamed -PartitionKey Partition3 -CompletionMode Verify

# Connect to the local cluster, then restart a node by name.
$connection = "localhost:19000"
$nodeName = "Node1"

Connect-ServiceFabricCluster $connection
Restart-ServiceFabricNode -NodeName $nodeName -CompletionMode DoNotVerify

Use Restart-ServiceFabricNode to restart a Service Fabric node in a cluster. This stops the Fabric.exe process, which restarts all of the system service and user service replicas hosted on that node. Because it simulates a node failure in the cluster, testing your service with this API helps uncover bugs along the failover recovery paths.

The following screenshot shows the Restart-ServiceFabricNode testability command in action.

The output of the first Get-ServiceFabricNode (a cmdlet from the Service Fabric PowerShell module) shows that the local cluster has five nodes: Node.1 to Node.5. After the testability action (cmdlet) Restart-ServiceFabricNode is executed on the node named Node.4, the node's uptime has been reset.

Run an action against an Azure cluster

Running a testability action (by using PowerShell) against an Azure cluster is similar to running the action against a local cluster. The only difference is that, before you run the action, you connect to the Azure cluster instead of the local cluster.

Running a testability action using C#

To run a testability action by using C#, first connect to the cluster by using FabricClient. Then obtain the parameters needed to run the action. The same action can be run with different parameters. Taking the RestartNode action (RestartNodeAsync) as an example, one way to run it is by using the node information (node name and node instance ID) in the cluster.

RestartNodeAsync(nodeName, nodeInstanceId, completeMode, operationTimeout, CancellationToken.None)

Parameter explanation:

  • completeMode (a CompletionMode value) specifies whether the action verifies that the restart actually succeeded. DoNotVerify skips the check; Verify causes the action to verify that the restart actually succeeded.
  • operationTimeout sets the amount of time allowed for the operation to finish before a TimeoutException is thrown.
  • The CancellationToken enables a pending call to be canceled.
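
The following sketch shows one way these three parameters might be supplied together, using the five-argument call shown above. The endpoint, node name, node instance ID, and the two-minute timeout are assumed values; a real test would read the node instance ID from the cluster. The CancellationTokenSource is created only to show where a caller would hook in cancellation.

```csharp
// Requires references to the Service Fabric assemblies and a reachable cluster.
using System;
using System.Fabric;
using System.Numerics;
using System.Threading;
using System.Threading.Tasks;

class RestartNodeWithTimeout
{
    static async Task Main()
    {
        // Assumed values: adjust to your cluster.
        string clusterConnection = "localhost:19000";
        string nodeName = "N0040";
        BigInteger nodeInstanceId = 130743013389060139;

        // Give up if the restart has not finished within two minutes.
        TimeSpan operationTimeout = TimeSpan.FromMinutes(2);

        using (var cts = new CancellationTokenSource())
        using (var fabricClient = new FabricClient(clusterConnection))
        {
            try
            {
                // Calling cts.Cancel() from other code would cancel this pending call.
                await fabricClient.FaultManager.RestartNodeAsync(
                    nodeName,
                    nodeInstanceId,
                    CompletionMode.Verify,
                    operationTimeout,
                    cts.Token);
                Console.WriteLine("RestartNode completed and was verified.");
            }
            catch (TimeoutException)
            {
                Console.WriteLine("RestartNode did not finish within {0}.", operationTimeout);
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("RestartNode was canceled.");
            }
        }
    }
}
```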

Instead of specifying the node directly by its name, you can specify it via a partition key and the kind of replica.

For further information, see PartitionSelector and ReplicaSelector.

// Add a reference to System.Fabric.Testability.dll and System.Fabric.dll
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Fabric.Testability;
using System.Fabric;
using System.Threading;
using System.Numerics;

class Test
{
    public static int Main(string[] args)
    {
        string clusterConnection = "localhost:19000";
        Uri serviceName = new Uri("fabric:/samples/PersistentToDoListApp/PersistentToDoListService");
        string nodeName = "N0040";
        BigInteger nodeInstanceId = 130743013389060139;

        Console.WriteLine("Starting RestartNode test");
        try
        {
            //Restart the node by using ReplicaSelector
            RestartNodeAsync(clusterConnection, serviceName).Wait();

            //Another way to restart node is by using nodeName and nodeInstanceId
            RestartNodeAsync(clusterConnection, nodeName, nodeInstanceId).Wait();
        }
        catch (AggregateException exAgg)
        {
            Console.WriteLine("RestartNode did not complete: ");
            foreach (Exception ex in exAgg.InnerExceptions)
            {
                if (ex is FabricException)
                {
                    Console.WriteLine("HResult: {0} Message: {1}", ex.HResult, ex.Message);
                }
            }
            return -1;
        }

        Console.WriteLine("RestartNode completed.");
        return 0;
    }

    // Restarts the node that hosts the primary replica of a random partition of the service.
    static async Task RestartNodeAsync(string clusterConnection, Uri serviceName)
    {
        PartitionSelector randomPartitionSelector = PartitionSelector.RandomOf(serviceName);
        ReplicaSelector primaryOfReplicaSelector = ReplicaSelector.PrimaryOf(randomPartitionSelector);

        // Create FabricClient with connection and security information here
        using (FabricClient fabricClient = new FabricClient(clusterConnection))
        {
            await fabricClient.FaultManager.RestartNodeAsync(primaryOfReplicaSelector, CompletionMode.Verify);
        }
    }

    // Restarts the node identified by its name and instance ID.
    static async Task RestartNodeAsync(string clusterConnection, string nodeName, BigInteger nodeInstanceId)
    {
        // Create FabricClient with connection and security information here
        using (FabricClient fabricClient = new FabricClient(clusterConnection))
        {
            await fabricClient.FaultManager.RestartNodeAsync(nodeName, nodeInstanceId, CompletionMode.Verify);
        }
    }
}
}

PartitionSelector and ReplicaSelector

PartitionSelector

PartitionSelector is a helper exposed in testability that selects the specific partition on which to perform a testability action. It can select a specific partition if the partition ID is known beforehand. Alternatively, you can provide the partition key, and the operation resolves the partition ID internally. You also have the option of selecting a random partition.

To use this helper, create the PartitionSelector object by using one of its static factory methods (RandomOf, PartitionIdOf, or PartitionKeyOf). Then pass the PartitionSelector object to the API that requires it. If no option is selected, it defaults to a random partition.

Uri serviceName = new Uri("fabric:/samples/InMemoryToDoListApp/InMemoryToDoListService");
Guid partitionIdGuid = new Guid("8fb7ebcc-56ee-4862-9cc0-7c6421e68829");
string partitionName = "Partition1";
Int64 partitionKeyUniformInt64 = 1;

// Select a random partition
PartitionSelector randomPartitionSelector = PartitionSelector.RandomOf(serviceName);

// Select a partition based on ID
PartitionSelector partitionSelectorById = PartitionSelector.PartitionIdOf(serviceName, partitionIdGuid);

// Select a partition based on name
PartitionSelector namedPartitionSelector = PartitionSelector.PartitionKeyOf(serviceName, partitionName);

// Select a partition based on partition key
PartitionSelector uniformIntPartitionSelector = PartitionSelector.PartitionKeyOf(serviceName, partitionKeyUniformInt64);
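
As a sketch of passing a selector to an API that requires it, the following example moves the primary replica of a named partition to a specific node by using the MovePrimary action from the actions list. The endpoint and the target node name "Node2" are hypothetical, and the (nodeName, partitionSelector) parameter order of MovePrimaryAsync is an assumption to confirm against your SDK version.

```csharp
// Requires references to the Service Fabric assemblies and a reachable cluster.
using System;
using System.Fabric;
using System.Threading.Tasks;

class MovePrimaryByPartitionKey
{
    static async Task Main()
    {
        // Assumed values: adjust to your cluster and service.
        string clusterConnection = "localhost:19000";
        string targetNodeName = "Node2"; // hypothetical node name
        Uri serviceName = new Uri("fabric:/samples/InMemoryToDoListApp/InMemoryToDoListService");

        // Select the partition by its name; the partition ID is resolved internally.
        PartitionSelector namedPartitionSelector =
            PartitionSelector.PartitionKeyOf(serviceName, "Partition1");

        using (FabricClient fabricClient = new FabricClient(clusterConnection))
        {
            // Graceful fault: move the primary replica of that partition to the target node.
            await fabricClient.FaultManager.MovePrimaryAsync(targetNodeName, namedPartitionSelector);
        }
    }
}
```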

ReplicaSelector

ReplicaSelector is a helper exposed in testability that selects the replica on which to perform a testability action. It can select a specific replica if the replica ID is known beforehand. In addition, you have the option of selecting the primary replica or a random secondary. A ReplicaSelector is built from a PartitionSelector, so you need to select both the replica and the partition on which you want to perform the testability operation.

To use this helper, create a ReplicaSelector object and set the way you want to select the replica and the partition. You can then pass it to the API that requires it. If no option is selected, it defaults to a random replica and random partition.

Uri serviceName = new Uri("fabric:/samples/InMemoryToDoListApp/InMemoryToDoListService");
Guid partitionIdGuid = new Guid("8fb7ebcc-56ee-4862-9cc0-7c6421e68829");
PartitionSelector partitionSelector = PartitionSelector.PartitionIdOf(serviceName, partitionIdGuid);
long replicaId = 130559876481875498;

// Select a random replica
ReplicaSelector randomReplicaSelector = ReplicaSelector.RandomOf(partitionSelector);

// Select the primary replica
ReplicaSelector primaryReplicaSelector = ReplicaSelector.PrimaryOf(partitionSelector);

// Select the replica by ID
ReplicaSelector replicaByIdSelector = ReplicaSelector.ReplicaIdOf(partitionSelector, replicaId);

// Select a random secondary replica
ReplicaSelector secondaryReplicaSelector = ReplicaSelector.RandomSecondaryOf(partitionSelector);

Next steps