Replacing the Start Node and Stop Node APIs with the Node Transition API

What do the Stop Node and Start Node APIs do?

The Stop Node API (managed: StopNodeAsync(), PowerShell: Stop-ServiceFabricNode) stops a Service Fabric node. A Service Fabric node is a process, not a VM or machine; the VM or machine will still be running. For the rest of this document, "node" means a Service Fabric node. Stopping a node puts it into a stopped state in which it is not a member of the cluster and cannot host services, thus simulating a down node. This is useful for injecting faults into the system to test your application. The Start Node API (managed: StartNodeAsync(), PowerShell: Start-ServiceFabricNode) reverses the Stop Node API and brings the node back to a normal state.

Why are we replacing these?

As described earlier, a stopped Service Fabric node is a node that was intentionally targeted using the Stop Node API. A down node is a node that is down for any other reason (for example, the VM or machine is off). With the Stop Node API, the system does not expose information that differentiates stopped nodes from down nodes.

In addition, some errors returned by these APIs are not as descriptive as they could be. For example, invoking the Stop Node API on an already stopped node returns the error InvalidAddress. This experience could be improved.

Also, a stopped node stays stopped indefinitely until the Start Node API is invoked. We've found that this can cause problems and is error-prone. For example, we've seen cases where a user invoked the Stop Node API on a node and then forgot about it. Later, it was unclear whether the node was down or stopped.

Introducing the Node Transition APIs

We've addressed the issues above with a new set of APIs. The new Node Transition API (managed: StartNodeTransitionAsync()) can be used to transition a Service Fabric node to a stopped state, or to transition it from a stopped state to a normal up state. Note that the "Start" in the name of the API does not refer to starting a node. It refers to beginning an asynchronous operation that the system executes to transition the node to either the stopped or the started state.

Usage

If the Node Transition API does not throw an exception when invoked, the system has accepted the asynchronous operation and will execute it. A successful call does not imply that the operation has finished. To get information about the current state of the operation, call the Node Transition Progress API (managed: GetNodeTransitionProgressAsync()) with the GUID that was used when invoking the Node Transition API for this operation. The Node Transition Progress API returns a NodeTransitionProgress object. This object's State property specifies the current state of the operation. If the state is Running, the operation is executing. If it is Completed, the operation finished without error. If it is Faulted, there was a problem executing the operation, and the Result property's Exception property indicates what the issue was. See https://docs.azure.cn/dotnet/api/system.fabric.testcommandprogressstate for more information about the State property, and the "Sample Usage" section below for code examples.

Differentiating between a stopped node and a down node

If a node is stopped using the Node Transition API, the output of a node query (managed: GetNodeListAsync(), PowerShell: Get-ServiceFabricNode) shows that this node has an IsStopped property value of true. Note that this is separate from the value of the NodeStatus property, which will say Down. If NodeStatus is Down but IsStopped is false, the node was not stopped using the Node Transition API and is down for some other reason. If IsStopped is true and NodeStatus is Down, the node was stopped using the Node Transition API.

Starting a stopped node using the Node Transition API returns it to functioning as a normal member of the cluster. The output of the node query API will then show IsStopped as false and NodeStatus as something other than Down (for example, Up).
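The IsStopped and NodeStatus rules above can be sketched as a small helper. This is an illustration only: NodeStateInspector, Describe, and DescribeAsync are hypothetical names, not part of the Service Fabric API, and the wording of the returned strings is arbitrary. The sketch assumes a project that references the System.Fabric assembly. The case where IsStopped is true but NodeStatus is not Down is not described in the text, so the helper simply reports it as not down.

```csharp
using System.Fabric;
using System.Fabric.Query;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Service Fabric API.
static class NodeStateInspector
{
    // Pure decision logic, kept separate so it can be checked without a cluster.
    public static string Describe(bool isDown, bool isStopped)
    {
        if (isDown && isStopped)
        {
            return "Stopped using the Node Transition API";
        }

        if (isDown)
        {
            return "Down for some other reason";
        }

        return "Not down";
    }

    // Queries the cluster for a single node and classifies it.
    // Returns null if the query returns no node with that name.
    public static async Task<string> DescribeAsync(FabricClient fc, string nodeName)
    {
        NodeList nodes = await fc.QueryManager.GetNodeListAsync(nodeName).ConfigureAwait(false);
        Node node = nodes.FirstOrDefault();
        if (node == null)
        {
            return null;
        }

        return Describe(node.NodeStatus == NodeStatus.Down, node.IsStopped);
    }
}
```

Keeping the decision in Describe separate from the query means the rule itself can be exercised in a unit test without a running cluster.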

Limited Duration

When using the Node Transition API to stop a node, one of the required parameters, stopNodeDurationInSeconds, specifies how long, in seconds, to keep the node stopped. This value must be in the allowed range, which has a minimum of 600 and a maximum of 14400. After this time expires, the node restarts itself into the Up state automatically. Refer to Sample 1 below for an example of usage.
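Because an out-of-range duration is rejected by the system, it can be convenient to check the value on the client first. The following is a minimal sketch, not part of the Service Fabric API: StopDurationCheck and its members are hypothetical names, and the only facts taken from the text are the bounds 600 and 14400.

```csharp
using System;

// Hypothetical client-side guard; the system still performs its own validation.
static class StopDurationCheck
{
    // Bounds quoted from the text above.
    public const int MinSeconds = 600;
    public const int MaxSeconds = 14400;

    public static bool IsInRange(int seconds)
    {
        return seconds >= MinSeconds && seconds <= MaxSeconds;
    }

    // Returns the value unchanged when it is valid, otherwise throws.
    public static int Require(int seconds)
    {
        if (!IsInRange(seconds))
        {
            throw new ArgumentOutOfRangeException(
                nameof(seconds),
                seconds,
                $"Stop duration must be between {MinSeconds} and {MaxSeconds} seconds.");
        }

        return seconds;
    }
}
```

A caller could pass StopDurationCheck.Require(durationInSeconds) as the duration argument when building the NodeStopDescription, so that a bad value fails fast before any call reaches the cluster.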

Warning

Avoid mixing the Node Transition APIs with the Stop Node and Start Node APIs. The recommendation is to use the Node Transition API only. If a node has already been stopped using the Stop Node API, it should first be started using the Start Node API before you use the Node Transition APIs.

Warning

Multiple Node Transition API calls cannot be made on the same node in parallel. In that situation, the Node Transition API throws a FabricException with an ErrorCode property value of NodeTransitionInProgress. Once a node transition on a specific node has been started, wait until the operation reaches a terminal state (Completed, Faulted, or ForceCancelled) before starting a new transition on the same node. Parallel node transition calls on different nodes are allowed.
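One way to avoid issuing overlapping transitions from a single test process is to track which nodes have an operation in flight. The sketch below is an assumption-laden illustration: NodeTransitionGate is a hypothetical class, not part of the Service Fabric API, and it only guards callers within one process. The system's own NodeTransitionInProgress check still applies to calls made from anywhere else.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-process guard: allows at most one tracked transition per node name.
sealed class NodeTransitionGate
{
    private readonly ConcurrentDictionary<string, Guid> inFlight =
        new ConcurrentDictionary<string, Guid>();

    // Returns true and records the operation if no transition is tracked for this node.
    public bool TryBegin(string nodeName, Guid operationId)
    {
        return inFlight.TryAdd(nodeName, operationId);
    }

    // Call after the operation reaches a terminal state (Completed, Faulted, or ForceCancelled).
    // Returns true if an entry for the node was removed.
    public bool End(string nodeName)
    {
        return inFlight.TryRemove(nodeName, out _);
    }
}
```

A caller would invoke TryBegin before StartNodeTransitionAsync() and End once the progress API reports a terminal state; different node names never block each other, which matches the rule that parallel transitions on different nodes are allowed.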

Sample Usage

Sample 1 - The following sample uses the Node Transition API to stop a node.

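        // Namespaces used by these samples: System, System.Linq, System.Numerics, System.Threading,
        // System.Threading.Tasks, System.Fabric, System.Fabric.Description, System.Fabric.Query

        // Helper function to get information about a node
        static Node GetNodeInfo(FabricClient fc, string node)
        {
            NodeList n = null;
            while (n == null)
            {
                n = fc.QueryManager.GetNodeListAsync(node).GetAwaiter().GetResult();
                if (n == null)
                {
                    // Block for one second before retrying the query
                    Task.Delay(TimeSpan.FromSeconds(1)).GetAwaiter().GetResult();
                }
            }

            // Returns null if no node with the given name was found
            return n.FirstOrDefault();
        }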

        static async Task WaitForStateAsync(FabricClient fc, Guid operationId, TestCommandProgressState targetState)
        {
            NodeTransitionProgress progress = null;

            do
            {
                bool exceptionObserved = false;
                try
                {
                    progress = await fc.TestManager.GetNodeTransitionProgressAsync(operationId, TimeSpan.FromMinutes(1), CancellationToken.None).ConfigureAwait(false);
                }
                catch (OperationCanceledException oce)
                {
                    Console.WriteLine("Caught exception '{0}'", oce);
                    exceptionObserved = true;
                }
                catch (FabricTransientException fte)
                {
                    Console.WriteLine("Caught exception '{0}'", fte);
                    exceptionObserved = true;
                }

                if (!exceptionObserved)
                {
                    Console.WriteLine("Current state of operation '{0}': {1}", operationId, progress.State);

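                    if (progress.State == TestCommandProgressState.Faulted)
                    {
                        // Inspect the progress object's Result.Exception.HResult to get the error code.
                        Console.WriteLine("'{0}' failed with: {1}, HResult: {2}", operationId, progress.Result.Exception, progress.Result.Exception.HResult);

                        // ...additional logic as required
                    }

                    if (progress.State == targetState)
                    {
                        Console.WriteLine("Target state '{0}' has been reached", targetState);
                        break;
                    }

                    // A terminal state other than the target will never change, so stop polling.
                    if (progress.State == TestCommandProgressState.Completed ||
                        progress.State == TestCommandProgressState.Faulted ||
                        progress.State == TestCommandProgressState.ForceCancelled)
                    {
                        Console.WriteLine("Operation '{0}' ended in state '{1}' instead of '{2}'", operationId, progress.State, targetState);
                        break;
                    }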
                }

                await Task.Delay(TimeSpan.FromSeconds(5)).ConfigureAwait(false);
            }
            while (true);
        }

        static async Task StopNodeAsync(FabricClient fc, string nodeName, int durationInSeconds)
        {
            // Uses the GetNodeListAsync() API to get information about the target node
            Node n = GetNodeInfo(fc, nodeName);

            // Create a Guid
            Guid guid = Guid.NewGuid();

            // Create a NodeStopDescription object, which will be used as a parameter into StartNodeTransition
            NodeStopDescription description = new NodeStopDescription(guid, n.NodeName, n.NodeInstanceId, durationInSeconds);

            bool wasSuccessful = false;

            do
            {
                try
                {
                    // Invoke StartNodeTransitionAsync with the NodeStopDescription from above, which will stop the target node.  Retry transient errors.
                    await fc.TestManager.StartNodeTransitionAsync(description, TimeSpan.FromMinutes(1), CancellationToken.None).ConfigureAwait(false);
                    wasSuccessful = true;
                }
                catch (OperationCanceledException)
                {
                    // This is retryable
                }
                catch (FabricTransientException)
                {
                    // This is retryable
                }

                if (!wasSuccessful)
                {
                    // Back off before retrying
                    await Task.Delay(TimeSpan.FromSeconds(5)).ConfigureAwait(false);
                }
            }
            while (!wasSuccessful);

            // Now call GetNodeTransitionProgressAsync() until the desired state is reached.
            await WaitForStateAsync(fc, guid, TestCommandProgressState.Completed).ConfigureAwait(false);
        }

Sample 2 - The following sample starts a stopped node. It uses some helper methods from the first sample.

        static async Task StartNodeAsync(FabricClient fc, string nodeName)
        {
            // Uses the GetNodeListAsync() API to get information about the target node
            Node n = GetNodeInfo(fc, nodeName);

            Guid guid = Guid.NewGuid();
            BigInteger nodeInstanceId = n.NodeInstanceId;

            // Create a NodeStartDescription object, which will be used as a parameter into StartNodeTransition
            NodeStartDescription description = new NodeStartDescription(guid, n.NodeName, nodeInstanceId);

            bool wasSuccessful = false;

            do
            {
                try
                {
                    // Invoke StartNodeTransitionAsync with the NodeStartDescription from above, which will start the target stopped node.  Retry transient errors.
                    await fc.TestManager.StartNodeTransitionAsync(description, TimeSpan.FromMinutes(1), CancellationToken.None).ConfigureAwait(false);
                    wasSuccessful = true;
                }
                catch (OperationCanceledException oce)
                {
                    Console.WriteLine("Caught exception '{0}'", oce);
                }
                catch (FabricTransientException fte)
                {
                    Console.WriteLine("Caught exception '{0}'", fte);
                }

                await Task.Delay(TimeSpan.FromSeconds(5)).ConfigureAwait(false);

            }
            while (!wasSuccessful);

            // Now call GetNodeTransitionProgressAsync() until the desired state is reached.
            await WaitForStateAsync(fc, guid, TestCommandProgressState.Completed).ConfigureAwait(false);
        }

Sample 3 - The following sample shows incorrect usage. This usage is incorrect because the stopDurationInSeconds it provides is above the allowed range. Because StartNodeTransitionAsync() fails with a fatal error, the operation is not accepted, and the progress API should not be called. This sample uses some helper methods from the first sample.

        static async Task StopNodeWithOutOfRangeDurationAsync(FabricClient fc, string nodeName)
        {
            Node n = GetNodeInfo(fc, nodeName);

            Guid guid = Guid.NewGuid();

            // Use an out of range value for stopDurationInSeconds to demonstrate error
            NodeStopDescription description = new NodeStopDescription(guid, n.NodeName, n.NodeInstanceId, 99999);

            try
            {
                await fc.TestManager.StartNodeTransitionAsync(description, TimeSpan.FromMinutes(1), CancellationToken.None).ConfigureAwait(false);
            }

            catch (FabricException e)
            {
                Console.WriteLine("Caught {0}", e);
                Console.WriteLine("ErrorCode {0}", e.ErrorCode);
                // Output:
                // Caught System.Fabric.FabricException: System.Runtime.InteropServices.COMException (-2147017629)
                // StopDurationInSeconds is out of range ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C63
                // << Parts of exception omitted>>
                //
                // ErrorCode InvalidDuration
            }
        }

Sample 4 - The following sample shows the error information that is returned from the Node Transition Progress API when the operation initiated by the Node Transition API is accepted, but fails later while executing. In this case, it fails because the Node Transition API attempts to start a node that does not exist. This sample uses some helper methods from the first sample.

        static async Task StartNodeWithNonexistentNodeAsync(FabricClient fc)
        {
            Guid guid = Guid.NewGuid();
            BigInteger nodeInstanceId = 12345;

            // Intentionally use a nonexistent node
            NodeStartDescription description = new NodeStartDescription(guid, "NonexistentNode", nodeInstanceId);

            bool wasSuccessful = false;

            do
            {
                try
                {
                    // Invoke StartNodeTransitionAsync with the NodeStartDescription from above, which targets a node that does not exist.  The call is accepted; the failure surfaces later through the progress API.  Retry transient errors.
                    await fc.TestManager.StartNodeTransitionAsync(description, TimeSpan.FromMinutes(1), CancellationToken.None).ConfigureAwait(false);
                    wasSuccessful = true;
                }
                catch (OperationCanceledException oce)
                {
                    Console.WriteLine("Caught exception '{0}'", oce);
                }
                catch (FabricTransientException fte)
                {
                    Console.WriteLine("Caught exception '{0}'", fte);
                }

                await Task.Delay(TimeSpan.FromSeconds(5)).ConfigureAwait(false);

            }
            while (!wasSuccessful);

            // Now call GetNodeTransitionProgressAsync() until the desired state is reached.  In this case, it will end up in the Faulted state since the node does not exist.
            // When the progress object returned by GetNodeTransitionProgressAsync() has a State of Faulted, inspect the progress object's Result.Exception.HResult to get the error code.
            // In this case, it will be NodeNotFound.
            await WaitForStateAsync(fc, guid, TestCommandProgressState.Faulted).ConfigureAwait(false);
        }