Resiliency and disaster recovery in Azure SignalR Service

Resiliency and disaster recovery are common needs for online systems. Azure SignalR Service already guarantees 99.9% availability, but it is still a regional service. Your service instance always runs in one region and won't fail over to another region when there is a region-wide outage.

Instead, the service SDK supports multiple SignalR service instances and automatically switches to other instances when some of them are unavailable. With this feature you can recover when a disaster takes place, but you need to set up the right system topology yourself. This document explains how to do so.

Highly available architecture for SignalR service

To get cross-region resiliency for SignalR service, you need to set up multiple service instances in different regions, so that when one region is down the others can be used as backup. When multiple service instances are connected to an app server, each takes one of two roles: primary or secondary. A primary is an instance that takes online traffic; a secondary is a fully functional instance that acts as a backup for the primary. In the SDK implementation, negotiate returns only primary endpoints, so in the normal case clients connect only to primary endpoints. When the primary instance is down, negotiate returns secondary endpoints so that clients can still make connections. The primary instance and the app server are connected through normal server connections, whereas the secondary instance and the app server are connected through a special type of connection called a weak connection. The main difference is that a weak connection doesn't accept client connection routing, because the secondary instance is located in another region, and routing a client to another region is not an optimal choice (it increases latency).

One service instance can have different roles when connected to multiple app servers. A typical setup for the cross-region scenario is to have two (or more) pairs of SignalR service instances and app servers. Within each pair, the app server and the SignalR service are located in the same region, and the SignalR service is connected to the app server as a primary. Across pairs, app servers and SignalR services are also connected, but a SignalR service becomes a secondary when it is connected to an app server in another region.

With this topology, a message from one server can still be delivered to all clients, because all app servers and SignalR service instances are interconnected. But once a client is connected, it is always routed to the app server in the same region to achieve optimal network latency.

The following diagram illustrates this topology:

Topology
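The role assignment described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Role`, `EndpointPlan`, and `TopologyPlanner` are hypothetical names that are not part of the SDK, and the region names in the usage note are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; they are not part of the SDK.
public enum Role { Primary, Secondary }

public record EndpointPlan(string InstanceRegion, Role Role);

public static class TopologyPlanner
{
    // For an app server in appRegion, the SignalR instance in the same region
    // is primary and every instance in another region is secondary.
    public static List<EndpointPlan> PlanFor(string appRegion, IEnumerable<string> instanceRegions) =>
        instanceRegions
            .Select(r => new EndpointPlan(r, r == appRegion ? Role.Primary : Role.Secondary))
            .ToList();
}
```

For example, `TopologyPlanner.PlanFor("region1", new[] { "region1", "region2" })` marks the region1 instance as primary and the region2 instance as secondary, while the app server in region2 would get the opposite assignment.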

Configure app servers with multiple SignalR service instances

Once you have created a SignalR service and app servers in each region, you can configure your app servers to connect to all SignalR service instances.

There are two ways to do this:

Through config

You should already know how to set the SignalR service connection string through environment variables, app settings, or web.config, in a config entry named Azure:SignalR:ConnectionString. If you have multiple endpoints, you can set them in multiple config entries, each in the following format:

Azure:SignalR:ConnectionString:<name>:<role>

Here <name> is the name of the endpoint and <role> is its role (primary or secondary). The name is optional, but it is useful if you want to further customize the routing behavior among multiple endpoints.
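As an illustration, a two-region setup might use entries like the following. The endpoint names `east` and `west` are hypothetical, and the connection strings are placeholders:

```text
Azure:SignalR:ConnectionString:east:primary   = <connection_string1>
Azure:SignalR:ConnectionString:west:secondary = <connection_string2>
```

How the entries are stored depends on the configuration source you already use (environment variables, app settings, or web.config).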

Through code

If you prefer to store the connection strings somewhere else, you can also read them in your code and pass them as parameters when calling AddAzureSignalR() (in ASP.NET Core) or MapAzureSignalR() (in ASP.NET).

Here is the sample code:

ASP.NET Core:

services.AddSignalR()
        .AddAzureSignalR(options => options.Endpoints = new ServiceEndpoint[]
        {
            new ServiceEndpoint("<connection_string1>", EndpointType.Primary, "region1"),
            new ServiceEndpoint("<connection_string2>", EndpointType.Secondary, "region2"),
        });

ASP.NET:

app.MapAzureSignalR(GetType().FullName, hub, options => options.Endpoints = new ServiceEndpoint[]
    {
        new ServiceEndpoint("<connection_string1>", EndpointType.Primary, "region1"),
        new ServiceEndpoint("<connection_string2>", EndpointType.Secondary, "region2"),
    });

You can configure multiple primary or secondary instances. If there are multiple primary and/or secondary instances, negotiate returns an endpoint in the following order:

  1. If at least one primary instance is online, return a random online primary instance.
  2. If all primary instances are down, return a random online secondary instance.
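The selection order above can be modeled as follows. This is a minimal sketch and not the SDK's actual implementation: the `Endpoint` record and `NegotiateSketch` class are hypothetical, and returning `null` when every endpoint is offline is an assumption made here for completeness.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model for illustration; the SDK's internal types differ.
public record Endpoint(string Name, bool IsPrimary, bool Online);

public static class NegotiateSketch
{
    // Returns a random online primary if one exists, otherwise a random
    // online secondary, otherwise null (assumption: no endpoint is online).
    public static Endpoint? Pick(IReadOnlyList<Endpoint> endpoints, Random rng)
    {
        var primaries = endpoints.Where(e => e.IsPrimary && e.Online).ToList();
        if (primaries.Count > 0) return primaries[rng.Next(primaries.Count)];

        var secondaries = endpoints.Where(e => !e.IsPrimary && e.Online).ToList();
        return secondaries.Count > 0 ? secondaries[rng.Next(secondaries.Count)] : null;
    }
}
```

With one offline primary and one online secondary, `Pick` returns the secondary; once the primary is marked online again, `Pick` returns the primary.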

Failover sequence and best practice

Now you have the right system topology set up. Whenever one SignalR service instance is down, online traffic is routed to the other instances. Here is what happens when a primary instance goes down (and recovers after some time):

  1. The primary service instance goes down, and all server connections on this instance are dropped.
  2. All servers connected to this instance mark it as offline, and negotiate stops returning this endpoint and starts returning the secondary endpoint.
  3. All client connections on this instance are also closed, and clients reconnect. Because app servers now return the secondary endpoint, clients connect to the secondary instance.
  4. The secondary instance now takes all online traffic. All messages from servers to clients can still be delivered, because the secondary is connected to all app servers. But client-to-server messages are routed only to the app server in the same region.
  5. After the primary instance recovers and is back online, the app server reestablishes connections to it and marks it as online. Negotiate now returns the primary endpoint again, so new clients connect to the primary. Existing clients are not dropped and continue to be routed to the secondary until they disconnect themselves.

The following diagrams illustrate how failover is done in SignalR service:

Fig.1 Before failover

Fig.2 After failover

Fig.3 Short time after primary recovers

You can see that in the normal case only the primary app server and SignalR service have online traffic (in blue). After failover, the secondary app server and SignalR service also become active. After the primary SignalR service is back online, new clients connect to the primary SignalR service. But existing clients stay connected to the secondary, so both instances have traffic. After all existing clients disconnect, your system is back to normal (Fig.1).

There are two main patterns for implementing a cross-region highly available architecture:

  1. The first is to have one pair of app server and SignalR service instance take all online traffic, and another pair act as a backup (called active/passive, illustrated in Fig.1).
  2. The other is to have two (or more) pairs of app servers and SignalR service instances, each taking part of the online traffic and serving as a backup for the other pairs (called active/active, similar to Fig.3).

SignalR service supports both patterns; the main difference is how you implement the app servers. If the app servers are active/passive, SignalR service is also active/passive (because the primary app server returns only its primary SignalR service instance). If the app servers are active/active, SignalR service is also active/active (because all app servers return their own primary SignalR instances, so all of them can get traffic).

Note that no matter which pattern you choose, you need to connect each SignalR service instance to an app server as a primary.

Also, because a SignalR connection is a long-lived connection, clients experience connection drops when a disaster and failover take place. You need to handle such cases on the client side to make them transparent to your end customers. For example, reconnect after a connection is closed.
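One way to handle this on the client is to retry the connection with a growing delay. The sketch below is a generic helper, not part of any SignalR client library: `ReconnectSketch`, `ReconnectAsync`, and its parameters are hypothetical, and `connectAsync` stands in for whatever call your client uses to start a new connection. As described in the failover sequence, a reconnecting client ends up on whichever endpoint the app server currently returns.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReconnectSketch
{
    // Calls connectAsync until it succeeds or maxAttempts is reached, doubling
    // the delay after each failure. Returns the number of attempts made; the
    // last failure is rethrown if every attempt fails.
    public static async Task<int> ReconnectAsync(
        Func<Task> connectAsync,
        int maxAttempts = 5,
        TimeSpan? initialDelay = null,
        CancellationToken cancellationToken = default)
    {
        var delay = initialDelay ?? TimeSpan.FromSeconds(1);
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await connectAsync();
                return attempt;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                await Task.Delay(delay, cancellationToken);
                delay += delay; // exponential backoff
            }
        }
    }
}
```

A client would typically call such a helper from its connection-closed handler. The attempt limit and the initial delay are illustrative defaults and should be tuned to your application.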

Next steps

In this article, you learned how to configure your application to achieve resiliency for SignalR service. To understand server/client connections and connection routing in SignalR service in more detail, read this article on SignalR service internals.

For scaling scenarios such as sharding, which use multiple instances together to handle a large number of connections, read how to scale multiple instances.