Azure Service Bus Geo-disaster recovery

When an entire Azure region or datacenter experiences downtime, it is critical that data processing can continue in a different region or datacenter. Geo-disaster recovery is therefore an important feature for any enterprise. Azure Service Bus supports geo-disaster recovery at the namespace level.

The Geo-disaster recovery feature is globally available for the Service Bus Premium SKU.

Note

Geo-disaster recovery currently ensures only that the metadata (queues, topics, subscriptions, and filters) is copied from the primary namespace to the secondary namespace when the two are paired.

Outages and disasters

It's important to note the distinction between "outages" and "disasters."

An outage is the temporary unavailability of Azure Service Bus. It can affect some components of the service, such as a messaging store, or even the entire datacenter. After the problem is fixed, however, Service Bus becomes available again. Typically, an outage does not cause the loss of messages or other data. An example of such an outage is a power failure in the datacenter. Some outages are only short connection losses caused by transient faults or network issues.

A disaster is defined as the permanent or longer-term loss of a Service Bus cluster, Azure region, or datacenter. The region or datacenter may or may not become available again, or may be down for hours or days. Examples of such disasters are fire, flooding, and earthquake. A disaster that becomes permanent might cause the loss of some messages, events, or other data. In most cases, however, there should be no data loss, and messages can be recovered once the datacenter is back up.

The Geo-disaster recovery feature of Azure Service Bus is a disaster recovery solution. The concepts and workflow described in this article apply to disaster scenarios, not to transient or temporary outages.

Basic concepts and terms

The disaster recovery feature implements metadata disaster recovery and relies on primary and secondary disaster recovery namespaces. The Geo-disaster recovery feature is available for the Premium SKU only. You do not need to change any connection strings, because the connection is made through an alias.

The following terms are used in this article:

  • Alias: The name for a disaster recovery configuration that you set up. The alias provides a single stable Fully Qualified Domain Name (FQDN) connection string. Applications use this alias connection string to connect to a namespace. Using an alias ensures that the connection string is unchanged when the failover is triggered.

  • Primary/secondary namespace: The namespaces that correspond to the alias. The primary namespace is "active" and receives messages (it can be an existing or a new namespace). The secondary namespace is "passive" and does not receive messages. The metadata of the two namespaces is kept in sync, so after a failover the secondary can seamlessly accept messages without any application code or connection string changes. To ensure that only the active namespace receives messages, you must use the alias.

  • Metadata: Entities such as queues, topics, and subscriptions, together with the properties of the service that are associated with the namespace. Only entities and their settings are replicated automatically. Messages are not replicated.

  • Failover: The process of activating the secondary namespace.
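To make the alias idea concrete, the following is a minimal sketch of how an alias-based connection string could be assembled and inspected. The helper names, the alias `contoso-alias`, and the default DNS suffix `servicebus.windows.net` are assumptions for illustration (sovereign clouds use a different suffix); the article itself does not prescribe this code.

```python
# Assumption: public-cloud DNS suffix. Other clouds use a different suffix.
DEFAULT_SUFFIX = "servicebus.windows.net"


def alias_connection_string(alias: str, key_name: str, key: str,
                            suffix: str = DEFAULT_SUFFIX) -> str:
    """Build a connection string whose endpoint is the alias FQDN."""
    if not alias or "." in alias:
        raise ValueError("alias must be a bare name, not an FQDN")
    return (f"Endpoint=sb://{alias}.{suffix}/;"
            f"SharedAccessKeyName={key_name};"
            f"SharedAccessKey={key}")


def endpoint_host(connection_string: str) -> str:
    """Extract the host name from the Endpoint= part of a connection string."""
    parts = dict(p.split("=", 1) for p in connection_string.split(";") if p)
    return parts["Endpoint"].removeprefix("sb://").rstrip("/")


# Hypothetical values, for illustration only.
cs = alias_connection_string("contoso-alias", "RootManageSharedAccessKey", "abc=")
print(endpoint_host(cs))
```

Because applications hold only the alias host name, nothing in this string has to change when the alias is later repointed from the primary to the secondary namespace.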

Setup

The following section gives an overview of how to set up pairing between the namespaces.

The setup process is as follows:

  1. Provision a primary Service Bus Premium namespace.

  2. Provision a secondary Service Bus Premium namespace in a region different from the one in which the primary namespace is provisioned. This allows fault isolation across different datacenter regions.

  3. Create a pairing between the primary namespace and the secondary namespace to obtain the alias.

    Note

    If you have migrated your Azure Service Bus Standard namespace to Azure Service Bus Premium, you must use the pre-existing alias (that is, your Service Bus Standard namespace connection string) to create the disaster recovery configuration through PowerShell, the CLI, or the REST API.

    This is because, during migration, your Azure Service Bus Standard namespace connection string/DNS name itself becomes an alias to your Azure Service Bus Premium namespace.

    Your client applications must use this alias (that is, the Azure Service Bus Standard namespace connection string) to connect to the Premium namespace where the disaster recovery pairing has been set up.

    If you use the portal to set up the disaster recovery configuration, the portal abstracts this caveat from you.

  4. Use the alias obtained in step 3 to connect your client applications to the Geo-DR enabled primary namespace. Initially, the alias points to the primary namespace.

  5. [Optional] Add some monitoring to detect whether a failover is necessary.
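The preconditions in steps 1 and 2 can be checked before you attempt to pair. The sketch below is an illustrative local check only; the `Namespace` type, the `pairing_problems` helper, and the sample names and regions are hypothetical and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    name: str
    sku: str      # for example "Premium" or "Standard"
    region: str   # for example "eastus"


def pairing_problems(primary: Namespace, secondary: Namespace) -> list[str]:
    """Return human-readable reasons why the two namespaces should not be paired."""
    problems = []
    for role, ns in (("primary", primary), ("secondary", secondary)):
        if ns.sku.lower() != "premium":
            problems.append(f"{role} namespace {ns.name} is not Premium")
    if primary.region.lower() == secondary.region.lower():
        problems.append("primary and secondary are in the same region")
    if primary.name.lower() == secondary.name.lower():
        problems.append("primary and secondary must be different namespaces")
    return problems


# Hypothetical namespaces, for illustration only.
east = Namespace("sb-east", "Premium", "eastus")
west = Namespace("sb-west", "Premium", "westus")
print(pairing_problems(east, west))  # an empty list means no problems were found
```

A check like this catches only the conditions named in the steps above; the service performs its own validation when the pairing is created.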

Failover flow

A failover is triggered manually by the customer (either explicitly through a command, or through customer-owned business logic that triggers the command) and never by Azure. This gives the customer full ownership of, and visibility into, outage resolution on Azure's backbone.

After the failover is triggered:

  1. The alias connection string is updated to point to the secondary Premium namespace.

  2. Clients (senders and receivers) automatically connect to the secondary namespace.

  3. The existing pairing between the primary and secondary Premium namespaces is broken.

Once the failover is initiated:

  1. If another outage occurs, you want to be able to fail over again. Therefore, set up another passive namespace and update the pairing.

  2. Pull messages from the former primary namespace once it is available again. After that, either use that namespace for regular messaging outside of your geo-recovery setup, or delete it.

Note

Only fail-forward semantics are supported. In this scenario, you fail over and then re-pair with a new namespace. Failing back, as you might in a SQL cluster, for example, is not supported.
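The fail-forward behavior described above can be pictured as a small state machine. The following is a toy in-memory model, not service code: the `GeoPairing` class and the namespace names are hypothetical, and the real alias repointing is done by the service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoPairing:
    alias: str
    primary: str                # namespace the alias currently points to
    secondary: Optional[str]    # passive partner, or None when unpaired

    def fail_over(self) -> None:
        if self.secondary is None:
            raise RuntimeError("no secondary namespace is paired")
        # The alias now points to the former secondary, and the pairing is broken.
        self.primary, self.secondary = self.secondary, None

    def pair_with(self, new_secondary: str) -> None:
        if self.secondary is not None:
            raise RuntimeError("already paired; break the pairing first")
        if new_secondary == self.primary:
            raise ValueError("a namespace cannot be paired with itself")
        self.secondary = new_secondary


# Hypothetical names, for illustration only.
pairing = GeoPairing("orders", "ns-east", "ns-west")
pairing.fail_over()            # alias now points to ns-west; no partner remains
pairing.pair_with("ns-north")  # fail forward: re-pair with a new namespace
```

The model has no operation that returns the alias to the old primary, which mirrors the point that failing back is not supported.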

You can automate failover either with monitoring systems or with custom-built monitoring solutions. However, such automation takes extra planning and work, which is out of the scope of this article.
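As a starting point for such automation, the sketch below shows one possible decision rule: recommend a failover only after probes have failed continuously for a threshold period. The `FailoverAdvisor` class is hypothetical, how the health probe is performed is left open, and the 15-minute default simply echoes the example given under Considerations in this article; treat it as an assumption to tune.

```python
from datetime import datetime, timedelta
from typing import Optional


class FailoverAdvisor:
    """Tracks probe results and reports when an outage has lasted too long."""

    def __init__(self, threshold: timedelta = timedelta(minutes=15)):
        self.threshold = threshold
        self._down_since: Optional[datetime] = None

    def record(self, healthy: bool, now: datetime) -> bool:
        """Record one probe result; return True when failover is advisable."""
        if healthy:
            self._down_since = None   # any success resets the outage window
            return False
        if self._down_since is None:
            self._down_since = now    # first failure of a new outage
        return now - self._down_since >= self.threshold
```

A caller would feed `record` the result of each periodic probe and, when it returns True, hand the decision to an operator or to the business logic that issues the failover command.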

Management

If you made a mistake, for example by pairing the wrong regions during the initial setup, you can break the pairing of the two namespaces at any time. If you want to use the paired namespaces as regular namespaces, delete the alias.

Use existing namespace as alias

If you have a scenario in which you cannot change the connections of producers and consumers, you can reuse your namespace name as the alias name. See the sample code on GitHub.

Samples

The samples on GitHub show how to set up and initiate a failover. These samples demonstrate the following concepts:

  • A .NET sample and the settings required in Azure Active Directory to use Azure Resource Manager with Service Bus, to set up and enable Geo-disaster recovery.
  • Steps required to execute the sample code.
  • How to use an existing namespace as an alias.
  • Steps to alternatively enable Geo-disaster recovery via PowerShell or the CLI.
  • How to send and receive from the current primary or secondary namespace using the alias.

Considerations

Keep the following considerations in mind with this release:

  1. In your failover planning, you should also consider the time factor. For example, if you lose connectivity for longer than 15 to 20 minutes, you might decide to initiate the failover.

  2. Because no data is replicated, currently active sessions are not replicated. Additionally, duplicate detection and scheduled messages may not work. New sessions, new scheduled messages, and new duplicates will work.

  3. Failing over a complex distributed infrastructure should be rehearsed at least once.

  4. Synchronizing entities can take some time, approximately 50-100 entities per minute. Subscriptions and rules also count as entities.
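The synchronization rate in item 4 lends itself to a quick estimate. The sketch below turns the stated 50-100 entities per minute into a best-case and worst-case duration; the `sync_minutes` helper and the entity counts are hypothetical, and the result is a rough planning figure rather than a guarantee.

```python
import math


def sync_minutes(queues: int, topics: int, subscriptions: int, rules: int) -> tuple[int, int]:
    """Return (best_case, worst_case) whole minutes at 100 and 50 entities per minute."""
    # Subscriptions and rules count as entities, so they are included in the total.
    total = queues + topics + subscriptions + rules
    return math.ceil(total / 100), math.ceil(total / 50)


# 200 queues + 50 topics + 500 subscriptions + 750 rules = 1500 entities
print(sync_minutes(200, 50, 500, 750))  # (15, 30)
```

For a namespace of that size, initial pairing could therefore take on the order of 15 to 30 minutes before the secondary mirrors all metadata.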

Private endpoints

This section provides additional considerations for using Geo-disaster recovery with namespaces that use private endpoints.

New pairings

If you try to create a pairing between a primary namespace with a private endpoint and a secondary namespace without a private endpoint, the pairing will fail. The pairing will succeed only if both the primary and secondary namespaces have private endpoints. We recommend that you use the same configurations on the primary and secondary namespaces and on the virtual networks in which the private endpoints are created.

Note

When you try to pair a primary namespace that has a private endpoint with a secondary namespace, the validation process checks only whether a private endpoint exists on the secondary namespace. It doesn't check whether the endpoint works or will work after failover. It's your responsibility to ensure that the secondary namespace with its private endpoint will work as expected after failover.

To test that the private endpoint configurations are the same, send a Get queues request to the secondary namespace from outside the virtual network, and verify that you receive an error message from the service.

Existing pairings

If a pairing between the primary and secondary namespaces already exists, private endpoint creation on the primary namespace will fail. To resolve this, create a private endpoint on the secondary namespace first, and then create one for the primary namespace.

Note

Although the secondary namespace is otherwise read-only, updates to its private endpoint configurations are permitted.

When creating a disaster recovery configuration for your application and Service Bus, you must create private endpoints for both the primary and secondary Service Bus namespaces against the virtual networks that host the primary and secondary instances of your application.

Let's say you have two virtual networks, VNET-1 and VNET-2, and these primary and secondary namespaces: ServiceBus-Namespace1-Primary and ServiceBus-Namespace2-Secondary. You need to do the following steps:

  • On ServiceBus-Namespace1-Primary, create two private endpoints that use subnets from VNET-1 and VNET-2.
  • On ServiceBus-Namespace2-Secondary, create two private endpoints that use the same subnets from VNET-1 and VNET-2.

Figure: Private endpoints and virtual networks

The advantage of this approach is that failover can happen at the application layer independently of the Service Bus namespace. Consider the following scenarios:

Application-only failover: Here, the application no longer exists in VNET-1 and moves to VNET-2. Because private endpoints are configured on both VNET-1 and VNET-2 for both the primary and secondary namespaces, the application will just work.

Service Bus namespace-only failover: Here again, because private endpoints are configured on both virtual networks for both the primary and secondary namespaces, the application will just work.
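Both scenarios work only if every namespace has a private endpoint in every virtual network. The following sketch checks that coverage over a locally maintained inventory; the `missing_endpoints` helper and the dictionary layout are assumptions for illustration, and the inventory would have to be filled in from your own deployment records.

```python
def missing_endpoints(endpoints: dict[str, set[str]],
                      namespaces: list[str],
                      vnets: list[str]) -> list[tuple[str, str]]:
    """List every (namespace, vnet) pair that has no private endpoint."""
    return [(ns, vnet) for ns in namespaces for vnet in vnets
            if vnet not in endpoints.get(ns, set())]


# Inventory using the names from the example above; the secondary namespace
# is deliberately missing its VNET-2 endpoint to show what the check reports.
inventory = {
    "ServiceBus-Namespace1-Primary": {"VNET-1", "VNET-2"},
    "ServiceBus-Namespace2-Secondary": {"VNET-1"},
}
gaps = missing_endpoints(
    inventory,
    ["ServiceBus-Namespace1-Primary", "ServiceBus-Namespace2-Secondary"],
    ["VNET-1", "VNET-2"],
)
print(gaps)  # [('ServiceBus-Namespace2-Secondary', 'VNET-2')]
```

An empty result means an application in either virtual network can reach whichever namespace is active, which is the property both failover scenarios rely on.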

Note

For guidance on geo-disaster recovery of a virtual network, see Virtual Network - Business Continuity.

Next steps

To learn more about Service Bus messaging, see the following articles: