Best practices for insulating applications against Service Bus outages and disasters

Mission-critical applications must operate continuously, even in the presence of unplanned outages or disasters. This article describes techniques you can use to protect Service Bus applications against a potential service outage or disaster.

An outage is the temporary unavailability of Azure Service Bus. An outage can affect individual components of Service Bus, such as a messaging store, or even the entire datacenter. After the problem is fixed, Service Bus becomes available again. Typically, an outage does not cause loss of messages or other data. An example of a component failure is the unavailability of a particular messaging store. Examples of a datacenter-wide outage are a datacenter power failure or a faulty datacenter network switch. An outage can last from a few minutes to a few days.

A disaster is the permanent loss of a Service Bus scale unit or datacenter. The datacenter may or may not become available again. Typically, a disaster causes loss of some or all messages or other data. Examples of disasters are fire, flooding, or earthquake.

Protecting against Outages and Disasters - Service Bus Premium

High availability and disaster recovery are built into the Azure Service Bus Premium tier, both within the same region (via Availability Zones) and across different regions (via Geo-Disaster Recovery).

Geo-Disaster Recovery

Service Bus Premium supports Geo-disaster recovery at the namespace level. For more information, see Azure Service Bus Geo-disaster recovery. The disaster recovery feature, available for the Premium SKU only, implements metadata disaster recovery and relies on primary and secondary disaster recovery namespaces.

Protecting against Outages and Disasters - Service Bus Standard

To achieve resilience against datacenter outages when using the standard messaging pricing tier, Service Bus supports two approaches: active and passive replication. For each approach, if a given queue or topic must remain accessible in the presence of a datacenter outage, you can create it in both a primary and a secondary namespace. Both entities can have the same name. For example, a primary queue can be reached under contosoPrimary.servicebus.chinacloudapi.cn/myQueue, while its secondary counterpart can be reached under contosoSecondary.servicebus.chinacloudapi.cn/myQueue.


The active replication and passive replication setups are general-purpose solutions, not specific features of Service Bus. The replication logic (sending to two different namespaces) lives in the sender applications, and the receiver must implement custom logic for duplicate detection.

If the application does not require permanent sender-to-receiver communication, the application can implement a durable client-side queue to prevent message loss and to shield the sender from any transient Service Bus errors.
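As a rough illustration of a durable client-side queue, the following minimal sketch persists outgoing messages in a local SQLite table before any send is attempted. The `DurableOutbox` class, the table layout, and the use of `ConnectionError` to represent a transient Service Bus error are all assumptions made for this example; the `send` callable stands in for whatever client the application actually uses.

```python
import sqlite3
from typing import Callable


class DurableOutbox:
    """Hypothetical client-side queue persisted in SQLite (illustration only)."""

    def __init__(self, path: str = "outbox.db") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, body: str) -> None:
        # Persist first, so the message survives a crash or a failed send.
        self._db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        self._db.commit()

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self, send: Callable[[str], None]) -> int:
        """Send pending messages in order; stop at the first transient failure."""
        sent = 0
        rows = self._db.execute(
            "SELECT id, body FROM outbox ORDER BY id"
        ).fetchall()
        for row_id, body in rows:
            try:
                send(body)
            except ConnectionError:
                # Assumed stand-in for a transient error: keep the row, retry later.
                break
            self._db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self._db.commit()
            sent += 1
        return sent
```

The application calls `enqueue` on its normal path and calls `flush` periodically. Because a row is deleted only after the send succeeds, a crash between the two steps can resend a message, so this sketch gives at-least-once delivery and the receiver should still suppress duplicates.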

Active replication

Active replication uses entities in both namespaces for every operation. Any client that sends a message sends two copies of the same message. The first copy is sent to the primary entity (for example, contosoPrimary.servicebus.chinacloudapi.cn/sales), and the second copy is sent to the secondary entity (for example, contosoSecondary.servicebus.chinacloudapi.cn/sales).

A client receives messages from both queues. The receiver processes the first copy of a message and suppresses the second copy. To suppress duplicate messages, the sender must tag each message with a unique identifier. Both copies of the message must be tagged with the same identifier. You can use the BrokeredMessage.MessageId or BrokeredMessage.Label properties, or a custom property, to tag the message. The receiver must maintain a list of messages that it has already received.
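The send-twice and suppress-duplicates flow can be sketched as follows. This is a minimal in-memory model, not Service Bus client code: `InMemoryQueue`, `send_replicated`, and `DeduplicatingReceiver` are hypothetical names, and the `message_id` field plays the role that a property such as BrokeredMessage.MessageId would play in a real application.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    body: str
    # Unique identifier shared by both copies of the message.
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class InMemoryQueue:
    """Stand-in for a queue in one namespace (illustration only)."""

    def __init__(self) -> None:
        self._items = deque()

    def send(self, msg: Message) -> None:
        self._items.append(msg)

    def receive(self) -> Optional[Message]:
        return self._items.popleft() if self._items else None


def send_replicated(primary: InMemoryQueue, secondary: InMemoryQueue, body: str) -> str:
    """Send the same message, with the same identifier, to both entities."""
    msg = Message(body)
    primary.send(msg)
    secondary.send(msg)
    return msg.message_id


class DeduplicatingReceiver:
    """Reads from both queues and suppresses copies it has already seen."""

    def __init__(self, *queues: InMemoryQueue) -> None:
        self._queues = queues
        self._seen = set()  # identifiers of messages already processed

    def receive_all(self) -> List[str]:
        bodies = []
        for queue in self._queues:
            while (msg := queue.receive()) is not None:
                if msg.message_id in self._seen:
                    continue  # second copy: suppress
                self._seen.add(msg.message_id)
                bodies.append(msg.body)
        return bodies
```

In this sketch the set of seen identifiers grows without bound. A real receiver would likely expire old entries or persist the set, depending on how long duplicates can lag behind the first copy.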

The Geo-replication with Service Bus Standard Tier sample demonstrates active replication of messaging entities.


The active replication approach doubles the number of operations, so it can lead to higher cost.

Passive replication

In the fault-free case, passive replication uses only one of the two messaging entities. A client sends the message to the active entity. If the operation on the active entity fails with an error code that indicates the datacenter hosting the active entity might be unavailable, the client sends a copy of the message to the backup entity. At that point, the active and the backup entities switch roles: the sending client considers the old active entity to be the new backup entity, and the old backup entity to be the new active entity. If both send operations fail, the roles of the two entities remain unchanged and an error is returned.
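The role-switching logic described above can be sketched as follows. The `EntityUnavailableError` exception, the `FakeQueue` test double, and the `PassiveReplicationSender` class are hypothetical names introduced for this example; which real error codes indicate an unavailable datacenter depends on the client library in use and is not modeled here.

```python
class EntityUnavailableError(Exception):
    """Hypothetical error: the datacenter hosting the entity might be unavailable."""


class FakeQueue:
    """Test double for a queue that can be switched on and off."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self.messages = []

    def send(self, message: str) -> None:
        if not self.available:
            raise EntityUnavailableError(self.name)
        self.messages.append(message)


class PassiveReplicationSender:
    def __init__(self, active: FakeQueue, backup: FakeQueue) -> None:
        self.active = active
        self.backup = backup

    def send(self, message: str) -> None:
        try:
            self.active.send(message)
            return  # fault-free case: only one operation is performed
        except EntityUnavailableError:
            pass  # fall through and try the backup entity
        # If this send also fails, the exception propagates to the caller
        # and the roles below are left unchanged.
        self.backup.send(message)
        # The backup accepted the message: switch roles.
        self.active, self.backup = self.backup, self.active
```

Swapping roles only after the backup send succeeds keeps the sender pointed at the original active entity when both entities are down, which matches the behavior described in the text.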

A client receives messages from both queues. Because the receiver might receive two copies of the same message, the receiver must suppress duplicate messages. You can suppress duplicates in the same way as described for active replication.

In general, passive replication is more economical than active replication because in most cases only one operation is performed. In the fault-free case, latency, throughput, and monetary cost are identical to the non-replicated scenario.

When using passive replication, messages can be lost or received twice in the following scenarios:

  • Message delay or loss: Assume that the sender successfully sends a message m1 to the primary queue, and then the queue becomes unavailable before the receiver receives m1. The sender sends a subsequent message m2 to the secondary queue. If the primary queue is only temporarily unavailable, the receiver receives m1 after the queue becomes available again. In case of a disaster, the receiver may never receive m1.
  • Duplicate reception: Assume that the sender sends a message m to the primary queue. Service Bus successfully processes m but fails to send a response. After the send operation times out, the sender sends an identical copy of m to the secondary queue. If the receiver is able to receive the first copy of m before the primary queue becomes unavailable, the receiver receives both copies of m at approximately the same time. If the receiver is not able to receive the first copy of m before the primary queue becomes unavailable, the receiver initially receives only the second copy of m, and then receives the first copy of m when the primary queue becomes available again.
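The duplicate reception scenario can be reproduced with a small simulation. Everything here is hypothetical and in-memory: `LossyQueue` models a queue that stores the message but loses the response, `ResponseLostError` models the resulting timeout, and `receive_unique` shows the receiver-side suppression that the scenario requires.

```python
class ResponseLostError(Exception):
    """Hypothetical timeout: the queue stored the message but the reply was lost."""


class LossyQueue:
    def __init__(self, lose_response: bool = False) -> None:
        self.lose_response = lose_response
        self.messages = []

    def send(self, message_id: str, body: str) -> None:
        # The message is stored whether or not the sender sees the response.
        self.messages.append((message_id, body))
        if self.lose_response:
            raise ResponseLostError("send timed out")


def send_with_failover(primary, secondary, message_id, body):
    try:
        primary.send(message_id, body)
    except ResponseLostError:
        # The sender cannot tell whether the send succeeded, so it sends
        # an identical copy with the same identifier to the secondary queue.
        secondary.send(message_id, body)


def receive_unique(queues, seen):
    """Drain all queues and return only bodies whose identifier is new."""
    unique = []
    for queue in queues:
        for message_id, body in queue.messages:
            if message_id not in seen:
                seen.add(message_id)
                unique.append(body)
        queue.messages.clear()
    return unique


primary = LossyQueue(lose_response=True)
secondary = LossyQueue()
send_with_failover(primary, secondary, "m", "payload")

copies = len(primary.messages) + len(secondary.messages)  # 2: one copy per queue
delivered = receive_unique([primary, secondary], set())   # ["payload"]
```

The simulation does not cover the message loss scenario, because suppression on the receiver cannot recover a message that was stored only in a permanently lost primary queue.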

The Geo-replication with Service Bus Standard Tier sample demonstrates passive replication of messaging entities.

Protecting relay endpoints against datacenter outages or disasters

Geo-replication of Azure Relay endpoints allows a service that exposes a relay endpoint to be reachable in the presence of Service Bus outages. To achieve geo-replication, the service must create two relay endpoints in different namespaces. The namespaces must reside in different datacenters and the two endpoints must have different names. For example, a primary endpoint can be reached under contosoPrimary.servicebus.chinacloudapi.cn/myPrimaryService, while its secondary counterpart can be reached under contosoSecondary.servicebus.chinacloudapi.cn/mySecondaryService.

The service then listens on both endpoints, and a client can invoke the service via either endpoint. A client application randomly picks one of the relays as the active endpoint and sends its request to that endpoint. If the operation fails with an error code, the failure indicates that the relay endpoint is not available. The application then opens a channel to the backup endpoint and reissues the request. At that point, the active and the backup endpoints switch roles: the client application considers the old active endpoint to be the new backup endpoint, and the old backup endpoint to be the new active endpoint. If both requests fail, the roles of the two endpoints remain unchanged and an error is returned.
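A client following this scheme might look like the sketch below. Endpoints are modeled as plain callables, and `RelayUnavailableError` and `FailoverRelayClient` are hypothetical names; opening a real relay channel is outside the scope of the example.

```python
import random


class RelayUnavailableError(Exception):
    """Hypothetical error: the relay endpoint is not available."""


class FailoverRelayClient:
    def __init__(self, endpoints, rng=None):
        first, second = endpoints
        rng = rng or random.Random()
        # Randomly pick which relay starts out as the active endpoint.
        if rng.random() < 0.5:
            first, second = second, first
        self.active, self.backup = first, second

    def invoke(self, request):
        try:
            return self.active(request)
        except RelayUnavailableError:
            pass  # active endpoint is down: reissue the request on the backup
        # If the backup also fails, the error propagates and roles stay unchanged.
        response = self.backup(request)
        self.active, self.backup = self.backup, self.active
        return response
```

Picking the starting endpoint at random spreads clients across both relays, so neither namespace carries all of the traffic in the fault-free case.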

Next steps

To learn more about disaster recovery, see these articles: