Asynchronous messaging patterns and high availability

Asynchronous messaging can be implemented in a variety of ways. With queues, topics, and subscriptions, Azure Service Bus supports asynchronism through a store-and-forward mechanism. In normal (synchronous) operation, you send messages to queues and topics, and receive messages from queues and subscriptions. Applications you write depend on these entities always being available. When the health of an entity changes, for any of a variety of reasons, you need a way to provide a reduced-capability entity that can still satisfy most needs.

Applications typically use asynchronous messaging patterns to enable a number of communication scenarios. You can build applications in which clients can send messages to services even when the service is not running. For applications that experience bursts of communications, a queue can help level the load by providing a place to buffer communications. Finally, you get a simple but effective load balancer that distributes messages across multiple machines.
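The buffering and load-distribution ideas above can be illustrated with a minimal in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a Service Bus queue and threads as stand-ins for receiving machines; the function name `run_workers` and the worker names are hypothetical, and nothing here calls the Service Bus API.

```python
import queue
import threading


def run_workers(messages, worker_count=3):
    """Buffer messages in a queue, then let several workers drain it."""
    buffer = queue.Queue()
    for message in messages:
        buffer.put(message)  # the sender enqueues without waiting for a receiver

    handled = []
    lock = threading.Lock()

    def worker(name):
        # Each worker competes for the next message until the buffer is empty.
        while True:
            try:
                message = buffer.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled.append((name, message))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

Each message is handled by exactly one worker, which is the property that lets a queue act as a simple load balancer. Which worker receives which message is not deterministic.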

To keep a durable messaging system available, consider the different ways in which these entities can appear unavailable. Generally speaking, an entity becomes unavailable to the applications you write in one of the following ways:

  • Unable to send messages.
  • Unable to receive messages.
  • Unable to manage entities (create, retrieve, update, or delete entities).
  • Unable to contact the service.

For each of these failures, different failure modes exist that allow an application to continue working at some level of reduced capability. For example, a system that can send messages but not receive them can still accept orders from customers but cannot process those orders. This topic discusses the issues that can occur and how they are mitigated. Service Bus has introduced a number of mitigations that you must opt into, and this topic also discusses the rules governing the use of those opt-in mitigations.

Reliability in Service Bus

There are several ways to handle message and entity issues, and there are guidelines governing the appropriate use of those mitigations. To understand the guidelines, you must first understand what can fail in Service Bus. Due to the design of Azure systems, all of these issues tend to be short-lived. At a high level, the causes of unavailability are as follows:

  • Throttling from an external system on which Service Bus depends. Throttling occurs from interactions with storage and compute resources.
  • An issue in a system on which Service Bus depends. For example, a given part of storage can encounter issues.
  • Failure of Service Bus on a single subsystem. In this situation, a compute node can get into an inconsistent state and must restart itself, causing all entities it serves to load balance to other nodes. This in turn can cause a short period of slow message processing.
  • Failure of Service Bus within an Azure datacenter. This is a "catastrophic failure" during which the system is unreachable for many minutes or a few hours.

Note

The term storage can mean both Azure Storage and SQL Azure.

Service Bus contains a number of mitigations for these issues. The following sections discuss each issue and its respective mitigations.

Throttling

With Service Bus, throttling enables cooperative message rate management. Each individual Service Bus node houses many entities. Each of those entities makes demands on the system in terms of CPU, memory, storage, and other facets. When any of these facets detects usage that exceeds defined thresholds, Service Bus can deny a given request. The caller receives a ServerBusyException and retries after 10 seconds.

As a mitigation, the code must read the error and halt any retries of the message for at least 10 seconds. Because the error can occur in different pieces of the customer application, each piece is expected to execute the retry logic independently. The code can reduce the probability of being throttled by enabling partitioning on a queue or topic.
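The throttling mitigation can be sketched as follows. This is a minimal illustration, not the Service Bus client's implementation: `ServerBusyError` is a hypothetical stand-in for the ServerBusyException named above, and `send`, `max_attempts`, and the injectable `sleep` parameter are assumptions made so the example is self-contained.

```python
import time


class ServerBusyError(Exception):
    """Hypothetical stand-in for the ServerBusyException named in the text."""


THROTTLE_BACKOFF_SECONDS = 10  # minimum pause stated in the text


def send_with_throttle_backoff(send, message, max_attempts=5, sleep=time.sleep):
    """Call send(message), pausing 10 seconds after each throttling error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except ServerBusyError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Halt retries of this message for the required back-off period.
            sleep(THROTTLE_BACKOFF_SECONDS)
```

Each piece of an application would wrap its own send calls this way, so that every piece backs off independently when it is throttled.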

Issue for an Azure dependency

Other components within Azure can occasionally have service issues. For example, when a system that Service Bus uses is being upgraded, that system can temporarily experience reduced capabilities. To work around these types of issues, Service Bus regularly investigates and implements mitigations. These mitigations do have side effects. For example, to handle transient issues with storage, Service Bus implements a system that allows message send operations to work consistently. Due to the nature of the mitigation, a sent message can take up to 15 minutes to appear in the affected queue or subscription and be ready for a receive operation. Generally speaking, most entities will not experience this issue. However, given the number of entities in Service Bus within Azure, this mitigation is sometimes needed for a small subset of Service Bus customers.

Service Bus failure on a single subsystem

With any application, circumstances can cause an internal component of Service Bus to become inconsistent. When Service Bus detects this, it collects data from the application to aid in diagnosing what happened. Once the data is collected, the application is restarted in an attempt to return it to a consistent state. This process happens fairly quickly. An entity can appear to be unavailable for up to a few minutes, though typical down times are much shorter.

In these cases, the client application generates a System.TimeoutException or MessagingException exception. Service Bus contains a mitigation for this issue in the form of automated client retry logic. If the retry period is exhausted and the message has still not been delivered, you can explore the other features mentioned in the article on handling outages and disasters.
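The idea of retrying until a retry period is exhausted can be sketched as follows. This is only an illustration of the concept, not the client's built-in retry logic: `MessagingError` is a hypothetical stand-in for MessagingException, Python's built-in `TimeoutError` stands in for System.TimeoutException, and the `retry_period`, `delay`, `clock`, and `sleep` parameters are assumptions.

```python
import time


class MessagingError(Exception):
    """Hypothetical stand-in for the MessagingException named in the text."""


def send_with_retry_period(send, message, retry_period=60.0, delay=1.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Retry send(message) on transient errors until retry_period elapses.

    Returns True once a send succeeds, or False when the retry period is
    exhausted, so the caller can switch to another recovery strategy.
    """
    deadline = clock() + retry_period
    while True:
        try:
            send(message)
            return True
        except (TimeoutError, MessagingError):
            # Stop if the next attempt would start after the deadline.
            if clock() + delay > deadline:
                return False
            sleep(delay)
```

A `False` result corresponds to the point at which the retry period is exhausted and the message has not been delivered.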

Next steps

Now that you've learned the basics of asynchronous messaging in Service Bus, read more details about handling outages and disasters.