Failover and patching for Azure Cache for Redis

To build resilient and successful client applications, it's critical to understand failover in the context of the Azure Cache for Redis service. A failover can be part of planned management operations, or it might be caused by unplanned hardware or network failures. A common cause of cache failover is the management service patching the Azure Cache for Redis binaries. This article covers what a failover is, how it occurs during patching, and how to build a resilient client application.

What is a failover?

Let's start with an overview of failover for Azure Cache for Redis.

A quick summary of cache architecture

A cache is constructed of multiple virtual machines with separate, private IP addresses. Each virtual machine, also known as a node, is connected to a shared load balancer with a single virtual IP address. Each node runs the Redis server process and is accessible by means of the host name and the Redis ports. Each node is considered either a primary or a replica node. When a client application connects to a cache, its traffic goes through this load balancer and is automatically routed to the primary node.

In a Basic cache, the single node is always a primary. In a Standard or Premium cache, there are two nodes: one is chosen as the primary and the other is the replica. Because Standard and Premium caches have multiple nodes, one node might be unavailable while the other continues to process requests. Clustered caches are made of many shards, each with distinct primary and replica nodes. One shard might be down while the others remain available.


A Basic cache doesn't have multiple nodes and doesn't offer a service-level agreement (SLA) for its availability. Basic caches are recommended only for development and testing purposes. Use a Standard or Premium cache for a multi-node deployment to increase availability.

Explanation of a failover

A failover occurs when a replica node promotes itself to become a primary node, and the old primary node closes existing connections. After the old primary node comes back up, it notices the change in roles and demotes itself to become a replica. It then connects to the new primary and synchronizes data. A failover might be planned or unplanned.

A planned failover takes place during system updates, such as Redis patching or OS upgrades, and during management operations, such as scaling and rebooting. Because the nodes receive advance notice of the update, they can cooperatively swap roles and quickly notify the load balancer of the change. A planned failover typically finishes in less than 1 second.

An unplanned failover might happen because of hardware failure, network failure, or other unexpected outages of the primary node. The replica node promotes itself to primary, but the process takes longer. A replica node must first detect that its primary node isn't available before it can initiate the failover process. The replica node must also verify that this unplanned failure isn't transient or local, to avoid an unnecessary failover. This delay in detection means that an unplanned failover typically finishes within 10 to 15 seconds.

How does patching occur?

The Azure Cache for Redis service regularly updates your cache with the latest platform features and fixes. To patch a cache, the service follows these steps:

  1. The management service selects one node to be patched.
  2. If the selected node is a primary node, the corresponding replica node cooperatively promotes itself. This promotion is considered a planned failover.
  3. The selected node reboots to take the new changes and comes back up as a replica node.
  4. The replica node connects to the primary node and synchronizes data.
  5. When the data sync is complete, the patching process repeats for the remaining nodes.
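The steps above can be sketched as a small simulation. This is only a toy model for reasoning about how many failovers a patch cycle causes, not how the management service is implemented; the `Node` class and `patch_cache` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str            # "primary" or "replica"
    patched: bool = False


def patch_cache(nodes):
    """Patch a two-node cache one node at a time and return an event log."""
    log = []
    for node in nodes:                       # step 1: select one node
        if node.role == "primary":           # step 2: planned failover
            replica = next(n for n in nodes if n is not node)
            replica.role = "primary"
            log.append(f"failover: {replica.name} promoted")
        node.role = "replica"                # step 3: reboot, return as replica
        node.patched = True
        log.append(f"patched: {node.name}")
        log.append(f"synced: {node.name}")   # step 4: sync from the primary
    return log                               # step 5: the loop covers remaining nodes


nodes = [Node("node-0", "primary"), Node("node-1", "replica")]
for event in patch_cache(nodes):
    print(event)
```

In this model, patching the primary first produces two planned failovers in one cycle, while patching the replica first produces only one. Either way, a client should expect at least one connection break per patch cycle.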

Because patching is a planned failover, the replica node quickly promotes itself to become a primary and begins servicing requests and new connections. Basic caches don't have a replica node and are unavailable until the update is complete. Each shard of a clustered cache is patched separately and won't close connections to another shard.


Nodes are patched one at a time to prevent data loss. Basic caches will have data loss. Clustered caches are patched one shard at a time.

Multiple caches in the same resource group and region are also patched one at a time. Caches that are in different resource groups or different regions might be patched simultaneously.

Because full data synchronization happens before the process repeats, data loss is unlikely to occur when you use a Standard or Premium cache. You can further guard against data loss by exporting data and enabling persistence.

Additional cache load

Whenever a failover occurs, Standard and Premium caches need to replicate data from one node to the other. This replication increases the load on both server memory and CPU. If the cache instance is already heavily loaded, client applications might experience increased latency. In extreme cases, client applications might receive time-out exceptions. To help mitigate the impact of this additional load, configure the cache's maxmemory-reserved setting.

How does a failover affect my client application?

The number of errors seen by the client application depends on how many operations were pending on that connection at the time of the failover. Any connection that's routed through the node that closed its connections will see errors. Many client libraries can throw different types of errors when connections break, including time-out exceptions, connection exceptions, or socket exceptions. The number and type of exceptions depend on where in the code path the request is when the cache closes its connections. For instance, an operation that has sent a request but hasn't received a response when the failover occurs might get a time-out exception. New requests on the closed connection object receive connection exceptions until the reconnection succeeds.

Most client libraries attempt to reconnect to the cache if they're configured to do so. However, unforeseen bugs can occasionally place the library objects into an unrecoverable state. If errors persist for longer than a preconfigured amount of time, the connection object should be recreated. In .NET and other object-oriented languages, you can recreate the connection without restarting the application by using a Lazy<T> pattern.
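The same idea can be sketched in Python. This is a minimal illustration, not the .NET Lazy<T> implementation: the `LazyConnection` class, its `factory` argument, and the 30-second threshold are all assumptions made for the example. The holder creates the connection on first use and discards it once errors have persisted longer than the configured time, so that the next caller gets a fresh connection.

```python
import threading
import time


class LazyConnection:
    """Create a connection on first use and recreate it if errors persist."""

    def __init__(self, factory, max_error_seconds=30.0, clock=time.monotonic):
        self._factory = factory              # hypothetical: returns a new connection
        self._max_error_seconds = max_error_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._conn = None
        self._first_error_at = None          # start of the current error streak

    def get(self):
        """Return the shared connection, creating it if needed."""
        with self._lock:
            if self._conn is None:
                self._conn = self._factory()
            return self._conn

    def report_success(self):
        """A successful call ends the current error streak."""
        with self._lock:
            self._first_error_at = None

    def report_error(self):
        """Record a failure; drop the connection once the streak is too long."""
        with self._lock:
            now = self._clock()
            if self._first_error_at is None:
                self._first_error_at = now
            elif now - self._first_error_at >= self._max_error_seconds:
                old, self._conn = self._conn, None
                self._first_error_at = None
                close = getattr(old, "close", None)
                if callable(close):
                    close()                  # release the broken connection
```

Callers would wrap each cache operation, calling `report_success()` or `report_error()` afterward. Tracking the start of the error streak, rather than counting errors, keeps a short burst during a failover from triggering an unnecessary reconnect.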

How do I make my application resilient?

Because you can't avoid failovers completely, write your client applications to be resilient to connection breaks and failed requests. Although most client libraries automatically reconnect to the cache endpoint, few of them attempt to retry failed requests. Depending on the application scenario, it might make sense to use retry logic with backoff.
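A minimal sketch of retry logic with backoff follows. The function name `retry_with_backoff` and its default values are assumptions chosen for illustration. The built-in `ConnectionError` and `TimeoutError` stand in for whatever exception types your client library raises, which you would pass through the `retryable` parameter.

```python
import random
import time


def retry_with_backoff(operation, retryable=(ConnectionError, TimeoutError),
                       max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise                        # out of attempts: surface the error
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients don't retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

With these defaults, the four waits between the five attempts are capped at 0.5, 1, 2, and 4 seconds (7.5 seconds in total), which doesn't cover the full 10 to 15 seconds of an unplanned failover; raise `max_attempts` or `base_delay` if that matters to you. Retry only operations that are safe to repeat, because a request that timed out might still have been applied by the server.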

To test a client application's resiliency, use a reboot as a manual trigger for connection breaks. Additionally, we recommend that you schedule updates on a cache. Tell the management service to apply Redis runtime patches during specified weekly windows. Choose windows when client application traffic is typically low, to avoid potential incidents.

Can I be notified in advance of a planned maintenance?

Azure Cache for Redis now publishes notifications on a publish/subscribe channel called AzureRedisEvents around 30 seconds before planned updates. These are runtime notifications, built especially for applications that can use circuit breakers to bypass the cache or buffer commands, for example, during planned updates. It's not a mechanism that can notify you days or hours in advance.
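A subscriber for this channel might look like the following sketch. It assumes a client object that exposes a redis-py style `pubsub()` interface, and it assumes that each notification is a pipe-delimited string of alternating keys and values; the payload layout isn't specified in this article, so verify it against real messages. The names `parse_event` and `watch_maintenance` are hypothetical.

```python
def parse_event(message):
    """Split 'key|value|key|value' text into a dict (format is an assumption)."""
    parts = message.split("|")
    return dict(zip(parts[0::2], parts[1::2]))


def watch_maintenance(client, on_event, channel="AzureRedisEvents"):
    """Subscribe to the channel and pass each parsed notification to on_event."""
    pubsub = client.pubsub()
    pubsub.subscribe(channel)
    for item in pubsub.listen():             # blocks until messages arrive
        if item.get("type") != "message":
            continue                         # skip subscribe confirmations
        data = item["data"]
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        on_event(parse_event(data))
```

Because `listen()` blocks, an application would typically run `watch_maintenance` on a background thread and have `on_event` open a circuit breaker or start buffering commands for the duration of the update.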

Client network-configuration changes

Certain client-side network-configuration changes can trigger "No connection available" errors. Such changes might include:

  • Swapping a client application's virtual IP address between staging and production slots.
  • Scaling the size or number of instances of your application.

Such changes can cause a connectivity issue that lasts less than one minute. Your client application will probably lose its connection to other external network resources in addition to the Azure Cache for Redis service.

Next steps