Disaster recovery and geo-distribution in Azure Durable Functions

Microsoft strives to ensure that Azure services are always available. However, unplanned service outages may occur. If your application requires resiliency, Microsoft recommends configuring your app for geo-redundancy. Customers should also have a disaster recovery plan in place for handling a regional service outage. An important part of such a plan is preparing to fail over to the secondary replica of your app and storage if the primary replica becomes unavailable.

In Durable Functions, all state is persisted in Azure Storage by default. A task hub is a logical container for the Azure Storage resources that are used for orchestrations and entities. Orchestrator, activity, and entity functions can interact with each other only when they belong to the same task hub. This document refers to task hubs when describing scenarios for keeping these Azure Storage resources highly available.

Orchestrations and entities can be triggered using client functions that are themselves triggered via HTTP or one of the other supported Azure Functions trigger types. They can also be triggered using built-in HTTP APIs. For simplicity, this article focuses on scenarios involving Azure Storage and HTTP-based function triggers, and on options to increase availability and minimize downtime during disaster recovery activities. Other trigger types, such as Service Bus or Cosmos DB triggers, are not explicitly covered.

The following scenarios are based on Active-Passive configurations, because they are guided by the use of Azure Storage. This pattern consists of deploying a backup (passive) function app to a different region. Traffic Manager monitors the primary (active) function app for HTTP availability and fails over to the backup function app if the primary fails. For more information, see Azure Traffic Manager's Priority Traffic-Routing Method.
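The priority-based selection described above can be illustrated with a small in-memory model. This is a minimal sketch of the idea only, not Traffic Manager's implementation: the `Endpoint` class, the endpoint names, and the `pick_endpoint` helper are all hypothetical, and the health flag stands in for whatever the real health probes report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str      # hypothetical endpoint name
    priority: int  # lower number means more preferred
    healthy: bool  # stand-in for the outcome of HTTP health probes


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


primary = Endpoint("func-primary", priority=1, healthy=True)
backup = Endpoint("func-backup", priority=2, healthy=True)

# Normal operation: the primary (active) app receives the traffic.
selected_normal = pick_endpoint([primary, backup])

# Simulated outage: the primary fails its probes, so the backup is selected.
primary.healthy = False
selected_outage = pick_endpoint([primary, backup])

# Recovery: once the primary is healthy again, it is selected again.
primary.healthy = True
selected_recovered = pick_endpoint([primary, backup])
```

The model makes the Active-Passive character explicit: the backup app only receives traffic while the primary is marked unhealthy.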


  • The proposed Active-Passive configuration ensures that a client is always able to trigger new orchestrations via HTTP. However, because the two function apps share the same task hub in storage, some background storage transactions are distributed between them. This configuration therefore incurs some added egress costs for the secondary function app.
  • The underlying storage account and task hub are created in the primary region and are shared by both function apps.
  • All redundantly deployed function apps must share the same function access keys if they are activated via HTTP. The Functions runtime exposes a management API that enables consumers to programmatically add, delete, and update function keys. Key management is also possible using Azure Resource Manager APIs.
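Keeping the function access keys identical across both apps amounts to computing the difference between two key sets and then applying it through whichever key management API you use. The sketch below covers only the comparison step, under the assumption that the keys of each app have already been read into plain dictionaries. The `plan_key_sync` helper, the key names, and the placeholder values are hypothetical, and no real management API is called.

```python
from typing import Dict, List, Tuple


def plan_key_sync(
    primary_keys: Dict[str, str], backup_keys: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Work out what must change on the backup app so its keys match the primary.

    Returns (keys to add or update on the backup, key names to delete from it).
    """
    to_write = {
        name: value
        for name, value in primary_keys.items()
        if backup_keys.get(name) != value
    }
    to_delete = sorted(name for name in backup_keys if name not in primary_keys)
    return to_write, to_delete


# Placeholder values only; real keys would come from the key management API.
primary_keys = {"default": "<key-A>", "partner": "<key-B>"}
backup_keys = {"default": "<key-A>", "partner": "<stale-key>", "old": "<key-C>"}

to_write, to_delete = plan_key_sync(primary_keys, backup_keys)
```

Running a check like this as part of deployment would surface key drift before a failover exposes it to clients.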

Scenario 1 - Load balanced compute with shared storage

If the compute infrastructure in Azure fails, the function app may become unavailable. To minimize the possibility of such downtime, this scenario uses two function apps deployed to different regions. Traffic Manager is configured to detect problems in the primary function app and automatically redirect traffic to the function app in the secondary region. The secondary function app shares the same Azure Storage account and task hub as the primary. Therefore, the state of the function apps isn't lost and work can resume normally. Once health is restored to the primary region, Azure Traffic Manager automatically starts routing requests to that function app again.
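Sharing a task hub means that both function apps must be configured with the same hub name and must resolve the same storage connection. The following sketch generates a `host.json` document for each app and checks that they match. It assumes the Durable Functions 2.x `host.json` layout (`extensions.durableTask.hubName` and `storageProvider.connectionStringName`), which you should verify against the host.json reference for your extension version. The hub name `OrdersTaskHub`, the app setting name `SharedDurableStorage`, and the `build_host_json` helper are hypothetical.

```python
import json


def build_host_json(hub_name: str, connection_setting: str) -> dict:
    """Build a host.json document that points the Durable extension at a task hub."""
    return {
        "version": "2.0",
        "extensions": {
            "durableTask": {
                "hubName": hub_name,
                # Name of the app setting that holds the storage connection string.
                "storageProvider": {"connectionStringName": connection_setting},
            }
        },
    }


# Both apps use the same hub name and the same app setting name (hypothetical values).
primary_host = build_host_json("OrdersTaskHub", "SharedDurableStorage")
backup_host = build_host_json("OrdersTaskHub", "SharedDurableStorage")

rendered = json.dumps(primary_host, indent=2)
```

Matching `host.json` files are necessary but not sufficient: the app setting named here must also contain a connection string for the same storage account in both apps.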

Diagram showing scenario 1.

This deployment scenario has several benefits:

  • If the compute infrastructure fails, work can resume in the failover region without data loss.
  • Traffic Manager takes care of the automatic failover to the healthy function app.
  • Traffic Manager automatically re-establishes traffic to the primary function app after the outage has been corrected.

However, consider the following when using this scenario:

  • If the function app is deployed using a dedicated App Service plan, replicating the compute infrastructure in the failover datacenter increases costs.
  • This scenario covers outages of the compute infrastructure, but the storage account remains a single point of failure for the function app. If a storage outage occurs, the application suffers downtime.
  • If the function app is failed over, latency increases because the app accesses its storage account across regions.
  • Accessing the storage service from a region other than the one where it's located incurs higher cost due to network egress traffic.
  • This scenario depends on Traffic Manager. Because of how Traffic Manager works, it may be some time before a client application that consumes a Durable Function queries Traffic Manager again for the function app address.
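The last consideration can be made concrete with a rough estimate of how long a client might keep calling the failed app. The model below is deliberately simplified and is an assumption rather than documented Traffic Manager behavior: it adds the time needed for the configured number of failed probes, one probe timeout, and the DNS time-to-live during which a client may keep using a cached address. All parameter values are illustrative, not product defaults.

```python
def estimated_failover_seconds(
    probe_interval_s: int,
    tolerated_failures: int,
    probe_timeout_s: int,
    dns_ttl_s: int,
) -> int:
    """Rough upper bound on the time before a client reaches the backup app.

    Simplified model: the endpoint is marked unhealthy after
    (tolerated_failures + 1) failed probes, and clients may then keep using
    a cached DNS answer for up to one full TTL.
    """
    detection = (tolerated_failures + 1) * probe_interval_s + probe_timeout_s
    return detection + dns_ttl_s


# Illustrative values only.
estimate = estimated_failover_seconds(
    probe_interval_s=30, tolerated_failures=3, probe_timeout_s=10, dns_ttl_s=60
)
```

Under this model, shortening the DNS TTL or the probe interval reduces the window, at the cost of more DNS queries and probe traffic.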


Starting in v2.3.0 of the Durable Functions extension, two function apps can be run safely at the same time with the same storage account and task hub configuration. The first app to start acquires an application-level blob lease that prevents other apps from stealing messages from the task hub queues. If this first app stops running, its lease expires and can be acquired by a second app, which then proceeds to process task hub messages.
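The lease handoff described above can be sketched with an in-memory stand-in for the blob lease. This is an illustration of the general lease pattern, not the extension's actual code: the `AppLease` class, the app names, and the 60-second duration are assumptions, and the article does not state the real lease duration or renewal schedule.

```python
from typing import Optional


class AppLease:
    """In-memory stand-in for an application-level blob lease."""

    def __init__(self, duration_s: float) -> None:
        self.duration_s = duration_s
        self.owner: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, app: str, now: float) -> bool:
        """Acquire or renew the lease; fail while another app holds a live lease."""
        held_by_other = self.owner not in (None, app)
        if held_by_other and now < self.expires_at:
            return False
        self.owner = app
        self.expires_at = now + self.duration_s
        return True


lease = AppLease(duration_s=60)  # illustrative duration

first = lease.try_acquire("primary", now=0)       # primary starts first and wins
blocked = lease.try_acquire("backup", now=10)     # backup is locked out
renewed = lease.try_acquire("primary", now=30)    # primary renews; lease now ends at t=90

# The primary stops running after t=30 and no longer renews.
too_early = lease.try_acquire("backup", now=80)   # lease has not expired yet
taken_over = lease.try_acquire("backup", now=95)  # lease expired; backup takes over
```

Only the current lease owner would process task hub messages, which is what keeps the two apps from competing for the same queues.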

Prior to v2.3.0, function apps that are configured to use the same storage account process messages and update storage artifacts concurrently, resulting in much higher overall latencies and egress costs. If the primary and replica apps ever have different code deployed to them, even temporarily, orchestrations could also fail to execute correctly because of orchestrator function inconsistencies across the two apps. It is therefore recommended that all apps that require geo-distribution for disaster recovery purposes use v2.3.0 or higher of the Durable extension.

Scenario 2 - Load balanced compute with regional storage

The preceding scenario covers only the case of failure in the compute infrastructure. If the storage service fails, the result is an outage of the function app. To ensure continuous operation of the durable functions, this scenario uses a local storage account in each region to which the function apps are deployed.

Diagram showing scenario 2.

This approach improves on the previous scenario:

  • If the function app fails, Traffic Manager takes care of failing over to the secondary region. Because the function app there relies on its own storage account, the durable functions continue to work.
  • During a failover, there is no additional latency in the failover region, because the function app and the storage account are colocated.
  • Failure of the storage layer causes failures in the durable functions, which in turn triggers a redirection to the failover region. Again, because the function app and storage are isolated per region, the durable functions continue to work.

Important considerations for this scenario:

  • If the function app is deployed using a dedicated App Service plan, replicating the compute infrastructure in the failover datacenter increases costs.
  • Current state isn't failed over, which means that existing orchestrations and entities are effectively paused and unavailable until the primary region recovers.

To summarize, the tradeoff between the first and second scenarios is that the second preserves latency and minimizes egress costs, but existing orchestrations and entities are unavailable during the downtime. Whether these tradeoffs are acceptable depends on the requirements of the application.

Scenario 3 - Load balanced compute with GRS shared storage

This scenario is a modification of the first scenario and also uses a shared storage account. The main difference is that the storage account is created with geo-replication enabled. Functionally, this scenario provides the same advantages as Scenario 1, but it adds data recovery advantages:

  • Geo-redundant storage (GRS) and read-access GRS (RA-GRS) maximize availability for your storage account.
  • If there is a regional outage of the Storage service, you can manually initiate a failover to the secondary replica. In extreme circumstances where a region is lost due to a significant disaster, Microsoft may initiate a regional failover. In this case, no action on your part is required.
  • When a failover happens, the state of the durable functions is preserved up to the last replication of the storage account, which typically occurs every few minutes.

As with the other scenarios, there are important considerations:

  • A failover to the replica may take some time. Until the failover completes and the Azure Storage DNS records have been updated, the function app suffers an outage.
  • Geo-replicated storage accounts cost more than accounts without geo-replication.
  • GRS replication copies your data asynchronously. Some of the latest transactions might be lost because of the latency of the replication process.
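Because replication is asynchronous, the writes at risk in a failover are those made after the most recent successful replication. The sketch below partitions a list of writes around such a cutoff. It assumes you have a "last sync" timestamp for the secondary (Azure Storage reports a Last Sync Time for geo-replicated accounts) and a record of recent writes. The `split_by_last_sync` helper and the sample events are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Write = Tuple[datetime, str]  # (time of write, description)


def split_by_last_sync(
    writes: List[Write], last_sync_time: datetime
) -> Tuple[List[Write], List[Write]]:
    """Partition writes into those already replicated and those that may be lost."""
    replicated = [w for w in writes if w[0] <= last_sync_time]
    at_risk = [w for w in writes if w[0] > last_sync_time]
    return replicated, at_risk


# Hypothetical timeline: the secondary last synced at 12:00 UTC.
last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
writes = [
    (datetime(2024, 1, 1, 11, 58, tzinfo=timezone.utc), "orchestration A started"),
    (datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "activity scheduled for A"),
    (datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc), "activity result for A"),
]

replicated, at_risk = split_by_last_sync(writes, last_sync)
```

In this timeline, the activity result written at 12:03 might not exist in the secondary after a failover, so the orchestration would resume from the state as of 12:00.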

Diagram showing scenario 3.


As described in Scenario 1, it is strongly recommended that function apps deployed with this strategy use v2.3.0 or higher of the Durable Functions extension.

For more information, see the Azure Storage disaster recovery and storage account failover documentation.

Next steps