Disaster recovery and geo-distribution in Azure Durable Functions

In Durable Functions, all state is persisted in Azure Storage. A task hub is a logical container for the Azure Storage resources that orchestrations use. Orchestrator and activity functions can interact with each other only when they belong to the same task hub. The scenarios described here propose deployment options that increase availability and minimize downtime during disaster recovery.

Note that all of these scenarios are Active-Passive configurations, because they are driven by the use of Azure Storage. The pattern consists of deploying a backup (passive) function app to a different region. Traffic Manager monitors the primary (active) function app for availability and fails over to the backup function app if the primary fails. For more information, see Traffic Manager's Priority Traffic-Routing Method.
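The priority-based selection described above can be sketched in a few lines of Python. This is an illustrative model only, not Traffic Manager's implementation: the `Endpoint` class, the `pick_endpoint` function, and the endpoint names are hypothetical, and health is set by hand rather than by a real probe.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str       # hypothetical endpoint label
    priority: int   # lower number = preferred endpoint
    healthy: bool   # outcome of the most recent (simulated) health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority value, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


primary = Endpoint("func-primary", priority=1, healthy=True)
backup = Endpoint("func-backup", priority=2, healthy=True)

print(pick_endpoint([primary, backup]).name)  # primary is healthy, so it is chosen
primary.healthy = False
print(pick_endpoint([primary, backup]).name)  # primary is down, so traffic moves to backup
primary.healthy = True
print(pick_endpoint([primary, backup]).name)  # primary has recovered, so traffic returns to it
```

The passive app receives traffic only while every higher-priority endpoint is unhealthy, which is why the configuration is Active-Passive rather than load-sharing at the HTTP entry point.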


  • The proposed Active-Passive configuration ensures that a client can always trigger new orchestrations via HTTP. However, because the two function apps share the same storage, background processing is distributed between them, and they compete for messages on the same queues. This configuration incurs additional egress costs for the secondary function app.
  • The underlying storage account and task hub are created in the primary region and are shared by both function apps.
  • All redundantly deployed function apps must share the same function access keys if they are activated via HTTP. The Functions runtime exposes a management API that lets consumers add, delete, and update function keys programmatically.

Scenario 1 - Load balanced compute with shared storage

If the compute infrastructure in Azure fails, the function app may become unavailable. To minimize the possibility of such downtime, this scenario uses two function apps deployed to different regions. Traffic Manager is configured to detect problems in the primary function app and automatically redirect traffic to the function app in the secondary region. The secondary function app shares the same Azure Storage account and task hub as the primary. Therefore, the state of the function apps isn't lost and work can resume normally. Once health is restored in the primary region, Traffic Manager automatically starts routing requests to the primary function app again.

Diagram showing Scenario 1.

This deployment scenario has several benefits:

  • If the compute infrastructure fails, work can resume in the failover region without state loss.
  • Traffic Manager handles the failover to the healthy function app automatically.
  • Traffic Manager automatically re-establishes traffic to the primary function app after the outage has been corrected.

However, consider the following when using this scenario:

  • If the function app is deployed using a dedicated App Service plan, replicating the compute infrastructure in the failover datacenter increases costs.
  • This scenario covers outages of the compute infrastructure, but the storage account remains a single point of failure for the function app. If there is a storage outage, the application suffers downtime.
  • If the function app has failed over, latency increases because the app accesses its storage account across regions.
  • Accessing the storage service from a region other than the one where it's located incurs higher costs because of network egress traffic.
  • This scenario depends on Traffic Manager. Traffic Manager works at the DNS level, so it may be some time before a client application that consumes a Durable Function queries Traffic Manager again for the function app address.
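The last consideration above can be illustrated with a small simulation of a client that caches a resolved address for a fixed time-to-live. This is a sketch under stated assumptions: `CachingResolver`, the 60-second TTL, and the address strings are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class CachingResolver:
    """Caches the result of a lookup until its time-to-live expires."""

    def __init__(self, lookup, ttl_seconds):
        self._lookup = lookup        # callable returning the currently active address
        self._ttl = ttl_seconds      # how long a cached answer is reused
        self._cached = None
        self._expires_at = 0.0

    def resolve(self, now):
        # Query the lookup again only when nothing is cached or the entry has expired.
        if self._cached is None or now >= self._expires_at:
            self._cached = self._lookup()
            self._expires_at = now + self._ttl
        return self._cached


active = {"address": "primary"}  # what the (simulated) name service would answer
resolver = CachingResolver(lambda: active["address"], ttl_seconds=60)

print(resolver.resolve(now=0))   # first lookup returns "primary"
active["address"] = "backup"     # failover happens on the service side
print(resolver.resolve(now=30))  # still "primary": the cached answer has not expired
print(resolver.resolve(now=60))  # cache expired, so the client now sees "backup"
```

Between the failover and the cache expiry, the client keeps sending requests to the failed primary, so the observed outage can be longer than the time it takes to detect the failure.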

Scenario 2 - Load balanced compute with regional storage

The preceding scenario covers only failures in the compute infrastructure. If the storage service fails, the function app suffers an outage. To ensure continuous operation of the durable functions, this scenario uses a local storage account in each region to which the function apps are deployed.

Diagram showing Scenario 2.

This approach improves on the previous scenario:

  • If the function app fails, Traffic Manager fails over to the secondary region. Because the function app there relies on its own storage account, the durable functions continue to work.
  • During a failover, there is no additional latency in the failover region, because the function app and the storage account are co-located.
  • A failure of the storage layer causes failures in the durable functions, which in turn trigger a redirection to the failover region. Again, because the function app and storage are isolated per region, the durable functions continue to work.

Important considerations for this scenario:

  • If the function app is deployed using a dedicated App Service plan, replicating the compute infrastructure in the failover datacenter increases costs.
  • Current state isn't failed over, which means that functions that are executing or have been checkpointed will fail. It's up to the client application to retry or restart the work.
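Because state is not carried over in this scenario, the client has to resubmit work that was lost. A minimal client-side retry loop might look like the sketch below. The `run_with_retry` helper and the `start_work` callable are hypothetical; in a real client, `start_work` would be whatever call triggers the orchestration through the Traffic Manager address.

```python
import time


def run_with_retry(start_work, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call start_work until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return start_work()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"work failed after {max_attempts} attempts") from last_error


# Simulated work that fails twice (for example, during a failover) and then succeeds.
calls = {"count": 0}


def flaky_start():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("endpoint unavailable")
    return "started"


# sleep is replaced with a no-op so the demonstration finishes immediately.
print(run_with_retry(flaky_start, sleep=lambda seconds: None))
```

A restarted orchestration begins from scratch in the failover region, so this approach fits best when the work can safely be repeated.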

Scenario 3 - Load balanced compute with GRS shared storage

This scenario is a modification of the first scenario and also uses a shared storage account. The main difference is that the storage account is created with geo-replication enabled. Functionally, this scenario provides the same advantages as Scenario 1, but it adds data recovery advantages:

  • Geo-redundant storage (GRS) and read-access GRS (RA-GRS) maximize availability for your storage account.
  • If there is a regional outage of the storage service, one possibility is that datacenter operations decide that storage must be failed over to the secondary region. In that case, access to the storage account is redirected transparently to the geo-replicated copy, without user intervention.
  • In that case, the state of the durable functions is preserved up to the last replication of the storage account, which occurs every few minutes.

As with the other scenarios, there are important considerations:

  • Failover to the replica is performed by datacenter operators and may take some time. Until it completes, the function app suffers an outage.
  • Using geo-replicated storage accounts increases cost.
  • GRS replication occurs asynchronously. Some of the latest transactions might be lost because of the latency of the replication process.
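The effect of asynchronous replication can be made concrete with a small model: writes made after the most recent replication point may not exist in the secondary copy. This is a simplified sketch, not how Azure Storage tracks replication; `split_by_last_sync` and the timestamps (seconds since an arbitrary start) are hypothetical.

```python
def split_by_last_sync(write_times, last_sync_time):
    """Separate writes known to be replicated from writes that may be lost."""
    preserved = [t for t in write_times if t <= last_sync_time]
    at_risk = [t for t in write_times if t > last_sync_time]
    return preserved, at_risk


# Orchestration checkpoints written at these times; replication last completed at t=300.
checkpoints = [100, 250, 310, 340]
preserved, at_risk = split_by_last_sync(checkpoints, last_sync_time=300)

print(preserved)  # [100, 250]
print(at_risk)    # [310, 340]
```

After a storage failover, orchestrations resume from the preserved checkpoints, and any progress recorded only in the at-risk writes has to be redone.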

Diagram showing Scenario 3.

Next steps

You can read more about Designing Highly Available Applications using RA-GRS.