Designing globally available services using Azure SQL Database

APPLIES TO: Azure SQL Database

When building and deploying cloud services with Azure SQL Database, you can use active geo-replication or auto-failover groups to provide resilience to regional outages and catastrophic failures. The same features also let you create globally distributed applications optimized for local access to the data. This article discusses common application patterns, including the benefits and trade-offs of each option.

Scenario 1: Using two Azure regions for business continuity with minimal downtime

In this scenario, the applications have the following characteristics:

  • The application is active in one Azure region
  • All database sessions require read and write access (RW) to data
  • The web tier and data tier must be collocated to reduce latency and traffic cost
  • Fundamentally, downtime is a higher business risk for these applications than data loss

In this case, the application deployment topology is optimized for handling regional disasters in which all application components need to fail over together. The diagram below shows this topology. For geographic redundancy, the application's resources are deployed to Region A and Region B. However, the resources in Region B are not utilized until Region A fails. A failover group is configured between the two regions to manage database connectivity, replication, and failover. The web service in both regions is configured to access the database via the read-write listener <failover-group-name>.database.chinacloudapi.cn (1). Azure Traffic Manager is set up to use the priority routing method (2).


Azure Traffic Manager is used throughout this article for illustration purposes only. You can use any load-balancing solution that supports the priority routing method.
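To make the priority routing method concrete, here is a minimal sketch of the selection rule, not Traffic Manager's actual implementation. The `Endpoint` type, the `pick_endpoint` function, and the endpoint name web-app-1 are hypothetical names introduced for illustration. The sketch assumes that a lower priority number means a more preferred endpoint.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int  # assumption: lower number = more preferred
    healthy: bool  # result of the most recent health probe


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


if __name__ == "__main__":
    # Hypothetical endpoints: Region A is preferred, Region B is the standby.
    normal = [Endpoint("web-app-1", 1, True), Endpoint("web-app-2", 2, True)]
    outage = [Endpoint("web-app-1", 1, False), Endpoint("web-app-2", 2, True)]
    print(pick_endpoint(normal).name)  # web-app-1
    print(pick_endpoint(outage).name)  # web-app-2
```

Because the rule depends only on health and priority, traffic returns to the preferred endpoint as soon as its probe succeeds again.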

The following diagram shows this configuration before an outage:

[Figure: Scenario 1. Configuration before an outage]

After an outage in the primary region, SQL Database detects that the primary database is not accessible and triggers failover to the secondary region based on the parameters of the automatic failover policy (1). Depending on your application SLA, you can configure a grace period that controls the time between the detection of the outage and the failover itself. It is possible that Azure Traffic Manager initiates the endpoint failover before the failover group triggers the failover of the database. In that case, the web application cannot immediately reconnect to the database, but the reconnections succeed automatically as soon as the database failover completes. When the failed region is restored and back online, the old primary automatically reconnects as a new secondary. The diagram below illustrates the configuration after failover.


All transactions committed on the old primary after the failover are lost when it reconnects as a secondary. After the failover is completed, the application in Region B is able to reconnect and restart processing the user requests. Both the web application and the primary database are now in Region B and remain co-located.

[Figure: Scenario 1. Configuration after failover]
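Because Traffic Manager can switch endpoints before the database failover completes, the web application in Region B should expect its first connection attempts to fail and retry them. The following is a minimal sketch of such a retry loop with exponential backoff. The function name, the default delays, and the use of `ConnectionError` as a stand-in for whatever exception your database driver raises are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
    raise ValueError("max_attempts must be at least 1")
```

The `connect` callable would typically open a connection through the read-write listener, so no connection string changes are needed once the failover group has completed the failover.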

If an outage happens in Region B, the replication process between the primary and the secondary database is suspended, but the link between the two remains intact (1). Traffic Manager detects that connectivity to Region B is broken and marks the endpoint web app 2 as Degraded (2). The application's performance is not impacted in this case, but the database becomes exposed and is therefore at higher risk of data loss if Region A fails in succession.


For disaster recovery, we recommend a configuration with application deployment limited to two regions, because most Azure geographies have only two regions. This configuration does not protect your application from a simultaneous catastrophic failure of both regions. In the unlikely event of such a failure, you can recover your databases in a third region using a geo-restore operation.

Once the outage is mitigated, the secondary database automatically resynchronizes with the primary. During synchronization, performance of the primary can be impacted. The specific impact depends on the amount of data the new primary acquired since the failover.


After the outage is mitigated, Traffic Manager starts routing connections to the application in Region A as the higher-priority endpoint. If you intend to keep the primary in Region B for a while, you should change the priority table in the Traffic Manager profile accordingly.

The following diagram illustrates an outage in the secondary region:

[Figure: Scenario 1. Outage in the secondary region]

The key advantages of this design pattern are:

  • The same web application is deployed to both regions without any region-specific configuration and doesn't require additional logic to manage failover.
  • Application performance is not impacted by failover because the web application and the database are always co-located.

The main tradeoff is that the application resources in Region B are underutilized most of the time.

Scenario 2: Azure regions for business continuity with maximum data preservation

This option is best suited for applications with the following characteristics:

  • Any data loss is a high business risk. The database failover can only be used as a last resort if the outage is caused by a catastrophic failure.
  • The application supports read-only and read-write modes of operation and can operate in "read-only mode" for a period of time.

In this pattern, the application switches to read-only mode when the read-write connections start getting time-out errors. The web application is deployed to both regions and includes a connection to the read-write listener endpoint and a different connection to the read-only listener endpoint (1). The Traffic Manager profile should use priority routing. Endpoint monitoring should be enabled for the application endpoint in each region (2).

The following diagram illustrates this configuration before an outage:

[Figure: Scenario 2. Configuration before an outage]

When Traffic Manager detects a connectivity failure to Region A, it automatically switches user traffic to the application instance in Region B. With this pattern, it is important that you set the grace period with data loss to a sufficiently high value, for example 24 hours. This ensures that data loss is prevented if the outage is mitigated within that time. When the web application in Region B is activated, the read-write operations start failing. At that point, it should switch to read-only mode (1). In this mode, the requests are automatically routed to the secondary database. If the outage is caused by a catastrophic failure, most likely it cannot be mitigated within the grace period. When the grace period expires, the failover group triggers the failover. After that, the read-write listener becomes available and the connections to it stop failing (2). The following diagram illustrates the two stages of the recovery process.


If the outage in the primary region is mitigated within the grace period, Traffic Manager detects the restoration of connectivity in the primary region and switches user traffic back to the application instance in Region A. That application instance resumes and operates in read-write mode using the primary database in Region A, as illustrated by the previous diagram.

[Figure: Scenario 2. Two stages of the recovery process]
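The switch between read-write and read-only mode has to be implemented by the application itself. The following is a minimal sketch of one way to do it, assuming the application wraps its database calls in two callables: one that runs a statement through the read-write listener and one that uses the read-only listener. The `DataAccess` class, its method names, and the use of `TimeoutError` to represent a time-out are hypothetical choices made for illustration.

```python
from typing import Callable


class DataAccess:
    """Serves reads from the read-only listener once the read-write listener times out."""

    def __init__(self, rw_query: Callable[[str], str], ro_query: Callable[[str], str]):
        self._rw_query = rw_query  # runs a statement via the read-write listener
        self._ro_query = ro_query  # runs a statement via the read-only listener
        self.read_only = False

    def read(self, sql: str) -> str:
        if not self.read_only:
            try:
                return self._rw_query(sql)
            except TimeoutError:
                self.read_only = True  # degrade instead of failing the request
        return self._ro_query(sql)

    def write(self, sql: str) -> str:
        try:
            result = self._rw_query(sql)
        except TimeoutError:
            self.read_only = True
            raise  # the caller decides how to report the rejected write
        self.read_only = False  # the read-write listener is reachable again
        return result
```

In this sketch every write still probes the read-write listener, so the application returns to read-write mode on the first write that succeeds after the failover group completes the failover. A production application would more likely probe on a timer to avoid making users wait for a time-out.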

If an outage happens in Region B, Traffic Manager detects the failure of the endpoint web-app-2 in Region B and marks it as Degraded (1). In the meantime, the failover group switches the read-only listener to Region A (2). This outage does not impact the end-user experience, but the primary database is exposed during the outage. The following diagram illustrates a failure in the secondary region:

[Figure: Scenario 2. Failure in the secondary region]

Once the outage is mitigated, the secondary database is immediately synchronized with the primary and the read-only listener is switched back to the secondary database in Region B. During synchronization, performance of the primary could be slightly impacted, depending on the amount of data that needs to be synchronized.

This design pattern has several advantages:

  • It avoids data loss during temporary outages.
  • Downtime depends only on how quickly Traffic Manager detects the connectivity failure, which is configurable.

The tradeoff is that the application must be able to operate in read-only mode.

Scenario 3: Application relocation to a different geography without data loss and near zero downtime

In this scenario, the application has the following characteristics:

  • The end users access the application from different geographies
  • The application includes read-only workloads that do not depend on full synchronization with the latest updates
  • Write access to data should be supported in the same geography for the majority of the users
  • Read latency is critical for the end-user experience

To meet these requirements, you need to guarantee that the user device always connects to the application deployed in the same geography for read-only operations, such as browsing data and analytics. The OLTP operations, by contrast, are processed in the same geography most of the time. For example, during the daytime OLTP operations are processed in the same geography, but during off hours they could be processed in a different geography. If end-user activity mostly happens during working hours, you can guarantee optimal performance for most of the users most of the time. The following diagram shows this topology.

The application's resources should be deployed in each geography where you have substantial usage demand. For example, if your application is actively used in the United States, the European Union, and Southeast Asia, the application should be deployed to all of these geographies. The primary database should be dynamically switched from one geography to the next at the end of the working hours. This method is called "follow the sun". The OLTP workload always connects to the database via the read-write listener <failover-group-name>.database.chinacloudapi.cn (1). The read-only workload connects to the local database directly using the database server endpoint <server-name>.database.chinacloudapi.cn (2). Traffic Manager is configured with the performance routing method, which ensures that the end user's device is connected to the web service in the closest region. Traffic Manager should be set up with endpoint monitoring enabled for each web service endpoint (3).


The failover group configuration defines which region is used for failover. Because the new primary is in a different geography, the failover results in longer latency for both OLTP and read-only workloads until the impacted region is back online.

[Figure: Scenario 3. Application topology]
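The routing in this topology can be summarized in a few lines. The following sketch is an illustration only, not how Traffic Manager measures latency: it picks the healthy region with the lowest latency (performance routing) and builds the two database host names that the web service in a region would use. The function names, the latency figures, and the failover group and server names in the usage example are hypothetical.

```python
from typing import Dict, Set

DNS_SUFFIX = "database.chinacloudapi.cn"  # suffix used by the endpoints in this article


def closest_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Performance routing: the healthy region with the lowest measured latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise LookupError("no healthy region available")
    return min(candidates, key=candidates.__getitem__)


def database_hosts(failover_group: str, local_server: str) -> Dict[str, str]:
    """Database hosts used by the web service in one region."""
    return {
        "oltp": f"{failover_group}.{DNS_SUFFIX}",     # read-write listener (1)
        "read_only": f"{local_server}.{DNS_SUFFIX}",  # local server endpoint (2)
    }


if __name__ == "__main__":
    # Hypothetical latencies as seen from a user in Europe.
    latency = {"East US": 120.0, "North Europe": 25.0, "East Asia": 210.0}
    print(closest_region(latency, {"East US", "North Europe", "East Asia"}))
    print(database_hosts("my-failover-group", "my-server-north-europe"))
```

The OLTP host is the same in every region because the listener follows the primary, while the read-only host differs per region.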

At the end of the day, for example at 11 PM local time, the active databases should be switched to the next region (North Europe). This task can be fully automated by using Azure Logic Apps. The task involves the following steps:

  • Switch the primary server in the failover group to North Europe using friendly failover (1)
  • Remove the failover group between East US and North Europe
  • Create a new failover group with the same name, but between North Europe and East Asia (2)
  • Add the primary in North Europe and the secondary in East Asia to this failover group (3)
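The steps above can be sketched as a small planning routine. This example does not call any Azure API. It only computes which regions are involved at each hand-over and lists the four operations in order, which is the logic an automation workflow would need. The function names and the failover group name are hypothetical, and the region order is an assumption based on the regions named in this article.

```python
from typing import List, Sequence, Tuple


def next_rotation(regions: Sequence[str], current: str) -> Tuple[str, str]:
    """Return (next primary region, its new secondary region) in follow-the-sun order."""
    i = regions.index(current)
    nxt = regions[(i + 1) % len(regions)]
    following = regions[(i + 2) % len(regions)]
    return nxt, following


def rotation_plan(group: str, regions: Sequence[str], current: str) -> List[str]:
    """List the hand-over steps for moving the primary out of `current`."""
    nxt, following = next_rotation(regions, current)
    return [
        f"friendly failover of '{group}': primary moves from {current} to {nxt}",
        f"remove failover group '{group}' between {current} and {nxt}",
        f"create failover group '{group}' between {nxt} and {following}",
        f"add primary in {nxt} and secondary in {following} to '{group}'",
    ]


if __name__ == "__main__":
    order = ["East US", "North Europe", "East Asia"]  # assumed rotation order
    for step in rotation_plan("my-failover-group", order, "East US"):
        print(step)
```

Reusing the same failover group name at every hand-over keeps the read-write listener host name stable, so the OLTP connection string never changes.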

The following diagram illustrates the new configuration after the planned failover:

[Figure: Scenario 3. Configuration after the planned failover]

If an outage happens in North Europe, for example, the failover group initiates the automatic database failover, which effectively moves the application to the next region ahead of schedule (1). In that case, East US is the only remaining secondary region until North Europe is back online. The remaining two regions serve the customers in all three geographies by switching roles. The Azure Logic Apps workflow has to be adjusted accordingly. Because the remaining regions get additional user traffic from Europe, the application's performance is impacted not only by additional latency but also by an increased number of end-user connections. Once the outage is mitigated in North Europe, the secondary database there is immediately synchronized with the current primary. The following diagram illustrates an outage in North Europe:

[Figure: Scenario 3. Outage in North Europe]


You can reduce the time during which the end-user experience in Europe is degraded by the long latency. To do that, you should proactively deploy an application copy and create the secondary database(s) in another local region (West Europe) as a replacement for the offline application instance in North Europe. When the latter is back online, you can decide whether to continue using West Europe or to remove the copy of the application there and switch back to using North Europe.

The key benefits of this design are:

  • The read-only application workload accesses data in the closest region at all times.
  • The read-write application workload accesses data in the closest region during the period of the highest activity in each geography.
  • Because the application is deployed to multiple regions, it can survive the loss of one of the regions without any significant downtime.

But there are some tradeoffs:

  • A regional outage results in the geography being impacted by longer latency. Both read-write and read-only workloads are served by the application in a different geography.
  • The read-only workloads must connect to a different endpoint in each region.

Business continuity planning: Choose an application design for cloud disaster recovery

Your specific cloud disaster recovery strategy can combine or extend these design patterns to best meet the needs of your application. As mentioned earlier, the strategy you choose is based on the SLA you want to offer to your customers and the application deployment topology. To help guide your decision, the following table compares the choices based on recovery point objective (RPO) and estimated recovery time (ERT).

| Pattern | RPO | ERT |
|---|---|---|
| Active-passive deployment for disaster recovery with co-located database access | Read-write access < 5 sec | Failure detection time + DNS TTL |
| Active-active deployment for application load balancing | Read-write access < 5 sec | Failure detection time + DNS TTL |
| Active-passive deployment for data preservation | Read-only access < 5 sec; read-write access = zero | Read-only access = 0; read-write access = failure detection time + grace period with data loss |
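As a worked example of the ERT column, the following sketch adds up the terms for the two kinds of patterns. The function names are hypothetical, and the detection time and DNS TTL used in the usage example are assumed values for illustration only. The 24-hour grace period is the example value mentioned in Scenario 2.

```python
from datetime import timedelta
from typing import Dict


def ert_with_failover(detection: timedelta, dns_ttl: timedelta) -> timedelta:
    """ERT for the patterns that fail over right away: detection time + DNS TTL."""
    return detection + dns_ttl


def ert_data_preservation(detection: timedelta, grace_period: timedelta) -> Dict[str, timedelta]:
    """ERT for the data preservation pattern, split by access mode as in the table."""
    return {
        "read_only": timedelta(0),
        "read_write": detection + grace_period,
    }


if __name__ == "__main__":
    detection = timedelta(seconds=30)  # assumed failure detection time
    dns_ttl = timedelta(seconds=60)    # assumed DNS TTL
    print(ert_with_failover(detection, dns_ttl))
    print(ert_data_preservation(detection, timedelta(hours=24)))
```

The comparison makes the tradeoff explicit: the data preservation pattern keeps read-only access available throughout, but read-write access can stay unavailable for the whole grace period.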

Next steps