Overview of business continuity with Azure SQL Database

Business continuity in Azure SQL Database refers to the mechanisms, policies, and procedures that enable your business to continue operating in the face of disruption, particularly to its computing infrastructure. In most cases, Azure SQL Database handles the disruptive events that might happen in the cloud environment and keeps your applications and business processes running. However, there are some disruptive events that SQL Database cannot handle automatically, such as:

  • A user accidentally deleted or updated a row in a table.
  • A malicious attacker succeeded in deleting data or dropping a database.
  • An earthquake caused a power outage and temporarily disabled a datacenter.

This overview describes the capabilities that Azure SQL Database provides for business continuity and disaster recovery. Learn about options, recommendations, and tutorials for recovering from disruptive events that could cause data loss or cause your database and application to become unavailable. Learn what to do when a user or application error affects data integrity, an Azure region has an outage, or your application requires maintenance.

SQL Database features that you can use to provide business continuity

From a database perspective, there are four major potential disruption scenarios:

  • Local hardware or software failures affecting the database node, such as a disk-drive failure.
  • Data corruption or deletion, typically caused by an application bug or human error. Such failures are application-specific and typically cannot be detected by the database service.
  • Datacenter outage, possibly caused by a natural disaster. This scenario requires some level of geo-redundancy with application failover to an alternate datacenter.
  • Upgrade or maintenance errors: unanticipated issues that occur during planned infrastructure maintenance or upgrades may require rapid rollback to a prior database state.

To mitigate local hardware and software failures, SQL Database includes a high availability architecture, which guarantees automatic recovery from these failures with an availability SLA of up to 99.995%.

To protect your business from data loss, SQL Database automatically creates full database backups weekly, differential database backups every 12 hours, and transaction log backups every 5 to 10 minutes. The backups are stored in RA-GRS storage for at least 7 days for all service tiers. All service tiers except Basic support a configurable backup retention period for point-in-time restore, up to 35 days.
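The retention rules above can be sketched in a few lines of Python. This is only an illustration of the limits stated in the paragraph (a 7-day minimum, a 35-day maximum, and no configurable retention on Basic); the function names and the tier strings are hypothetical and are not part of any Azure SDK.

```python
from datetime import datetime, timedelta

# Limits taken from the paragraph above. Everything else in this sketch
# (function names, tier strings) is hypothetical and for illustration only.
MIN_RETENTION_DAYS = 7
MAX_RETENTION_DAYS = 35


def effective_retention_days(service_tier: str, requested_days: int) -> int:
    """Return the point-in-time restore retention that would apply."""
    if service_tier.lower() == "basic":
        # Basic does not support a configurable retention period.
        return MIN_RETENTION_DAYS
    # Other tiers: clamp the requested value to the 7..35 day range.
    return max(MIN_RETENTION_DAYS, min(requested_days, MAX_RETENTION_DAYS))


def can_restore_to(point: datetime, now: datetime, retention_days: int) -> bool:
    """True if the requested restore point lies inside the retention window."""
    return now - timedelta(days=retention_days) <= point <= now
```

For example, with a 14-day retention a restore point 11 days in the past is inside the window, while one 30 days in the past is not and would need a long-term retention policy instead.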

SQL Database also provides several business continuity features that you can use to mitigate various unplanned scenarios.

Recover a database within the same Azure region

You can use automatic database backups to restore a database to a point in time in the past. This way you can recover from data corruption caused by human error. Point-in-time restore lets you create a new database in the same server that represents the state of the data prior to the corrupting event. For most databases, the restore operation takes less than 12 hours. It may take longer to recover a very large or very active database. For more information about recovery time, see database recovery time.

If the maximum supported backup retention period for point-in-time restore (PITR) is not sufficient for your application, you can extend it by configuring a long-term retention (LTR) policy for the database or databases. For more information, see Long-term backup retention.

Compare geo-replication with failover groups

Auto-failover groups simplify the deployment and usage of geo-replication and add the capabilities described in the following table:

Capability                                   Geo-replication   Failover groups
Automatic failover                           No                Yes
Fail over multiple databases simultaneously  No                Yes
Update connection string after failover      Yes               No
Managed instance supported                   No                Yes
Can be in same region as primary             Yes               No
Multiple replicas                            Yes               No
Supports read-scale                          Yes               Yes

Recover a database to the existing server

Although rare, an Azure data center can have an outage. When an outage occurs, it causes a business disruption that might last only a few minutes or might last for hours.

  • One option is to wait for your database to come back online when the data center outage is over. This works for applications that can afford to have the database offline, for example a development project or a free trial that you don't need to work on constantly. When a data center has an outage, you do not know how long the outage might last, so this option only works if you don't need your database for a while.
  • Another option is to restore a database on any server in any Azure region using geo-redundant database backups (geo-restore). Geo-restore uses a geo-redundant backup as its source and can be used to recover a database even if the database or datacenter is inaccessible due to an outage.
  • Finally, you can quickly recover from an outage if you have configured either a geo-secondary using active geo-replication or an auto-failover group for your database or databases. Depending on your choice of these technologies, you can use either manual or automatic failover. While the failover itself takes only a few seconds, the service takes at least 1 hour to activate an automatic failover. This delay is necessary to ensure that the failover is justified by the scale of the outage. Also, the failover may result in a small amount of data loss due to the nature of asynchronous replication.

As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after the disruptive event. The time required for an application to fully recover is known as the recovery time objective (RTO). You also need to understand the maximum period of recent data updates (time interval) the application can tolerate losing when recovering from an unplanned disruptive event. The potential data loss is known as the recovery point objective (RPO).

Different recovery methods offer different levels of RPO and RTO. You can choose a specific recovery method, or use a combination of methods to achieve full application recovery. The following table compares the RPO and RTO of each recovery option.

Recovery method                          RTO    RPO
Geo-restore from geo-replicated backups  12 h   1 h
Auto-failover groups                     1 h    5 s
Manual database failover                 30 s   5 s

Note

Manual database failover refers to failover of a single database to its geo-replicated secondary using the unplanned mode. See the table earlier in this article for details of the auto-failover RTO and RPO.
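As a rough illustration of how the RTO and RPO comparison can drive a choice of recovery method, the following Python sketch filters the options by an application's tolerances. The RTO and RPO values are copied from the comparison table; the class and function names are hypothetical, and a real decision would also weigh cost and operational complexity.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rto: timedelta  # time until the application is fully recovered
    rpo: timedelta  # window of recent updates that may be lost


# RTO/RPO values copied from the comparison table in this article.
OPTIONS = [
    RecoveryOption("Geo-restore from geo-replicated backups",
                   timedelta(hours=12), timedelta(hours=1)),
    RecoveryOption("Auto-failover groups",
                   timedelta(hours=1), timedelta(seconds=5)),
    RecoveryOption("Manual database failover",
                   timedelta(seconds=30), timedelta(seconds=5)),
]


def options_meeting(max_downtime: timedelta, max_data_loss: timedelta) -> list[str]:
    """Return the names of recovery options whose RTO and RPO both fit."""
    return [
        option.name
        for option in OPTIONS
        if option.rto <= max_downtime and option.rpo <= max_data_loss
    ]
```

An application that tolerates 2 hours of downtime and 1 minute of data loss, for instance, is served by auto-failover groups or manual database failover, but not by geo-restore alone.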

Use auto-failover groups if your application meets any of these criteria:

  • Is mission critical.
  • Has a service level agreement (SLA) that does not allow for 12 hours or more of downtime.
  • Downtime may result in financial liability.
  • Has a high rate of data change, and 1 hour of data loss is not acceptable.
  • The additional cost of active geo-replication is lower than the potential financial liability and associated loss of business.

You may choose to use a combination of database backups and active geo-replication depending on your application requirements. For a discussion of design considerations for stand-alone databases and for elastic pools using these business continuity features, see Design an application for cloud disaster recovery and Elastic pool disaster recovery strategies.

The following sections provide an overview of the steps to recover using either database backups or active geo-replication. For detailed steps including planning requirements, post-recovery steps, and information about how to simulate an outage to perform a disaster recovery drill, see Recover a SQL Database from an outage.

Prepare for an outage

Regardless of the business continuity feature you use, you must:

  • Identify and prepare the target server, including server-level IP firewall rules, logins, and master database level permissions.
  • Determine how to redirect clients and client applications to the new server.
  • Document other dependencies, such as auditing settings and alerts.

If you do not prepare properly, bringing your applications online after a failover or a database recovery takes additional time and likely also requires troubleshooting at a time of stress, which is a bad combination.

Fail over to a geo-replicated secondary database

If you are using active geo-replication or auto-failover groups as your recovery mechanism, you can configure an automatic failover policy or use manual unplanned failover. Once initiated, the failover causes the secondary to become the new primary, ready to record new transactions and respond to queries, with minimal data loss for the data not yet replicated. For information on designing the failover process, see Design an application for cloud disaster recovery.

Note

When the data center comes back online, the old primaries automatically reconnect to the new primary and become secondary databases. If you need to relocate the primary back to the original region, you can initiate a planned failover manually (failback).

Perform a geo-restore

If you are using automated backups with geo-redundant storage (enabled by default), you can recover the database using geo-restore. Recovery usually takes place within 12 hours, with data loss of up to one hour, determined by when the last log backup was taken and replicated. Until the recovery completes, the database is unable to record any transactions or respond to any queries. Note that geo-restore only restores the database to the last available point in time.

Note

If the data center comes back online before you switch your application over to the recovered database, you can cancel the recovery.

Perform post-failover and post-recovery tasks

After recovery from either recovery mechanism, you must perform the following additional tasks before your users and applications are back up and running:

  • Redirect clients and client applications to the new server and restored database.
  • Ensure appropriate server-level IP firewall rules are in place for users to connect, or use database-level firewalls to enable appropriate rules.
  • Ensure appropriate logins and master database level permissions are in place (or use contained users).
  • Configure auditing, as appropriate.
  • Configure alerts, as appropriate.

Note

If you are using a failover group and connect to the databases using the read-write listener, the redirection after failover happens automatically and transparently to the application.
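The practical difference between the two redirection approaches shows up in the connection string. The sketch below builds an ODBC-style connection string in Python and assumes that the read-write listener is addressed as the failover group name followed by .database.windows.net; verify that host name form against the current failover group documentation. The server, failover group, and database names are hypothetical, and credentials are deliberately omitted.

```python
def sql_endpoint(name: str) -> str:
    """Host/port for a logical server or a failover group listener.

    Assumes the <name>.database.windows.net host name form.
    """
    return f"tcp:{name}.database.windows.net,1433"


def connection_string(endpoint_name: str, database: str) -> str:
    """Build an ODBC-style connection string (authentication settings omitted)."""
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={sql_endpoint(endpoint_name)};"
        f"Database={database};"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )


# Hypothetical names, for illustration only.
# Pointing at a specific server: must be changed after a failover.
direct = connection_string("contoso-primary-server", "orders")

# Pointing at the failover group's read-write listener: stays the same
# after a failover, because the listener follows the current primary.
via_listener = connection_string("contoso-failover-group", "orders")
```

With the listener form, the redirect step in the task list above needs no application change; with the direct form, every client configuration that names the old server has to be updated.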

Upgrade an application with minimal downtime

Sometimes an application must be taken offline because of planned maintenance, such as an application upgrade. Manage application upgrades describes how to use active geo-replication to enable rolling upgrades of your cloud application, which minimizes downtime during upgrades and provides a recovery path if something goes wrong.

Next steps

For a discussion of application design considerations for stand-alone databases and for elastic pools, see Design an application for cloud disaster recovery and Elastic pool disaster recovery strategies.