Overview of business continuity with Azure SQL Database

Business continuity in Azure SQL Database refers to the mechanisms, policies, and procedures that enable your business to continue operating in the face of disruption, particularly to its computing infrastructure. In most cases, Azure SQL Database handles the disruptive events that might happen in the cloud environment and keeps your applications and business processes running. However, SQL Database cannot handle some disruptive events, such as:

  • A user accidentally deleted or updated a row in a table.
  • A malicious attacker succeeded in deleting data or dropping a database.
  • An earthquake caused a power outage and temporarily disabled a datacenter.

Azure SQL Database cannot control these cases, so you need to use the business continuity features in SQL Database that enable you to recover your data and keep your applications running.

This overview describes the capabilities that Azure SQL Database provides for business continuity and disaster recovery. Learn about options, recommendations, and tutorials for recovering from disruptive events that could cause data loss or cause your database and application to become unavailable. Learn what to do when a user or application error affects data integrity, an Azure region has an outage, or your application requires maintenance.

SQL Database features that you can use to provide business continuity

From a database perspective, there are four major potential disruption scenarios:

  • Local hardware or software failures affecting the database node, such as a disk-drive failure.
  • Data corruption or deletion, typically caused by an application bug or human error. Such failures are intrinsically application-specific and, as a rule, cannot be detected or mitigated automatically by the infrastructure.
  • Datacenter outage, possibly caused by a natural disaster. This scenario requires some level of geo-redundancy with application failover to an alternate datacenter.
  • Upgrade or maintenance errors. Unanticipated issues that occur during planned upgrades or maintenance to an application or database may require rapid rollback to a prior database state.

SQL Database provides several business continuity features, including automated backups and optional database replication, that can mitigate these scenarios. First, you need to understand how the SQL Database high availability architecture provides 99.99% availability and resiliency to some disruptive events that might affect your business process. Then, you can learn about the additional mechanisms that you can use to recover from the disruptive events that the SQL Database high availability architecture cannot handle, such as point-in-time restore from automated backups, geo-restore, active geo-replication, and auto-failover groups.

Each has different characteristics for estimated recovery time (ERT) and potential data loss for recent transactions. Once you understand these options, you can choose among them and, in most cases, use them together to cover different scenarios. As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after the disruptive event. The time required for an application to fully recover is known as the recovery time objective (RTO). You also need to understand the maximum period of recent data updates (time interval) the application can tolerate losing when recovering after the disruptive event. The time period of updates that you can afford to lose is known as the recovery point objective (RPO).

The following table compares the ERT and RPO for each service tier for the most common scenarios.

| Capability | Basic | Standard | Premium | General Purpose | Business Critical |
|---|---|---|---|---|---|
| Point in Time Restore from backup | Any restore point within 7 days | Any restore point within 35 days | Any restore point within 35 days | Any restore point within configured period (up to 35 days) | Any restore point within configured period (up to 35 days) |
| Geo-restore from geo-replicated backups | ERT < 12 h, RPO < 1 h | ERT < 12 h, RPO < 1 h | ERT < 12 h, RPO < 1 h | ERT < 12 h, RPO < 1 h | ERT < 12 h, RPO < 1 h |
| Auto-failover groups | RTO = 1 h, RPO < 5 s | RTO = 1 h, RPO < 5 s | RTO = 1 h, RPO < 5 s | RTO = 1 h, RPO < 5 s | RTO = 1 h, RPO < 5 s |
| Manual database failover | ERT = 30 s, RPO < 5 s | ERT = 30 s, RPO < 5 s | ERT = 30 s, RPO < 5 s | ERT = 30 s, RPO < 5 s | ERT = 30 s, RPO < 5 s |

Note

Manual database failover refers to failover of a single database to its geo-replicated secondary using the unplanned mode.
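As a rough illustration of how the recovery-time and RPO figures in the comparison table can drive a choice, the following Python sketch filters the recovery options against an application's own RTO and RPO. The `RecoveryOption` class and the `options_meeting` function are hypothetical names introduced here, and the sketch assumes that the table values can be treated as upper bounds, which real recoveries may not always honor.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    recovery_time: timedelta  # ERT or RTO from the table, assumed to be an upper bound
    data_loss: timedelta      # RPO from the table, assumed to be an upper bound


# Values copied from the comparison table (identical across service tiers).
OPTIONS = [
    RecoveryOption("Geo-restore from geo-replicated backups",
                   timedelta(hours=12), timedelta(hours=1)),
    RecoveryOption("Auto-failover groups",
                   timedelta(hours=1), timedelta(seconds=5)),
    RecoveryOption("Manual database failover",
                   timedelta(seconds=30), timedelta(seconds=5)),
]


def options_meeting(rto: timedelta, rpo: timedelta) -> List[str]:
    """Return the names of options whose table figures fit within the given RTO and RPO."""
    return [
        option.name
        for option in OPTIONS
        if option.recovery_time <= rto and option.data_loss <= rpo
    ]


if __name__ == "__main__":
    # An application that tolerates 2 hours of downtime and 1 minute of lost updates.
    print(options_meeting(timedelta(hours=2), timedelta(minutes=1)))
```

An application with a looser target, for example a 24-hour RTO and a 1-hour RPO, would see all three options returned, at which point cost becomes the deciding factor.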

Recover a database to the existing server

To protect your business from data loss, SQL Database automatically performs a combination of weekly full database backups, differential database backups generally taken every 12 hours, and transaction log backups every 5 - 10 minutes. The backups are stored in RA-GRS storage for 35 days for all service tiers except the Basic DTU service tier, where the backups are stored for 7 days. For more information, see automatic database backups. You can restore an existing database from the automated backups to an earlier point in time as a new database on the same SQL Database server using the Azure portal, PowerShell, or the REST API. For more information, see Point-in-time restore.

If the maximum supported point-in-time restore (PITR) retention period is not sufficient for your application, you can extend it by configuring a long-term retention (LTR) policy for the database or databases. For more information, see Long-term backup retention.
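The retention rules above can be sketched as a small check that tells whether a requested restore point still falls inside the PITR window or would need long-term retention instead. This is only an illustration: `can_point_in_time_restore` and `DEFAULT_RETENTION_DAYS` are hypothetical names, the day counts are taken from this article, and the optional `configured_days` parameter stands in for a configured retention period of up to 35 days.

```python
from datetime import datetime, timedelta
from typing import Optional

# Retention periods as stated in this article; vCore tiers may be configured lower.
DEFAULT_RETENTION_DAYS = {
    "Basic": 7,
    "Standard": 35,
    "Premium": 35,
    "General Purpose": 35,
    "Business Critical": 35,
}


def can_point_in_time_restore(
    tier: str,
    restore_point: datetime,
    now: datetime,
    configured_days: Optional[int] = None,
) -> bool:
    """Return True if restore_point lies inside the PITR retention window."""
    days = configured_days if configured_days is not None else DEFAULT_RETENTION_DAYS[tier]
    if restore_point > now:
        return False  # cannot restore to a point in the future
    return now - restore_point <= timedelta(days=days)


if __name__ == "__main__":
    now = datetime(2024, 6, 30, 12, 0)
    # A restore point roughly 10 days old is outside the 7-day Basic window.
    print(can_point_in_time_restore("Basic", datetime(2024, 6, 20), now))
```

When the check returns False for a point in the past, the data can only be recovered if a long-term retention policy was already in place when the backup was taken.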

You can use these automatic database backups to recover a database from various disruptive events, both within your datacenter and to another datacenter. The recovery time is usually less than 12 hours. It may take longer to recover a very large or very active database. Using automatic database backups, the estimated time of recovery depends on several factors, including the total number of databases recovering in the same region at the same time, the database size, the transaction log size, and network bandwidth. For more information about recovery time, see database recovery time. When recovering to another region, the potential data loss is limited to 1 hour because geo-redundant backups are used.

Use automated backups and point-in-time restore as your business continuity and recovery mechanism if your application:

  • Is not considered mission critical.
  • Doesn't have a binding SLA, so a downtime of 24 hours or longer does not result in financial liability.
  • Has a low rate of data change (low transactions per hour), and losing up to an hour of changes is an acceptable data loss.
  • Is cost sensitive.

If you need faster recovery, use active geo-replication or auto-failover groups. If you need to be able to recover data from a period older than 35 days, use Long-term retention.

Recover a database to another region

Although rare, an Azure datacenter can have an outage. When an outage occurs, it causes a business disruption that might last only a few minutes or might last for hours.

  • One option is to wait for your database to come back online when the datacenter outage is over. This works for applications that can afford to have the database offline, for example a development project or a free trial that you don't need to work on constantly. When a datacenter has an outage, you do not know how long the outage might last, so this option only works if you don't need your database for a while.
  • Another option is to restore a database on any server in any Azure region using geo-redundant database backups (geo-restore). Geo-restore uses a geo-redundant backup as its source and can be used to recover a database even if the database or datacenter is inaccessible due to an outage.
  • Finally, you can quickly recover from an outage if you have configured either a geo-secondary using active geo-replication or an auto-failover group for your database or databases. Depending on which of these technologies you choose, you can use either manual or automatic failover. While failover itself takes only a few seconds, the service waits at least 1 hour before it activates an automatic failover. This delay is necessary to ensure that the failover is justified by the scale of the outage. Also, the failover may result in a small amount of data loss due to the nature of asynchronous replication. See the table earlier in this article for details of the auto-failover RTO and RPO.
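The one-hour wait described in the last option can be pictured as a simple grace-period check. The following Python sketch is an assumption about the decision logic, not the service's actual implementation; `automatic_failover_due` and `GRACE_PERIOD` are hypothetical names.

```python
from datetime import datetime, timedelta

# Minimum wait stated in this article before an automatic failover is activated.
GRACE_PERIOD = timedelta(hours=1)


def automatic_failover_due(
    outage_start: datetime,
    now: datetime,
    grace_period: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True once the outage has lasted at least the grace period."""
    return now - outage_start >= grace_period


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    print(automatic_failover_due(start, datetime(2024, 1, 1, 9, 30)))  # still within the grace period
    print(automatic_failover_due(start, datetime(2024, 1, 1, 10, 5)))  # grace period has elapsed
```

If an application cannot tolerate waiting out the grace period, a manual unplanned failover can be started sooner, at the cost of making the call yourself.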

Important

To use active geo-replication and auto-failover groups, you must either be the subscription owner or have administrative permissions in SQL Server. You can configure and fail over using the Azure portal, PowerShell, or the REST API with Azure subscription permissions, or using Transact-SQL with SQL Server permissions.

Use auto-failover groups if your application meets any of these criteria:

  • Is mission critical.
  • Has a service level agreement (SLA) that does not allow for 12 hours or more of downtime.
  • Downtime may result in financial liability.
  • Has a high rate of data change, and 1 hour of data loss is not acceptable.
  • The additional cost of active geo-replication is lower than the potential financial liability and associated loss of business.
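The criteria in this article for choosing between backup-based recovery and auto-failover groups can be condensed into a small decision helper. This is a simplified sketch: `suggest_mechanism` and its parameters are hypothetical, the thresholds (12 hours of downtime, 1 hour of data loss) come from the lists above, and the cost comparison is left out because it depends on figures only you know.

```python
def suggest_mechanism(
    mission_critical: bool,
    max_downtime_hours: float,
    max_data_loss_minutes: float,
    downtime_has_financial_liability: bool,
) -> str:
    """Suggest a recovery mechanism; any single criterion is enough to prefer failover groups."""
    needs_failover_group = (
        mission_critical
        or max_downtime_hours < 12          # SLA does not allow 12 hours or more of downtime
        or downtime_has_financial_liability
        or max_data_loss_minutes < 60       # 1 hour of data loss is not acceptable
    )
    if needs_failover_group:
        return "auto-failover groups"
    return "automated backups with point-in-time restore"


if __name__ == "__main__":
    # A cost-sensitive internal tool that tolerates a day of downtime.
    print(suggest_mechanism(False, 24, 60, False))
    # An order-processing system with a 1-hour downtime limit.
    print(suggest_mechanism(True, 1, 0.1, True))
```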

When you take action, how long it takes you to recover, and how much data loss you incur depend on how you decide to use these business continuity features in your application. Indeed, you may choose to use a combination of database backups and active geo-replication, depending on your application requirements. For a discussion of application design considerations for stand-alone databases and for elastic pools using these business continuity features, see Design an application for cloud disaster recovery and Elastic pool disaster recovery strategies.

The following sections provide an overview of the steps to recover using either database backups or active geo-replication. For detailed steps, including planning requirements, post-recovery steps, and information about how to simulate an outage to perform a disaster recovery drill, see Recover a SQL Database from an outage.

Prepare for an outage

Regardless of the business continuity feature you use, you must:

  • Identify and prepare the target server, including server-level IP firewall rules, logins, and master database level permissions.
  • Determine how to redirect clients and client applications to the new server.
  • Document other dependencies, such as auditing settings and alerts.

If you do not prepare properly, bringing your applications online after a failover or a database recovery takes additional time and will likely also require troubleshooting at a time of stress, which is a bad combination.

Fail over to a geo-replicated secondary database

If you are using active geo-replication or auto-failover groups as your recovery mechanism, you can configure an automatic failover policy or use manual unplanned failover. Once initiated, the failover causes the secondary to become the new primary, ready to record new transactions and respond to queries, with minimal data loss for the data not yet replicated. For information on designing the failover process, see Design an application for cloud disaster recovery.

Note

When the datacenter comes back online, the old primaries automatically reconnect to the new primary and become secondary databases. If you need to relocate the primary back to the original region, you can initiate a planned failover manually (failback).

Perform a geo-restore

If you are using the automated backups with geo-redundant storage (enabled by default), you can recover the database using geo-restore. Recovery usually takes place within 12 hours, with data loss of up to one hour depending on when the last log backup was taken and replicated. Until the recovery completes, the database is unable to record any transactions or respond to any queries. Note that geo-restore only restores the database to the last available point in time.

Note

If the datacenter comes back online before you switch your application over to the recovered database, you can cancel the recovery.

Perform post failover / recovery tasks

After recovery from either recovery mechanism, you must perform the following additional tasks before your users and applications are back up and running:

  • Redirect clients and client applications to the new server and restored database.
  • Ensure that appropriate server-level IP firewall rules are in place so that users can connect, or use database-level firewall rules to allow the appropriate access.
  • Ensure appropriate logins and master database level permissions are in place (or use contained users).
  • Configure auditing, as appropriate.
  • Configure alerts, as appropriate.

Note

If you are using a failover group and connect to the databases using the read-write listener, the redirection after failover happens automatically and transparently to the application.
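To show why the listener makes redirection transparent, the following Python sketch builds a connection string that targets the failover group rather than an individual server. The host name pattern `<failover-group-name>.database.windows.net` is an assumption not stated in this article and should be verified against the failover group documentation; `read_write_listener`, `connection_string`, and the sample names are hypothetical, and the key/value format is a generic illustration rather than the syntax of a specific driver.

```python
def read_write_listener(failover_group: str) -> str:
    """Build the read-write listener host name (assumed naming pattern)."""
    return f"{failover_group}.database.windows.net"


def connection_string(failover_group: str, database: str) -> str:
    """Build an illustrative connection string that points at the listener, not a server."""
    host = read_write_listener(failover_group)
    return f"Server=tcp:{host},1433;Database={database};Encrypt=yes;"


if __name__ == "__main__":
    # "my-fog" and "orders" are placeholder names.
    print(connection_string("my-fog", "orders"))
```

Because the application only ever references the listener name, the same string keeps working after the secondary becomes the new primary; an application that hard-codes a server name would need the redirect step from the list above.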

Upgrade an application with minimal downtime

Sometimes an application must be taken offline because of planned maintenance, such as an application upgrade. Manage application upgrades describes how to use active geo-replication to enable rolling upgrades of your cloud application, which minimizes downtime during upgrades and provides a recovery path if something goes wrong.

Next steps

For a discussion of application design considerations for stand-alone databases and for elastic pools, see Design an application for cloud disaster recovery and Elastic pool disaster recovery strategies.