Azure SQL 中的加速数据库恢复Accelerated Database Recovery in Azure SQL

适用于:是 Azure SQL 数据库 是Azure SQL 托管实例 APPLIES TO: yesAzure SQL Database yesAzure SQL Managed Instance

加速的数据库恢复 (ADR) 是一项 SQL Server 数据库引擎功能,通过重新设计 SQL Server 数据库引擎恢复过程,极大地提高数据库可用性(尤其是存在长期运行的事务时)。Accelerated Database Recovery (ADR) is a SQL Server database engine feature that greatly improves database availability, especially in the presence of long running transactions, by redesigning the SQL Server database engine recovery process. ADR 目前可用于 Azure SQL 数据库、Azure SQL 托管实例、Azure VM 上的 SQL Server 以及 Azure Synapse Analytics(目前为预览版)中的数据库。ADR is currently available for Azure SQL Database, Azure SQL Managed Instance, SQL Server on Azure VM, and databases in Azure Synapse Analytics (currently in preview). ADR 的主要优势在于:The primary benefits of ADR are:

  • 快速且一致的数据库恢复Fast and consistent database recovery

    使用 ADR,长时间运行的事务不会影响整体恢复时间,且无论系统中活动事务的数量或大小如何,都可以实现快速且一致的数据库恢复。With ADR, long running transactions do not impact the overall recovery time, enabling fast and consistent database recovery irrespective of the number of active transactions in the system or their sizes.

  • 即时事务回滚Instantaneous transaction rollback

    使用 ADR,事务回滚是即时的,与事务处于活动状态的时间或已执行的更新次数无关。With ADR, transaction rollback is instantaneous, irrespective of the time that the transaction has been active or the number of updates that has performed.

  • 主动日志截断Aggressive log truncation

    即使存在长时间运行的活动事务,ADR 也会主动截断事务日志,这可以防止其增长失控。With ADR, the transaction log is aggressively truncated, even in the presence of active long-running transactions, which prevents it from growing out of control.

标准数据库恢复过程Standard database recovery process

数据库恢复遵循 ARIES 恢复模式,由三个阶段组成,如下图所示,并在该示意图下附有详细的说明。Database recovery follows the ARIES recovery model and consists of three phases, which are illustrated in the following diagram and explained in more detail following the diagram.

当前恢复过程

  • 分析阶段Analysis phase

    从最后一个成功检查点(或最早的脏页 LSN)的开头向前扫描事务日志直至结束,以确定数据库停止时每个事务的状态。Forward scan of the transaction log from the beginning of the last successful checkpoint (or the oldest dirty page LSN) until the end, to determine the state of each transaction at the time the database stopped.

  • 重做阶段Redo phase

    从最早的未提交事务开始向前扫描事务日志直至结束,通过恢复所有提交的操作将数据库恢复到故障时的状态。Forward scan of the transaction log from the oldest uncommitted transaction until the end, to bring the database to the state it was at the time of the crash by redoing all committed operations.

  • 撤消阶段Undo phase

    对于在故障时处于活动状态的每个事务,向后遍历日志,撤消该事务执行的操作。For each transaction that was active as of the time of the crash, traverses the log backwards, undoing the operations that this transaction performed.

基于此设计,SQL Server 数据库引擎从意外重启中恢复所需的时间(大致)与故障时系统中时间最长的活动事务的大小成正比。Based on this design, the time it takes the SQL Server database engine to recover from an unexpected restart is (roughly) proportional to the size of the longest active transaction in the system at the time of the crash. 恢复需要回滚所有未完成的事务。Recovery requires a rollback of all incomplete transactions. 所需的时间长度与事务已执行的工作及其处于活动状态的时间成正比。The length of time required is proportional to the work that the transaction has performed and the time it has been active. 因此,存在长期运行的事务(例如对大型表的大批量插入操作或索引生成操作)时,恢复过程可能需要很长时间。Therefore, the recovery process can take a long time in the presence of long-running transactions (such as large bulk insert operations or index build operations against a large table).

此外,基于此设计,取消/回滚大型事务也可能需要很长时间,因为它使用与上述流程相同的撤消恢复阶段。Also, cancelling/rolling back a large transaction based on this design can also take a long time as it is using the same Undo recovery phase as described above.

此外,存在长期运行的事务时,SQL Server 数据库引擎无法截断事务日志,因为恢复和回退过程需要相应的日志记录。In addition, the SQL Server database engine cannot truncate the transaction log when there are long-running transactions because their corresponding log records are needed for the recovery and rollback processes. 由于 SQL Server 数据库引擎的这种设计,一些客户过去常常面临事务日志变得非常大并占用大量驱动器空间的问题。As a result of this design of the SQL Server database engine, some customers used to face the problem that the size of the transaction log grows very large and consumes huge amounts of drive space.

加速的数据库恢复过程The Accelerated Database Recovery process

ADR 通过完全重新设计 SQL Server 数据库引擎恢复过程来解决上述问题,具体内容如下:ADR addresses the above issues by completely redesigning the SQL Server database engine recovery process to:

  • 通过避免以最早的活动事务为起始点/结束点扫描日志,使其保持恒定时间/即时状态。Make it constant time/instant by avoiding having to scan the log from/to the beginning of the oldest active transaction. 使用 ADR,事务日志仅从最后一个成功检查点(或最早的脏页日志序列号 (LSN))开始处理。With ADR, the transaction log is only processed from the last successful checkpoint (or oldest dirty page Log Sequence Number (LSN)). 因此,恢复时间不受长时间运行的事务影响。As a result, recovery time is not impacted by long running transactions.
  • 由于不再需要为整个事务处理日志,因此可最大程度地减少所需的事务日志空间。Minimize the required transaction log space since there is no longer a need to process the log for the whole transaction. 当检查点和备份出现时,可以主动截断事务日志。As a result, the transaction log can be truncated aggressively as checkpoints and backups occur.

从较高层次来看,ADR 可通过对所有物理数据库修改进行版本控制并仅撤消逻辑操作(逻辑操作比较有限,且几乎可以立即撤消)来实现快速数据库恢复。At a high level, ADR achieves fast database recovery by versioning all physical database modifications and only undoing logical operations, which are limited and can be undone almost instantly. 在故障时处于活动状态的任何事务都被标记为已中止,因此,并发用户查询可以忽略这些事务生成的任何版本。Any transaction that was active as of the time of a crash are marked as aborted and, therefore, any versions generated by these transactions can be ignored by concurrent user queries.

ADR 恢复过程与当前恢复过程具有相同的三个阶段。The ADR recovery process has the same three phases as the current recovery process. 下图说明了这些阶段在 ADR 中的运作方式,并在该示意图后附带了详细的说明。How these phases operate with ADR is illustrated in the following diagram and explained in more detail following the diagram.

ADR 恢复过程

  • 分析阶段Analysis phase

    除了为非版本控制操作重构 sLog 和复制日志记录,此过程与以前的恢复过程相同。The process remains the same as before with the addition of reconstructing sLog and copying log records for non-versioned operations.

  • 重做阶段Redo phase

    分为两个阶段 (P)Broken into two phases (P)

    • 阶段 1Phase 1

      从 sLog 重做(从最早的未提交事务到上一个检查点)。Redo from sLog (oldest uncommitted transaction up to last checkpoint). 重做是一种快速操作,因为它仅需要处理 sLog 中的一些记录。Redo is a fast operation as it only needs to process a few records from the sLog.

    • 阶段 2Phase 2

      从事务日志开始恢复,从最后一个检查点(而不是最早的未提交事务)开始Redo from Transaction Log starts from last checkpoint (instead of oldest uncommitted transaction)

  • 撤消阶段Undo phase

    使用 ADR 的撤销阶段几乎是瞬间完成的,方法是使用 sLog 通过逻辑还原撤消非版本化操作和持久版本存储 (PVS) 以在行级别执行基于版本的撤消。The Undo phase with ADR completes almost instantaneously by using sLog to undo non-versioned operations and Persisted Version Store (PVS) with Logical Revert to perform row level version-based Undo.

ADR 恢复组件ADR recovery components

ADR 的四个关键组件是:The four key components of ADR are:

  • 持久版本存储 (PVS)Persisted version store (PVS)

    持久版本存储是一种新的 SQL Server 数据库引擎机制,用于持久保存在数据库本身生成生成(而不是在传统的 tempdb 版本存储中生成)的行版本。The persisted version store is a new SQL Server database engine mechanism for persisting the row versions generated in the database itself instead of the traditional tempdb version store. PVS 支持资源隔离,并提高可读辅助数据库的可用性。PVS enables resource isolation as well as improves availability of readable secondaries.

  • 逻辑还原Logical revert

    逻辑还原是一种异步过程,负责在行级别执行基于版本的撤消 - 为所有版本化操作实现即时事务回滚和撤消功能。Logical revert is the asynchronous process responsible for performing row-level version-based Undo - providing instant transaction rollback and undo for all versioned operations. 逻辑还原通过以下方式来完成:Logical revert is accomplished by:

    • 跟踪所有已中止事务,并将它们标记为对其他事务不可见。Keeping track of all aborted transactions and marking them invisible to other transactions.
    • 使用 PVS 执行所有用户事务的回滚操作,而不是通过物理方式扫描事务日志并逐一撤消更改。Performing rollback by using PVS for all user transactions, rather than physically scanning the transaction log and undoing changes one at a time.
    • 在事务中止后立即释放所有锁定。Releasing all locks immediately after transaction abort. 由于中止涉及到直接在内存中标记更改,此过程很高效,因此不需长时间维持锁定状态。Since abort involves simply marking changes in memory, the process is very efficient and therefore locks do not have to be held for a long time.
  • sLogsLog

    sLog 是一个辅助数据库内存中日志流,用于存储非版本控制操作(如元数据缓存无效、锁获取等)的日志记录。sLog is a secondary in-memory log stream that stores log records for non-versioned operations (such as metadata cache invalidation, lock acquisitions, and so on). sLog 具有以下特性:The sLog is:

    • 低容量和内存中Low volume and in-memory
    • 通过在检查点过程中序列化保留在磁盘上Persisted on disk by being serialized during the checkpoint process
    • 提交事务时定期被截断Periodically truncated as transactions commit
    • 通过仅处理非版本控制操作来加速重做和撤消Accelerates redo and undo by processing only the non-versioned operations
    • 通过仅保留所需的日志记录来实现主动事务日志截断Enables aggressive transaction log truncation by preserving only the required log records
  • 清理器Cleaner

    清理器是定期唤醒并清除不需要的页面版本的异步过程。The cleaner is the asynchronous process that wakes up periodically and cleans page versions that are not needed.

加速数据库恢复模式Accelerated Database Recovery Patterns

以下类型的工作负载最受益于 ADR:The following types of workloads benefit most from ADR:

  • 具有长期运行的事务的工作负载。Workloads with long-running transactions.
  • 出现活动事务导致事务日志显著增长情况的工作负载。Workloads that have seen cases where active transactions are causing the transaction log to grow significantly.
  • 由于长时间运行的恢复(如意外的服务重启或手动事务回滚)而经历了数据库长时间不可用的工作负载。Workloads that have experienced long periods of database unavailability due to long running recovery (such as unexpected service restart or manual transaction rollback).