高可用性和 Azure SQL 数据库High-availability and Azure SQL Database

Azure SQL 数据库中的高可用性体系结构的目标是保证数据库在至少 99.99% 的时间内可正常运行(若要详细了解不同层级的特定 SLA,请参阅 Azure SQL 数据库的 SLA),无需让用户担心维护操作所造成的影响和服务中断。The goal of the High Availability architecture in Azure SQL Database is to guarantee that your database is up and running minimum of 99.99% of time (For more information regarding specific SLA for different tiers, Please refer SLA for Azure SQL Database), without worrying about the impact of maintenance operations and outages. Azure 会自动处理关键的维护任务(例如修补、备份、Windows 和 SQL 升级),以及底层硬件、软件或网络故障等计划外的事件。Azure automatically handles critical servicing tasks, such as patching, backups, Windows and SQL upgrades, as well as unplanned events such as underlying hardware, software, or network failures. 修补或故障转移底层 SQL 实例时,如果在应用中使用重试逻辑,则停机不会产生明显影响。When the underlying SQL instance is patched or fails over, the downtime is not noticeable if you employ retry logic in your app. 即使出现最严重的问题,Azure SQL 数据库也能快速恢复,确保数据始终可用。Azure SQL Database can quickly recover even in the most critical circumstances ensuring that your data is always available.

高可用性解决方案旨在确保提交的数据永远不会由于故障而丢失,维护操作不会影响工作负荷,且数据库不会成为软件体系结构中的单一故障点。The high availability solution is designed to ensure that committed data is never lost due to failures, that maintenance operations do not affect your workload, and that the database will not be a single point of failure in your software architecture. 在升级或维护数据库期间,维护或停机时段都不需要停止工作负荷。There are no maintenance windows or downtimes that should require you to stop the workload while the database is upgraded or maintained.

Azure SQL 数据库中使用两种高可用性体系结构模型:There are two high-availability architectural models that are used in Azure SQL Database:

  • 基于计算和存储隔离的标准可用性模型。Standard availability model that is based on a separation of compute and storage. 该模型依赖于远程存储层的高可用性和可靠性。It relies on high availability and reliability of the remote storage tier. 此体系结构面向可以容忍在维护活动期间出现一定程度的性能下降的预算导向型业务应用程序。This architecture targets budget-oriented business applications that can tolerate some performance degradation during maintenance activities.
  • 基于数据库引擎进程群集的高级可用性模型。Premium availability model that is based on a cluster of database engine processes. 该模型依赖于以下事实:始终存在可用数据库引擎节点的仲裁。It relies on the fact that there is always a quorum of available database engine nodes. 此体系结构面向具有较高 IO 性能和较高事务处理率的任务关键型应用程序,保证维护活动期间尽量减小对工作负荷性能造成的影响。This architecture targets mission critical applications with high IO performance, high transaction rate and guarantees minimal performance impact to your workload during maintenance activities.

Azure SQL 数据库在最新稳定版本的 SQL Server 数据库引擎和 Windows OS 上运行,大多数用户不会察觉到正在持续执行升级。Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most users would not notice that upgrades are performed continuously.

“基本”、“标准”和“常规用途”服务层级可用性Basic, Standard, and General Purpose service tier availability

这些服务层级利用标准的可用性体系结构。These service tiers leverage the standard availability architecture. 下图显示了具有隔离的计算和存储层的四个不同节点。The following figure shows four different nodes with the separated compute and storage layers.

计算和存储隔离

标准可用性模型包括两个层:The standard availability model includes two layers:

  • 无状态计算层:运行 sqlservr.exe 进程,仅包含暂时性的缓存数据,例如在附加的 SSD 上的 TempDB、模型数据库,内存中的计划缓存、缓冲池和列存储池。A stateless compute layer that runs the sqlservr.exe process and contains only transient and cached data, such as TempDB, model databases on the attached SSD, and plan cache, buffer pool, and columnstore pool in memory. 此无状态节点由 Azure Service Fabric 运行。Service Fabric 初始化 sqlservr.exe、控制节点运行状况,并根据需要执行到另一节点的故障转移。This stateless node is operated by Azure Service Fabric that initializes sqlservr.exe, controls health of the node, and performs failover to another node if necessary.
  • 有状态数据层:包含存储在 Azure Blob 存储中的数据库文件 (.mdf/.ldf)。A stateful data layer with the database files (.mdf/.ldf) that are stored in Azure Blob storage. Azure Blob 存储具有内置的数据可用性和冗余功能。Azure blob storage has built-in data availability and redundancy feature. 它可以保证即使 SQL Server 进程崩溃,日志文件中的每条记录或者数据文件中的页面也仍会得到保留。It guarantees that every record in the log file or page in the data file will be preserved even if SQL Server process crashes.

每当升级数据库引擎或操作系统,或者检测到故障时,Azure Service Fabric 会将无状态 SQL Server 进程移到具有足够可用容量的另一个无状态计算节点。Whenever the database engine or the operating system is upgraded, or a failure is detected, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node with sufficient free capacity. Azure Blob 存储中的数据不受移动操作的影响,数据/日志文件将附加到新初始化的 SQL Server 进程。Data in Azure Blob storage is not affected by the move, and the data/log files are attached to the newly initialized SQL Server process. 此过程保证 99.99% 的可用性,但在过渡期间,繁重工作负荷的性能可能会有一定程度的下降,因为新的 SQL Server 实例是使用冷缓存启动的。This process guarantees 99.99% availability, but a heavy workload may experience some performance degradation during the transition since the new SQL Server instance starts with cold cache.

“高级”或“业务关键”服务层级可用性Premium and Business Critical service tier availability

高级和业务关键型服务层级利用高级可用性模型,该模型与单个节点上的计算资源(SQL Server 数据库引擎进程)和存储(本地附加的 SSD)相集成。Premium and Business Critical service tiers leverage the Premium availability model, which integrates compute resources (SQL Server Database Engine process) and storage (locally attached SSD) on a single node. 实现高可用性的方式是将计算和存储资源复制到其他节点,从而建立由三到四个节点组成的群集。High availability is achieved by replicating both compute and storage to additional nodes creating a three to four-node cluster.

数据库引擎节点群集

底层数据库文件 (.mdf/.ldf) 放在附加的 SSD 存储中,以便为工作负荷提供延迟极低的 IO。The underlying database files (.mdf/.ldf) are placed on the attached SSD storage to provide very low latency IO to your workload. 高可用性是使用类似于 SQL Server Always On 可用性组的技术来实现的。High availability is implemented using a technology similar to SQL Server Always On Availability Groups. 群集包含可供读/写客户工作负荷访问的单个主要副本(SQL Server 进程),最多包含三个次要副本(计算和存储),这些副本包含数据的副本。The cluster includes a single primary replica (SQL Server process) that is accessible for read-write customer workloads, and up to three secondary replicas (compute and storage) containing copies of data. 主要节点不断地将更改按顺序推送到辅助节点,在提交每个事务之前,它可确保数据已至少同步到一个次要副本。The primary node constantly pushes changes to the secondary nodes in order and ensures that the data is synchronized to at least one secondary replica before committing each transaction. 此过程可以保证当主要节点出于任何原因而崩溃时,始终可以故障转移到某个完全同步的节点。This process guarantees that if the primary node crashes for any reason, there is always a fully synchronized node to fail over to. 故障转移由 Azure Service Fabric 启动。The failover is initiated by the Azure Service Fabric. 次要副本变成新的主要节点后,会创建另一个次要副本,以确保群集中有足够的节点(仲裁集)。Once the secondary replica becomes the new primary node, another secondary replica is created to ensure the cluster has enough nodes (quorum set). 故障转移后,SQL 连接会自动重定向到新的主要节点。Once failover is complete, SQL connections are automatically redirected to the new primary node.

作为一项额外的优势,高级可用性模型提供用于将只读 SQL 连接重定向到某个次要副本的功能。As an extra benefit, the premium availability model includes the ability to redirect read-only SQL connections to one of the secondary replicas. 此功能称为读取扩展。它通过主要副本免费提供 100% 的额外计算容量,以减轻分析工作负荷等只读操作的负担。This feature is called Read Scale-Out. It provides 100% additional compute capacity at no extra charge to off-load read-only operations, such as analytical workloads, from the primary replica.

“超大规模”服务层级可用性Hyperscale service tier availability

“超大规模”服务层级体系结构在分布式函数体系结构中进行了介绍。The Hyperscale service tier architecture is described in Distributed functions architecture.

“超大规模”函数体系结构

“超大规模”中的可用性模型包括四个层:The availability model in Hyperscale includes four layers:

  • 无状态计算层:运行 sqlservr.exe 进程,仅包含暂时性的缓存数据,例如在附加的 SSD 的上非覆盖性 RBPEX 缓存、TempDB、模型数据库等,在内存中的计划缓存、缓冲池和列存储池。A stateless compute layer that runs the sqlservr.exe processes and contains only transient and cached data, such as non-covering RBPEX cache, TempDB, model database, etc. on the attached SSD, and plan cache, buffer pool, and columnstore pool in memory. 此无状态层包括主要计算副本,并且可以包括许多能够用作故障转移目标的次要计算副本。This stateless layer includes the primary compute replica and optionally a number of secondary compute replicas that can serve as failover targets.
  • 由页服务器组成的无状态存储层。A stateless storage layer formed by page servers. 此层是在计算副本上运行的 sqlservr.exe 进程的分布式存储引擎。This layer is the distributed storage engine for the sqlservr.exe processes running on the compute replicas. 每个页面服务器仅包含暂时性的缓存数据,例如附加的 SSD 上的覆盖性 RBPEX 缓存、在内存中缓存的数据页。Each page server contains only transient and cached data, such as covering RBPEX cache on the attached SSD, and data pages cached in memory. 每个页服务器在主动-主动配置中都有一个配对的页服务器,用于提供负载均衡、冗余性和高可用性。Each page server has a paired page server in an active-active configuration to provide load balancing, redundancy, and high availability.
  • 一个有状态事务日志存储层,包含运行日志服务进程的计算节点、事务日志登陆区域,以及事务日志长期存储。A stateful transaction log storage layer formed by the compute node running the Log service process, the transaction log landing zone, and transaction log long term storage. 登陆区域和长期存储使用 Azure 存储,后者提供事务日志所需的可用性和冗余性,确保已提交事务的数据持久性。Landing zone and long term storage use Azure Storage, which provides availability and redundancy for transaction log, ensuring data durability for committed transactions.
  • 有状态数据存储层,包含的数据库文件 (.mdf/.ndf) 存储在 Azure 存储中并通过页服务器进行更新。A stateful data storage layer with the database files (.mdf/.ndf) that are stored in Azure Storage and are updated by page servers. 此层使用 Azure 存储的数据可用性和冗余功能。This layer uses data availability and redundancy features of Azure Storage. 它保证保存数据文件中的每个页面,即使“超大规模”体系结构的其他层中的进程崩溃或计算节点故障,也是如此。It guarantees that every page in a data file will be preserved even if processes in other layers of Hyperscale architecture crash, or if compute nodes fail.

所有“超大规模”层中的计算节点都运行在 Azure Service Fabric 上,后者控制每个节点的运行状况,并在必要时将数据故障转移到可用的健康节点。Compute nodes in all Hyperscale layers run on Azure Service Fabric, which controls health of each node and performs failovers to available healthy nodes as necessary.

若要详细了解超大规模中的高可用性,请参阅超大规模中的数据库高可用性For more information on high availability in Hyperscale, see Database High Availability in Hyperscale.

加速的数据库恢复 (ADR)Accelerated Database Recovery (ADR)

加速的数据库恢复 (ADR) 是一项新的 SQL 数据库引擎功能,极大地提高数据库可用性(尤其是存在长期运行的事务时)。Accelerated Database Recovery (ADR) is a new SQL database engine feature that greatly improves database availability, especially in the presence of long running transactions. ADR 目前可用于单个数据库、弹性池和 Azure SQL 数据仓库。ADR is currently available for single databases, elastic pools, and Azure SQL Data Warehouse.

测试应用程序的故障复原能力Testing application fault resiliency

高可用性是 Azure SQL 数据库平台的基本功能,其运作对数据库应用程序透明。High availability is a fundamental part of Azure SQL Database platform that works transparently for your database application. 不过,我们认识到,你可能需要先测试在计划内或计划外事件期间启动的自动故障转移操作对应用程序的具体影响,然后才会将其部署到生产环境。However, we recognize that you may want to test how the automatic failover operations initiated during planned or unplanned events would impact the application before you deploy it to production. 可以调用一个特殊的 API 来重启数据库或弹性池,从而触发故障转移。You can call a special API to restart a database or an elastic pool, which will in turn trigger a failover. 由于重启操作会干扰系统,其数量过多可能会对平台造成压力,因此每个数据库或弹性池每 30 分钟只能进行一次故障转移调用。Because the restart operation is intrusive and a large number of them could stress the platform, only one failover call is allowed every 30 minutes for each database or elastic pool.

可以使用 REST API 或 PowerShell 启动故障转移。A failover can be initiated using REST API or PowerShell. 有关 REST API 的信息,请参阅数据库故障转移弹性池故障转移For REST API, see Database failover and Elastic pool failover. 有关 PowerShell 的信息,请参阅 Invoke-AzSqlDatabaseFailoverInvoke-AzSqlElasticPoolFailoverFor PowerShell, see Invoke-AzSqlDatabaseFailover and Invoke-AzSqlElasticPoolFailover. 也可使用 az rest 命令从 Azure CLI 进行 REST API 调用。The REST API calls can also be made from Azure CLI using az rest command.

Important

目前未为“超大规模”服务层级和托管实例提供故障转移命令。The Failover command is currently not available in the Hyperscale service tier and for Managed Instance.

结论Conclusion

Azure SQL 数据库提供与 Azure 平台深度集成的内置高可用性解决方案。Azure SQL Database features a built-in high availability solution, that is deeply integrated with the Azure platform. 它依赖于使用 Service Fabric 来执行故障检测和恢复,依赖于 Azure Blob 存储来实现数据保护,并依赖于可用性区域来提高容错能力。It is dependent on Service Fabric for failure detection and recovery, on Azure Blob storage for data protection, and on Availability Zones for higher fault tolerance. 此外,Azure SQL 数据库利用 SQL Server 的 Always On 可用性组技术来执行复制和故障转移。In addition, Azure SQL database leverages the Always On Availability Group technology from SQL Server for replication and failover. 将这些技术相结合,应用程序可完全实现混合存储模型的优势并支持最严格的 SLA。The combination of these technologies enables applications to fully realize the benefits of a mixed storage model and support the most demanding SLAs.

后续步骤Next steps