High availability in Azure Database for PostgreSQL - Single Server

The Azure Database for PostgreSQL - Single Server service provides a guaranteed high level of availability, with a financially backed service level agreement (SLA) of 99.99% uptime. Azure Database for PostgreSQL provides high availability during planned events, such as user-initiated scale compute operations, and also when unplanned events such as underlying hardware, software, or network failures occur. Azure Database for PostgreSQL can quickly recover from most critical circumstances, ensuring virtually no application downtime when using this service.
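To put the 99.99% figure in concrete terms, the following minimal sketch converts an uptime percentage into a downtime budget. It assumes the percentage applies to simple wall-clock time over the period; the SLA itself defines how uptime is measured, so treat the numbers as rough guidance. The function name `downtime_budget` is illustrative only.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Longest total downtime within `period` that still meets `sla_percent` uptime.

    Assumption: uptime is measured as plain wall-clock time over the period.
    """
    return period * (1 - sla_percent / 100)


if __name__ == "__main__":
    month = downtime_budget(99.99, timedelta(days=30))
    year = downtime_budget(99.99, timedelta(days=365))
    # Roughly 4.3 minutes per 30-day month and 52.6 minutes per year.
    print(f"30 days:  {month.total_seconds() / 60:.2f} minutes")
    print(f"365 days: {year.total_seconds() / 60:.2f} minutes")
```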

Azure Database for PostgreSQL is suitable for running mission-critical databases that require high uptime. Built on Azure architecture, the service has inherent high availability, redundancy, and resiliency capabilities that mitigate database downtime from planned and unplanned outages, without requiring you to configure any additional components.

Components in Azure Database for PostgreSQL - Single Server

| Component | Description |
| --- | --- |
| PostgreSQL Database Server | Azure Database for PostgreSQL provides security, isolation, resource safeguards, and fast restart capability for database servers. These capabilities allow operations such as scaling and database server recovery after an outage to complete in seconds. Data modifications in the database server typically occur in the context of a database transaction. All database changes are recorded synchronously in the form of write-ahead logs (WAL) on Azure Storage, which is attached to the database server. During the database checkpoint process, data pages from the database server memory are also flushed to the storage. |
| Remote Storage | All PostgreSQL physical data files and WAL files are stored on Azure Storage, which is architected to store three copies of data within a region to ensure data redundancy, availability, and reliability. The storage layer is also independent of the database server. It can be detached from a failed database server and reattached to a new database server within a few seconds. Azure Storage also continuously monitors for storage faults. If a block corruption is detected, it is automatically fixed by instantiating a new storage copy. |
| Gateway | The Gateway acts as a database proxy and routes all client connections to the database server. |

Planned downtime mitigation

Azure Database for PostgreSQL is architected to provide high availability during planned downtime operations.

[Figure: View of elastic scaling in Azure PostgreSQL]

  1. PostgreSQL database servers scale up and down in seconds.
  2. The Gateway acts as a proxy that routes client connections to the proper database server.
  3. Storage can be scaled up without any downtime. Remote storage enables fast detach/re-attach after the failover.

Here are some planned maintenance scenarios:

| Scenario | Description |
| --- | --- |
| Compute scale up/down | When the user performs a compute scale up/down operation, a new database server is provisioned using the scaled compute configuration. In the old database server, active checkpoints are allowed to complete, client connections are drained, any uncommitted transactions are canceled, and then the server is shut down. The storage is then detached from the old database server and attached to the new database server. When the client application retries the connection, or tries to make a new connection, the Gateway directs the connection request to the new database server. |
| Scaling up storage | Scaling up the storage is an online operation and does not interrupt the database server. |
| New software deployment (Azure) | New feature rollouts and bug fixes happen automatically as part of the service's planned maintenance. For more information, refer to the documentation, and also check your portal. |
| Minor version upgrades | Azure Database for PostgreSQL automatically patches database servers to the minor version determined by Azure. This happens as part of the service's planned maintenance. It incurs a short downtime, measured in seconds, and the database server is automatically restarted with the new minor version. For more information, refer to the documentation, and also check your portal. |

Unplanned downtime mitigation

Unplanned downtime can occur as a result of unforeseen failures, including underlying hardware faults, networking issues, and software bugs. If the database server goes down unexpectedly, a new database server is automatically provisioned in seconds. The remote storage is automatically attached to the new database server. The PostgreSQL engine performs the recovery operation using WAL and database files, and then opens the database server to allow clients to connect. Uncommitted transactions are lost and must be retried by the application. While unplanned downtime cannot be avoided, Azure Database for PostgreSQL mitigates it by automatically performing recovery operations at both the database server and storage layers, without requiring human intervention.

[Figure: View of high availability in Azure PostgreSQL]

  1. Azure PostgreSQL servers with fast-scaling capabilities.
  2. Gateway that acts as a proxy to route client connections to the proper database server.
  3. Azure storage with three copies for reliability, availability, and redundancy.
  4. Remote storage also enables fast detach/re-attach after the server failover.
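
The recovery sequence described in this section can be illustrated with a toy in-memory model. This is only a sketch for building intuition and not how the service is implemented: the classes `RemoteStorage`, `DatabaseServer`, and `Gateway` and all of their methods are hypothetical. It shows why committed changes survive a server failure (the WAL lives on storage that outlasts the server) while uncommitted changes are lost.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteStorage:
    """Toy stand-in for the storage layer; it outlives any one server."""
    wal: list = field(default_factory=list)    # committed changes, in order
    pages: dict = field(default_factory=dict)  # data as of the last checkpoint
    checkpointed: int = 0                      # WAL records already in pages


class DatabaseServer:
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage                 # attach the remote storage
        self.memory = dict(storage.pages)
        # Recovery: replay WAL records written after the last checkpoint.
        for key, value in storage.wal[storage.checkpointed:]:
            self.memory[key] = value
        self.pending = {}                      # uncommitted, memory only

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        # WAL records reach storage before the commit is acknowledged.
        self.storage.wal.extend(self.pending.items())
        self.memory.update(self.pending)
        self.pending = {}

    def checkpoint(self):
        # Flush committed data pages to storage.
        self.storage.pages = dict(self.memory)
        self.storage.checkpointed = len(self.storage.wal)


class Gateway:
    """Routes clients to whichever server is current."""
    def __init__(self, server):
        self.server = server

    def failover(self, new_name):
        # Provision a new server and reattach the same storage to it.
        self.server = DatabaseServer(new_name, self.server.storage)

    def read(self, key):
        return self.server.memory.get(key)


if __name__ == "__main__":
    gateway = Gateway(DatabaseServer("server-a", RemoteStorage()))
    db = gateway.server
    db.write("balance", 100); db.commit(); db.checkpoint()
    db.write("balance", 150); db.commit()   # committed after the checkpoint
    db.write("balance", 999)                # never committed
    gateway.failover("server-b")            # server-a is gone
    print(gateway.server.name, gateway.read("balance"))
```

In this model the value read after failover is the last committed one; the uncommitted write would have to be retried by the application.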

Unplanned downtime: failure scenarios and service recovery

Here are some failure scenarios and how Azure Database for PostgreSQL automatically recovers:

| Scenario | Automatic recovery |
| --- | --- |
| Database server failure | If the database server is down because of an underlying hardware fault, active connections are dropped and any in-flight transactions are aborted. A new database server is automatically deployed, and the remote data storage is attached to the new database server. After the database recovery is complete, clients can connect to the new database server through the Gateway. The recovery time (RTO) depends on various factors, including the activity at the time of the fault, such as large transactions, and the amount of recovery to be performed during the database server startup process. Applications using the PostgreSQL databases need to be built so that they detect and retry dropped connections and failed transactions. When the application retries, the Gateway transparently redirects the connection to the newly created database server. |
| Storage failure | Applications do not see any impact from storage-related issues such as a disk failure or a physical block corruption. Because the data is stored in three copies, a copy of the data is served by the surviving storage. Block corruptions are automatically corrected. If a copy of the data is lost, a new copy of the data is automatically created. |
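
As an illustration of the retry behavior applications need, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. It is driver-agnostic by assumption: `run_with_retry` and its parameters are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception your PostgreSQL driver raises when a connection is dropped. The attempt count and delays are placeholder values, not recommendations from the service.

```python
import random
import time


def run_with_retry(transaction, retryable=(ConnectionError,), attempts=5,
                   base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run `transaction()` and retry it when a transient failure occurs.

    `transaction` should open its own connection and be safe to rerun from
    the start, because work from a failed attempt is not committed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except retryable:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff, capped, plus jitter to spread out retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_transaction():
        # Simulates two dropped connections followed by a success.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("connection dropped during failover")
        return "committed"

    print(run_with_retry(flaky_transaction, base_delay=0.01))
```

Reconnecting inside `transaction` matters here: each retry makes a new connection request, which the Gateway can direct to the new database server.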

Here are some failure scenarios that require user action to recover:

| Scenario | Recovery plan |
| --- | --- |
| Region failure | Failure of a region is a rare event. However, if you need protection from a region failure, you can configure one or more read replicas in other regions for disaster recovery (DR). (See the article about creating and managing read replicas for details.) In the event of a region-level failure, you can manually promote a read replica configured in another region to be your production database server. |
| Logical/user errors | Recovery from user errors, such as accidentally dropped tables or incorrectly updated data, involves performing a point-in-time recovery (PITR): restoring and recovering the data up to the time just before the error occurred. If you want to restore only a subset of databases or specific tables rather than all databases in the database server, you can restore the database server to a new instance, export the tables via pg_dump, and then use pg_restore to restore those tables into your database. |
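
The pg_dump and pg_restore step for a single table might be scripted along these lines. This is a sketch under assumptions, not a prescribed procedure: the function names, the dictionary layout, and the file name are hypothetical, the host and user values are placeholders you supply, and the point-in-time restore to a new server is assumed to have been done already. It relies on the standard `pg_dump` and `pg_restore` command-line options and on the libpq environment variables `PGPASSWORD` and `PGSSLMODE`.

```python
import os
import subprocess


def dump_command(host, user, dbname, table, dump_file):
    """pg_dump invocation that exports one table in custom archive format."""
    return ["pg_dump", f"--host={host}", "--port=5432", f"--username={user}",
            f"--dbname={dbname}", "--format=custom", f"--table={table}",
            f"--file={dump_file}"]


def restore_command(host, user, dbname, dump_file):
    """pg_restore invocation that loads the archive into the target database."""
    return ["pg_restore", f"--host={host}", "--port=5432", f"--username={user}",
            f"--dbname={dbname}", "--no-owner", dump_file]


def copy_table(restored, production, dbname, table, dump_file="table.dump"):
    """Export `table` from the PITR-restored server, then load it into production.

    `restored` and `production` are dicts with "host", "user", and "password"
    keys (a hypothetical layout chosen for this sketch).
    """
    steps = [
        (restored, dump_command(restored["host"], restored["user"],
                                dbname, table, dump_file)),
        (production, restore_command(production["host"], production["user"],
                                     dbname, dump_file)),
    ]
    for server, command in steps:
        # Pass credentials via libpq environment variables, not the command line.
        env = {**os.environ, "PGPASSWORD": server["password"],
               "PGSSLMODE": "require"}
        subprocess.run(command, env=env, check=True)
```

If the table still exists in the target database, pg_restore reports errors when it tries to re-create it, so drop or rename the damaged table first, or review the `--clean` option before using it.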

Summary

Azure Database for PostgreSQL provides fast restart capability for database servers, redundant storage, and efficient routing from the Gateway. For additional data protection, you can configure backups to be geo-replicated, and also deploy one or more read replicas in other regions. With its inherent high availability capabilities, Azure Database for PostgreSQL protects your databases from the most common outages and offers an industry-leading, financially backed 99.99% uptime SLA. All these availability and reliability capabilities make Azure an ideal platform for running your mission-critical applications.

Next steps