处理 Azure Database for MariaDB 的暂时性连接错误Handling of transient connectivity errors for Azure Database for MariaDB

本文介绍如何处理连接 Azure Database for MariaDB 时出现的暂时性错误。This article describes how to handle transient errors connecting to Azure Database for MariaDB.

暂时性错误Transient errors

暂时性错误也称为暂时性故障,是一种可以自行解决的错误。A transient error, also known as a transient fault, is an error that will resolve itself. 这些错误往往表现为与数据库服务器的连接断开。Most typically these errors manifest as a connection to the database server being dropped. 此外,无法与服务器建立的新连接。Also new connections to a server can't be opened. 例如,在发生硬件或网络故障时,可能会出现暂时性错误。Transient errors can occur for example when hardware or network failure happens. 另一个可能的原因是正在推出 PaaS 服务的新版本。系统在 60 秒以内可自动解决其中的大部分事件。Another reason could be a new version of a PaaS service that is being rolled out. Most of these events are automatically mitigated by the system in less than 60 seconds. 设计和开发云中的应用程序时,预料到会出现暂时性错误是最佳做法。A best practice for designing and developing applications in the cloud is to expect transient errors. 假设这些错误随时可能在任意组件中发生,并部署相应的逻辑来应对这种情况。Assume they can happen in any component at any time and to have the appropriate logic in place to handle these situations.

处理暂时性错误Handling transient errors

应使用重试逻辑来处理暂时性错误。Transient errors should be handled using retry logic. 必须考虑的情况包括:Situations that must be considered:

  • 尝试打开连接时出错An error occurs when you try to open a connection
  • 服务器端的空闲连接断开。An idle connection is dropped on the server side. 尝试发出的命令无法执行When you try to issue a command it can't be executed
  • 当前正在用于执行命令的活动连接断开。An active connection that currently is executing a command is dropped.

第一种和第二种情况的处理方式非常直接。The first and second case are fairly straight forward to handle. 可以再试打开连接。Try to open the connection again. 如果连接成功,则表示系统解决了暂时性错误。When you succeed, the transient error has been mitigated by the system. 可以再次使用 Azure Database for MariaDB。You can use your Azure Database for MariaDB again. 我们建议在重试连接之前等待一段时间。We recommend having waits before retrying the connection. 如果初始重试失败,则回退。Back off if the initial retries fail. 这样,系统便可以使用所有可用资源来解决错误局面。This way the system can use all resources available to overcome the error situation. 遵循的良好模式是:A good pattern to follow is:

  • 在首次重试之前等待 5 秒。Wait for 5 seconds before your first retry.
  • 对于每次后续重试,以指数级增大等待时间,最长为 60 秒。For each following retry, the increase the wait exponentially, up to 60 seconds.
  • 设置最大重试次数,达到该次数时,应用程序认为操作失败。Set a max number of retries at which point your application considers the operation failed.

活动事务的连接失败时,适当地处理恢复会更困难。When a connection with an active transaction fails, it is more difficult to handle the recovery correctly. 存在两种情况:如果事务在性质上是只读的,则可以安全地重新打开连接并重试事务。There are two cases: If the transaction was read-only in nature, it is safe to reopen the connection and to retry the transaction. 但是,如果事务也在写入数据库,则必须确定事务在发生暂时性错误之前是已回滚还是已成功。If however if the transaction was also writing to the database, you must determine if the transaction was rolled back, or if it succeeded before the transient error happened. 在这种情况下,你可能尚未从数据库服务器收到提交确认。In that case, you might not have received the commit acknowledgment from the database server.

解决此问题的方法之一是,在客户端上生成一个用于所有重试的唯一 ID。One way of doing this, is to generate a unique ID on the client that is used for all the retries. 将此唯一 ID 作为事务的一部分传递给服务器,并将其存储在具有唯一约束的列中。You pass this unique ID as part of the transaction to the server and to store it in a column with a unique constraint. 这样,便可以安全重试事务。This way you can safely retry the transaction. 如果前一事务已回滚,并且客户端生成的唯一 ID 在系统中尚不存在,则重试将会成功。It will succeed if the previous transaction was rolled back and the client generated unique ID does not yet exist in the system. 如果之前已存储该唯一 ID(因为前一事务已成功完成),则重试将会失败,并指示重复键冲突。It will fail indicating a duplicate key violation if the unique ID was previously stored because the previous transaction completed successfully.

如果程序通过第三方中间件来与 Azure Database for MariaDB 通信,请咨询供应商该中间件是否包含暂时性错误的重试逻辑。When your program communicates with Azure Database for MariaDB through third-party middleware, ask the vendor whether the middleware contains retry logic for transient errors.

请务必测试重试逻辑。Make sure to test you retry logic. 例如,尝试在纵向扩展或缩减 Azure Database for MariaDB 服务器的计算资源时执行代码。For example, try to execute your code while scaling up or down the compute resources of you Azure Database for MariaDB server. 应用程序应可处理此操作期间遇到的短暂停机,而不会出现任何问题。Your application should handle the brief downtime that is encountered during this operation without any problems.

后续步骤Next steps