MySQL Database on Azure service continuity solutions

This article summarizes several typical interruption scenarios and the MySQL Database on Azure solutions that preserve service continuity, and it introduces the steps and performance indicators for regional disaster recovery.

Overview

Service continuity refers to designing, deploying, and running applications flexibly, so that planned or unplanned interruption events do not leave them temporarily or permanently unable to perform their service functions.

The goal of service continuity is to reduce the effect of interruptions on application services, minimize the duration of those effects, and avoid data loss.

Unplanned interruption events include human error, temporary or permanent interruptions, and even regional disasters (which might cause an Azure region to suffer a large-scale loss of functionality).

Planned events include redeploying applications in different regions and upgrading applications.

Before we look at service continuity solutions, it is important to be familiar with the following concepts:

  • Recovery time objective (RTO): The maximum acceptable time after an interruption incident occurs before the application fully recovers. RTO is used to measure the maximum availability loss during the outage.
  • Recovery point objective (RPO): The maximum amount of recent updates, expressed as a time interval, that might be lost after an interruption incident occurs and before the application fully recovers. RPO is used to measure the maximum data loss during the outage.
  • Estimated recovery time (ERT): The estimated length of time after a restore or failover request is sent and before the database is fully available.
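As a worked illustration of these definitions, the following minimal sketch compares one incident against an RTO and an RPO. The function name `assess_recovery`, its parameters, and the sample timestamps are hypothetical and chosen only for illustration; they are not part of MySQL Database on Azure.

```python
from datetime import datetime, timedelta


def assess_recovery(last_recoverable_update, incident_start, service_restored, rto, rpo):
    """Compare one incident against the recovery objectives (illustrative only)."""
    # Availability loss: how long the application was unable to serve requests.
    downtime = service_restored - incident_start
    # Data loss: updates made after the last recoverable update are gone.
    data_loss_window = incident_start - last_recoverable_update
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "meets_rto": downtime <= rto,
        "meets_rpo": data_loss_window <= rpo,
    }


# Hypothetical incident: data is recoverable up to 07:40, the interruption
# starts at 08:00, and the application is fully recovered at 10:30.
result = assess_recovery(
    last_recoverable_update=datetime(2016, 5, 6, 7, 40),
    incident_start=datetime(2016, 5, 6, 8, 0),
    service_restored=datetime(2016, 5, 6, 10, 30),
    rto=timedelta(hours=3),
    rpo=timedelta(hours=1),
)
print(result["downtime"], result["data_loss_window"])
print(result["meets_rto"], result["meets_rpo"])
```

In this example the downtime is 2 hours 30 minutes and the data loss window is 20 minutes, so both objectives are met.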

Service continuity solutions

This article introduces several main scenarios and corresponding solutions.

Scenario | Description | Solution
--- | --- | ---
Recover from service interruptions caused by human error | An operational error by a database administrator in a production environment causes the loss of some important data and requires rapid recovery. | MySQL Database on Azure supports rollback to any time within the last seven days. For the specific steps, refer to the "Restore the database to any time point" section of Backup and restore MySQL Database on Azure.
Recover from service disruptions caused by a particular upgrade | The failure of a particular upgrade within a production environment causes compatibility problems, resulting in services not operating normally. | The system automatically creates snapshot backups of the database. If there is a problem during the upgrade, you can quickly restore a complete backup to a new instance. For the specific procedure, refer to the "Restore the database to any time point" section of Backup and restore MySQL Database on Azure.
Recover from a regional disaster | If a regional disaster causes wide-area service disruptions, you need to rapidly perform the remote recovery procedure. | You can choose from three types of recovery solution based on the severity of the situation. The key points are explained later in this article.

Recover from a regional disaster

MySQL Database on Azure provides the Geo Restore and Geo Replication features to help you maintain service continuity in case of a regional disaster (for example, a large-scale power loss in a datacenter, fire, earthquake, or other force majeure events).

Geo Restore

At the storage layer, MySQL Database on Azure stores user data in Azure Blob storage and uses read-access geo-redundant storage, which keeps three copies of the data locally and three copies offsite, with the offsite copies set to read-only access. When you perform an offsite restore operation and local storage is available, MySQL Database on Azure copies the data from the source region to the target region based on the point in time that you designate, and then performs the database recovery. If local storage is not available, MySQL Database on Azure performs the database recovery offsite by using the replicated data that is closest to the designated point in time.

With Geo Restore, you can use either the PaaS service restore solution or the self-service restore solution for disaster recovery. The performance indicators for Geo Restore are ERT < 3 hours and RPO < 1 hour.

Geo Replication

If you create a geo-subordinate instance through the Azure portal in advance, you can promote it and switch your services to it when a regional disaster occurs, which preserves service continuity. The performance indicators for Geo Replication are ERT < 30 seconds and RPO < 10 seconds.

Note

ERT, RTO, and RPO are engineering targets that are provided for reference only. They apply only to regional disasters and are not part of the service level agreement (SLA) for the MySQL database service.
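The indicators quoted above can be checked against an application's own requirements. The following sketch is one rough way to do that, assuming the published bounds are treated as worst-case values. The names `INDICATORS` and `features_meeting` are hypothetical, and the comparison ignores the time needed to detect a disaster and to decide to fail over, which also counts toward RTO.

```python
from datetime import timedelta

# Upper bounds quoted in this article (reference values, not an SLA).
INDICATORS = {
    "Geo Restore": {"ert": timedelta(hours=3), "rpo": timedelta(hours=1)},
    "Geo Replication": {"ert": timedelta(seconds=30), "rpo": timedelta(seconds=10)},
}


def features_meeting(required_rto, required_rpo):
    """Return the features whose quoted bounds fit within the requirements."""
    return [
        name
        for name, bound in INDICATORS.items()
        # ERT starts when the restore or failover request is sent, so it is
        # only a lower estimate of the outage that RTO measures.
        if bound["ert"] <= required_rto and bound["rpo"] <= required_rpo
    ]


# Hypothetical requirement: recover within 5 minutes, lose at most 1 minute of data.
print(features_meeting(timedelta(minutes=5), timedelta(minutes=1)))
```

With that hypothetical requirement only Geo Replication qualifies; a requirement of several hours would admit both features.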

Disaster recovery solutions

Three disaster-recovery solutions are available:

  • MySQL PaaS service restore solution (uses Geo Restore): When a disaster occurs, the MySQL Database on Azure service initiates an emergency response and determines whether the cause of the disaster can be resolved quickly (in a shorter time than the RPO). If a quick recovery is not possible, MySQL Database on Azure performs an offsite database restore on all affected instances. The restore point is the closest available point in time to the time at which the fault occurred.

Note

By default, the PaaS service restore solution designates a point in time at which the data affected by the interruption was still intact. MySQL Database on Azure then makes every effort to restore the data to this point in time, or to the closest restorable point in time, for the instance’s designated region.

  • Self-service restore solution (uses Geo Restore): If your production environment has stricter recovery time requirements, you can use the PowerShell command line to manually restore the affected instances offsite.

User self-service description

If the Management Portal is accessible when a disaster occurs, you can follow the remote recovery procedure provided in Backup and restore MySQL Database on Azure. However, during a regional disaster it is often not possible to obtain correct instance information in the Azure portal. In such a situation, we recommend that you perform an offsite restore operation on the instance by using PowerShell:

New-AzureRmResource -ResourceType "Microsoft.MySql/servers" `
    -ResourceName <ResourceName> `
    -ApiVersion 2015-09-01 `
    -ResourceGroupName <ResourceGroupName> `
    -Location <TargetLocation> `
    -SkuObject @{name='<targetSKU>'} `
    -Properties @{
        creationSource = @{
            server    = '<SourceServerName>'
            region    = '<SourceLocation>'
            timepoint = '<TimeTag>'
        }
        version = '<version number>'
    }

Note

  1. Timepoint is an optional value. If you do not enter this value, the default is the current point in time. If you enter this value, you must use the JSON date-time format, for example, 2016-05-06T08:00:00. All times are in UTC.
  2. Instances that were created by using the Azure portal are allocated by default to the “Default-MySQL-ChinaNorth” or “Default-MySQL-ChinaEast” resource group, depending on their geographic location.
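Because the timepoint must be expressed in UTC, a local time has to be converted before it is placed in the command. The following sketch shows one way to produce a string in the format used by the example above. The helper name `to_timepoint` is hypothetical, and the UTC+8 offset is only an assumed local time zone.

```python
from datetime import datetime, timedelta, timezone


def to_timepoint(moment):
    """Format a timezone-aware datetime as a UTC string such as 2016-05-06T08:00:00."""
    if moment.tzinfo is None:
        # A naive datetime has no offset, so the UTC conversion would be a guess.
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# Assumed local time zone for this example: UTC+8.
local_zone = timezone(timedelta(hours=8))
timepoint = to_timepoint(datetime(2016, 5, 6, 16, 0, 0, tzinfo=local_zone))
print(timepoint)
```

Here 16:00 local time on 2016-05-06 becomes 2016-05-06T08:00:00, the value shown in the note.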

After the restored instance has been created, add the IP address of your current client to the firewall whitelist of the new instance. You also need to manually update the connection strings that your applications use for the database with the hostname of the new instance to restore the application-layer services.
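Updating a connection string amounts to replacing the host while leaving the credentials, port, and database name unchanged. The following sketch assumes the application keeps its connection settings as a URL-style string. The helper name `replace_host`, the hostnames, and the credentials are all hypothetical placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(connection_url, new_host):
    """Return connection_url with only the hostname swapped for new_host."""
    parts = urlsplit(connection_url)
    # The network location looks like "user:password@host:port".
    userinfo, at_sign, host_and_port = parts.netloc.rpartition("@")
    _old_host, colon, port = host_and_port.partition(":")
    new_netloc = f"{userinfo}{at_sign}{new_host}{colon}{port}"
    return urlunsplit(parts._replace(netloc=new_netloc))


# Hypothetical values, for illustration only.
old_url = "mysql://appuser:secret@old-instance.example.com:3306/appdb"
new_url = replace_host(old_url, "restored-instance.example.com")
print(new_url)
```

Keeping the hostname in one configuration value, rather than spread across the application code, makes this step quicker during an actual recovery.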

  • Geo-subordinate instance restore solution (uses Geo Replication): If you create a geo-subordinate instance in advance, then when the master instance is not available you can switch to the geo-subordinate instance, either manually through the Azure portal or automatically from your applications, to achieve rapid disaster recovery.

To use geo-subordinate instance restore, do the following:

  1. Log in to the Management Portal, and then select the geo-subordinate instance.
  2. Promote the geo-subordinate instance to a standalone instance. For more information, see the “Promote subordinate instance” section in MySQL Master-Slave replication and read-only instances.
  3. To connect to the promoted new instance, modify the connection address of the application client.
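For applications that switch automatically, step 3 can be approximated by probing a list of candidate hosts in order and connecting to the first one that answers. The following is a minimal sketch under that assumption. The names `is_reachable` and `pick_host` and the hostnames are hypothetical, and port 3306 is the standard MySQL port. A successful TCP connection shows only that the host is reachable; it does not show that the geo-subordinate instance has already been promoted.

```python
import socket


def is_reachable(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_host(candidates, probe=is_reachable):
    """Return the first candidate that the probe accepts, or None if none does."""
    for host in candidates:
        if probe(host):
            return host
    return None


# Hypothetical hostnames: the master instance first, then the promoted instance.
CANDIDATES = ["master-instance.example.com", "geo-instance.example.com"]

# The application would call pick_host(CANDIDATES) before opening its database
# connection and use the returned hostname as the connection address.
```

The `probe` parameter is injectable so that the selection logic can be exercised without network access.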