MySQL Database on Azure service continuity solutions

This article summarizes several typical interruption scenarios and the MySQL Database on Azure solutions that maintain service continuity, and it introduces the steps and indicators for regional disaster recovery.

Overview

Service continuity means designing, deploying, and running applications so that planned or unplanned interruption events do not leave them temporarily or permanently unable to perform their service functions.

The goal of service continuity is to reduce the effect of interruptions on application services, minimize how long those effects last, and avoid data loss.

Unplanned interruption events include human error, temporary or permanent interruptions, and even regional disasters (which might cause an Azure region to suffer a large-scale loss of functionality).

Planned events include redeploying applications in different regions and upgrading applications.

Before we look at service continuity solutions, it is important to be familiar with the following concepts:

  • Recovery time objective (RTO): The maximum acceptable time after an interruption incident occurs before the application fully recovers. RTO is used to measure the maximum availability loss during the outage.
  • Recovery point objective (RPO): The maximum amount of recent updates, expressed as a time interval, that might be lost between the interruption incident and the full recovery of the application. RPO is used to measure the maximum data loss during the outage.
  • Estimated recovery time (ERT): The estimated length of time after a restore or failover request is sent and before the database is fully available.
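As a rough illustration of how these three measures relate, the following Python sketch computes the observed downtime, data-loss window, and recovery time for a single incident and compares them with an RTO and RPO. The `Incident` class, its field names, and the helper functions are hypothetical and exist only for this example; they are not part of MySQL Database on Azure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps of one interruption (hypothetical model for illustration)."""
    last_recoverable_write: datetime  # newest update that survived the incident
    outage_start: datetime            # when the service became unavailable
    restore_requested: datetime       # when the restore or failover was requested
    service_restored: datetime        # when the database was fully available again


def measure(incident: Incident) -> dict:
    """Return the observed downtime, data-loss window, and recovery time."""
    return {
        # Compared against the RTO.
        "downtime": incident.service_restored - incident.outage_start,
        # Compared against the RPO.
        "data_loss_window": incident.outage_start - incident.last_recoverable_write,
        # The observed counterpart of the ERT.
        "recovery_time": incident.service_restored - incident.restore_requested,
    }


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """Check the observed downtime and data loss against the RTO and RPO."""
    observed = measure(incident)
    return observed["downtime"] <= rto and observed["data_loss_window"] <= rpo
```

For example, an outage that starts at 08:00, loses updates written after 07:50, and ends at 10:00 has two hours of downtime and a ten-minute data-loss window. It would satisfy an RTO of three hours and an RPO of one hour, but not an RTO of one hour.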

Service continuity solutions

This article introduces several main scenarios and corresponding solutions.

Scenario | Description | Solution
Recover from service interruptions caused by human error | An operational error by a database administrator in a production environment causes the loss of important data, which must be recovered rapidly. | MySQL Database on Azure supports rollback to any point in time within the last seven days. For the specific steps, see the "Restore the database to any time point" section of Backup and restore MySQL Database on Azure.
Recover from service interruptions caused by a particular upgrade | A failed upgrade in a production environment causes compatibility problems, and services no longer operate normally. | The system automatically creates snapshot backups of the database. If a problem occurs during the upgrade, you can quickly restore a complete backup to a new instance. For the specific steps, see the "Restore the database to any time point" section of Backup and restore MySQL Database on Azure.
Recover from a regional disaster | A regional disaster causes widespread service interruptions, and you need to perform offsite recovery quickly. | You can choose from three types of recovery solutions based on the severity of the situation. The key points are explained later in this article.

Recover from a regional disaster

MySQL Database on Azure provides the Geo Restore and Geo Replication features to help you maintain service continuity during a regional disaster (for example, a large-scale power outage in a datacenter, a fire, an earthquake, or another force majeure event).

Geo Restore

At the underlying layer, MySQL Database on Azure stores user data in Azure Blob storage and uses read-access geo-redundant storage, which keeps three copies of the data in the local region and three copies offsite. The offsite copies are read-only.

When you perform an offsite restore operation and local storage is available, MySQL Database on Azure copies the data from the local region to the target region based on the point in time that you designate, and then performs database recovery.

If local storage is unavailable, MySQL Database on Azure performs the database recovery offsite by using the storage-layer replicated data that is nearest to the designated point in time.

With Geo Restore, you can choose either the PaaS service restore solution or the self-service restore solution for disaster recovery. The performance indicators for Geo Restore are ERT < 3 hours and RPO < 1 hour.

Geo Replication

If you create a geo-subordinate instance through the Azure portal in advance, you can promote this subordinate instance and switch services to it during a regional disaster to maintain service continuity. The performance indicators for Geo Replication are ERT < 30 seconds and RPO < 10 seconds.

Note

ERT, RTO, and RPO are engineering indicators that are intended only for reference purposes. These indicators apply only to regional disasters and are not part of the MySQL database service’s service level agreement (SLA).
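To show how these indicators might inform a choice between the two features, here is a minimal Python sketch that lists the features whose quoted indicators fit within an application's recovery requirements. The `SOLUTIONS` table and the `candidate_solutions` function are hypothetical names for this example. The sketch also assumes, as a simplification, that the ERT can stand in for the recovery time that is compared with the RTO, even though the ERT is counted only from the moment the restore or failover request is sent.

```python
from datetime import timedelta

# Reference indicators quoted in this article. They are upper bounds for
# reference only and are not part of the SLA.
SOLUTIONS = {
    "Geo Restore": {"ert": timedelta(hours=3), "rpo": timedelta(hours=1)},
    "Geo Replication": {"ert": timedelta(seconds=30), "rpo": timedelta(seconds=10)},
}


def candidate_solutions(required_rto: timedelta, required_rpo: timedelta) -> list:
    """Return the features whose ERT and RPO indicators fit both requirements."""
    return [
        name
        for name, indicator in SOLUTIONS.items()
        if indicator["ert"] <= required_rto and indicator["rpo"] <= required_rpo
    ]
```

Under these assumptions, an application that tolerates four hours of downtime and two hours of data loss could use either feature, whereas one that requires recovery within five minutes is left with Geo Replication only.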

Disaster recovery solutions

Three disaster-recovery solutions are available:

  • MySQL PaaS service restore solution (use Geo Restore): When a disaster occurs, the MySQL Database on Azure service initiates an emergency response and determines whether the cause of the disaster can be resolved quickly (in less time than the RPO). If a quick fix is not possible, MySQL Database on Azure performs an offsite database restore on all affected instances. The restore uses the available restore point that is closest to the time at which the fault occurred.

    Note

    By default, the PaaS service restore solution designates a point in time at which the data affected by the interruption was still intact. MySQL Database on Azure then does everything possible to restore the data to this point in time, or to the closest restorable point in time, in the region designated for the instance.

  • Self-service restore solution (use Geo Restore): If your production environment has stricter recovery-time requirements, you can use PowerShell to manually restore the affected instances offsite.

    If the Azure portal is accessible when a disaster occurs, you can follow the remote recovery procedure provided in Backup and restore MySQL Database on Azure. However, during a regional disaster it is often not possible to obtain correct instance information in the Azure portal. In that situation, we recommend that you perform the offsite restore operation on the instance by using PowerShell:

    New-AzureRmResource -ResourceType "Microsoft.MySql/servers" `
        -ResourceName <ResourceName> `
        -ApiVersion 2015-09-01 `
        -ResourceGroupName <ResourceGroupName> `
        -Location <TargetLocation> `
        -SkuObject @{name='<targetSKU>'} `
        -Properties @{creationSource=@{server='<SourceServerName>';region='<SourceLocation>';timepoint='<TimeTag>'};version='<version number>'}

    Note

    • Timepoint is an optional value. If you omit it, the default is the current point in time. If you specify it, use the JSON date-time format, for example, 2016-05-06T08:00:00. All times are in UTC.
    • Instances that were created by using the Azure portal are allocated by default to the Default-MySQL-ChinaNorth or Default-MySQL-ChinaEast resource group, depending on their geographical location.

    After the restored instance has been created, you must add your current IP address to the firewall whitelist of the new instance. To restore the application-layer services, you must also manually update the connection strings that your applications use to reach the database so that they contain the hostname of the new instance.

  • Geo-subordinate instance restore solution (use Geo Replication): If you create a geo-subordinate instance in advance, you can switch to it when the master instance is unavailable, either manually through the Azure portal or automatically from your applications, to achieve rapid disaster recovery.

    To use the geo-subordinate instance restore solution, do the following:

    1. Sign in to the Azure portal, and then select the remote subordinate instance.
    2. Promote the geo-subordinate instance to a single instance. For more information, see the “Promote subordinate instance” section in MySQL master-subordinate replication and read-only instances.
    3. To connect to the promoted new instance, modify the connection address of the application client.
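The last step, and the automatic switching from applications mentioned above, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `is_reachable` and `pick_host` functions are hypothetical helpers, the hostnames in the usage note are placeholders, and a successful TCP connection on port 3306 does not prove that the subordinate instance has already been promoted or accepts writes.

```python
import socket
from typing import Callable, Optional, Sequence


def is_reachable(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def pick_host(
    hosts: Sequence[str],
    probe: Callable[[str], bool] = is_reachable,
) -> Optional[str]:
    """Return the first host, in priority order, that the probe accepts."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

An application could call `pick_host(["master.example.invalid", "subordinate.example.invalid"])` before it opens a database connection and then build its connection string from the returned hostname. The `probe` parameter is injectable so that the selection logic can be exercised without network access.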