Azure storage disaster recovery planning and failover
We strive to ensure that Azure services are always available. However, unplanned service outages might occasionally occur. Key components of a good disaster recovery plan include strategies for:
- Data protection
- Backup and restore
- Data redundancy
- Failover
- Designing applications for high availability
This article describes the options available for geo-redundant storage accounts, and provides recommendations for developing highly available applications and testing your disaster recovery plan.
Choose the right redundancy option
Azure Storage maintains multiple copies of your storage account to ensure that availability and durability targets are met, even in the face of failures. The way in which data is replicated provides differing levels of protection. Each option offers its own benefits, so the option you choose depends upon the degree of resiliency your applications require.
Locally redundant storage (LRS), the lowest-cost redundancy option, automatically stores and replicates three copies of your storage account within a single datacenter. Although LRS protects your data against server rack and drive failures, it doesn't account for disasters such as fire or flooding within a datacenter. In the face of such disasters, all replicas of a storage account configured to use LRS might be lost or unrecoverable.
By comparison, zone-redundant storage (ZRS) retains a copy of a storage account and replicates it in each of three separate availability zones within the same region. For more information about availability zones, see Azure availability zones.
Geo-redundant storage and failover
Geo-redundant storage (GRS), geo-zone-redundant storage (GZRS), and read-access geo-zone-redundant storage (RA-GZRS) are examples of geographically redundant storage options. When a storage account is configured to use geo-redundant storage (GRS, GZRS, or RA-GZRS), Azure copies your data asynchronously to a secondary geographic region. The secondary region is located hundreds, or even thousands, of miles from the primary region. This level of redundancy allows you to recover your data if there's an outage throughout the entire primary region.
Unlike LRS and ZRS, geo-redundant storage also provides support for an unplanned failover to a secondary region if there's an outage in the primary region. During the failover process, DNS (Domain Name System) entries for your storage account service endpoints are automatically updated such that the secondary region's endpoints become the new primary endpoints. Once the unplanned failover is complete, clients can begin writing to the new primary endpoints.
Read-access geo-redundant storage (RA-GRS) and read-access geo-zone-redundant storage (RA-GZRS) also provide geo-redundant storage, but offer the added benefit of read access to the secondary endpoint. These options are ideal for business-critical applications designed for high availability. If the primary endpoint experiences an outage, applications configured for read access to the secondary region can continue to operate. Azure recommends RA-GZRS for maximum availability and durability of your storage accounts.
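The read fallback described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the standard public-cloud blob endpoint suffix, where the read-only secondary endpoint appends `-secondary` to the account name. The `read` callable and the choice of `ConnectionError` as the outage signal are hypothetical stand-ins for whatever client your application uses.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def blob_endpoints(account_name: str) -> Tuple[str, str]:
    """Return the (primary, secondary) blob endpoints for an account.

    Assumes the public-cloud suffix "blob.core.windows.net".
    """
    primary = f"https://{account_name}.blob.core.windows.net"
    # The read-only secondary endpoint appends "-secondary" to the account name.
    secondary = f"https://{account_name}-secondary.blob.core.windows.net"
    return primary, secondary


def read_with_fallback(read: Callable[[str], T], account_name: str) -> T:
    """Try the primary endpoint first, then fall back to the secondary.

    `read` is a hypothetical callable that takes an endpoint URL and
    raises ConnectionError when that endpoint is unreachable.
    """
    primary, secondary = blob_endpoints(account_name)
    try:
        return read(primary)
    except ConnectionError:
        # Data on the secondary can lag behind the primary, so a read
        # served from here might be slightly stale.
        return read(secondary)
```

Only reads can be redirected this way. Writes to the secondary endpoint aren't possible until a failover makes it the new primary.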
For more information about redundancy for Azure Storage, see Azure Storage redundancy.
Plan for failover
Azure Storage accounts support two types of failover:
- Customer-managed (unplanned) failover - Customers can manage storage account failover if there's an unexpected service outage.
- Azure-managed failover - Potentially initiated by Azure due to a severe disaster in the primary region. 1,2
1 Azure-managed failover can't be initiated for individual storage accounts, subscriptions, or tenants. For more information, see Azure-managed failover.
2 Use customer-managed failover options to develop, test, and implement your disaster recovery plans. Do not rely on Azure-managed failover, which would only be used in extreme circumstances.
Each type of failover has a unique set of use cases, corresponding expectations for data loss, and support for accounts with a hierarchical namespace enabled (Azure Data Lake Storage). This table summarizes those aspects of each type of failover:
Type | Failover Scope | Use case | Expected data loss | Hierarchical Namespace (HNS) supported |
---|---|---|---|---|
Customer-managed (unplanned) failover | Storage account | The storage service endpoints for the primary region become unavailable, but the secondary region is available. You received an Azure Advisory in which Azure advises you to perform a failover operation of storage accounts potentially affected by an outage. | Yes | No |
Azure-managed | Entire region | The primary region becomes unavailable due to a significant disaster, but the secondary region is available. | Yes | Yes |
Customer-managed (unplanned) failover
If the data endpoints for the storage services in your storage account become unavailable in the primary region, you can initiate an unplanned failover to the secondary region. After the failover is complete, the secondary region becomes the new primary and users can proceed to access data there.
To understand the effect of this type of failover on your users and applications, it's helpful to know what happens during every step of the unplanned failover and failback process. For details about how the process works, see How customer-managed (unplanned) failover works.
Azure-managed failover
Azure might initiate a regional failover in extreme circumstances, such as a catastrophic disaster that impacts an entire geo region. During these events, no action on your part is required. If your storage account is configured for RA-GRS or RA-GZRS, your applications can read from the secondary region during an Azure-managed failover. However, you don't have write access to your storage account until the failover process is complete.
Important
Use customer-managed failover options to develop, test, and implement your disaster recovery plans. Do not rely on Azure-managed failover, which might only be used in extreme circumstances. An Azure-managed failover would be initiated for an entire physical unit, such as a region or a datacenter. It can't be initiated for individual storage accounts, subscriptions, or tenants. If you need the ability to selectively fail over individual storage accounts, use customer-managed (unplanned) failover.
Anticipate data loss and inconsistencies
Caution
Customer-managed unplanned failover usually involves some amount of data loss, and can also potentially introduce file and data inconsistencies. In your disaster recovery plan, it's important to consider the impact that an account failover would have on your data before initiating one.
Because data is written asynchronously from the primary region to the secondary region, there's always a delay before a write to the primary region is copied to the secondary. If the primary region becomes unavailable, it's possible that the most recent writes might not yet be copied to the secondary.
When an unplanned failover occurs, all data in the primary region is lost as the secondary region becomes the new primary. All data already copied to the secondary region is maintained when the failover happens. However, any data written to the primary that doesn't yet exist within the secondary region is lost permanently.
The new primary region is configured to be locally redundant (LRS) after the failover.
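You also might experience file or data inconsistencies if your storage accounts have one or more of the following enabled:
- Hierarchical namespace (Azure Data Lake Storage)
- Change feed
- Point-in-time restore for block blobs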
Last sync time
The Last Sync Time property indicates the most recent time at which data from the primary region was also written to the secondary region. For accounts that have a hierarchical namespace, the same Last Sync Time property also applies to the metadata managed by the hierarchical namespace, including access control lists (ACLs). All data and metadata written before the last sync time is available on the secondary. By contrast, data and metadata written after the last sync time might not yet be copied to the secondary and could potentially be lost. During an outage, use this property to estimate the amount of data loss you might incur when initiating an account failover.
As a best practice, design your application so that you can use Last Sync Time to evaluate expected data loss. For example, logging all write operations allows you to compare the times of your last write operation to the last sync time. This method enables you to determine which writes aren't yet synced to the secondary and are in danger of being lost.
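The comparison between logged writes and the last sync time can be sketched in a few lines. This is an illustrative example only: the `WriteRecord` type and the in-memory log are hypothetical, and the sketch assumes your application records a UTC timestamp for each write and that you have already retrieved the Last Sync Time value for the account.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple


class WriteRecord(NamedTuple):
    """One entry in a hypothetical application-side write log."""
    blob_name: str
    written_at: datetime  # UTC time at which the application logged the write


def writes_at_risk(log: Iterable[WriteRecord],
                   last_sync_time: datetime) -> List[WriteRecord]:
    """Return the writes logged after the last sync time.

    These writes might not yet be copied to the secondary region and
    could be lost if a failover is initiated now.
    """
    return [record for record in log if record.written_at > last_sync_time]


# Example data (made up for illustration).
log = [
    WriteRecord("orders/1001.json", datetime(2024, 5, 1, 9, 58, tzinfo=timezone.utc)),
    WriteRecord("orders/1002.json", datetime(2024, 5, 1, 10, 3, tzinfo=timezone.utc)),
    WriteRecord("orders/1003.json", datetime(2024, 5, 1, 10, 7, tzinfo=timezone.utc)),
]
last_sync_time = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)

at_risk = writes_at_risk(log, last_sync_time)
```

In this example, `at_risk` contains the two records written after 10:00 UTC. Those are the writes you would need to replay or reconcile after a failover.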
For more information about checking the Last Sync Time property, see Check the Last Sync Time property for a storage account.
File consistency for Azure Data Lake Storage
Replication for storage accounts with a hierarchical namespace enabled (Azure Data Lake Storage) occurs at the file level. Because replication occurs at this level, an outage in the primary region might prevent some of the files within a container or directory from successfully replicating to the secondary region. Consistency for all files within a container or directory after a storage account failover isn't guaranteed.
Change feed and blob data inconsistencies
Customer-managed (unplanned) failover of storage accounts with change feed enabled could result in inconsistencies between the change feed logs and the blob data and/or metadata. Such inconsistencies can result from the asynchronous nature of change log updates and data replication between the primary and secondary regions. You can avoid inconsistencies by taking the following precautions:
- Ensure that all log records are flushed to the log files.
- Ensure that all storage data is replicated from the primary to the secondary region.
For more information about change feed, see How the change feed works.
Keep in mind that other storage account features also require the change feed to be enabled. These features include operational backup of Azure Blob Storage, Object replication, and Point-in-time restore for block blobs.
Point-in-time restore inconsistencies
Customer-managed failover is supported for general-purpose v2 standard tier storage accounts that include block blobs. However, performing a customer-managed failover on a storage account resets the earliest possible restore point for the account. Data for Point-in-time restore for block blobs is only consistent up to the failover completion time. As a result, you can only restore block blobs to a point in time no earlier than the failover completion time. You can check the failover completion time in the redundancy tab of your storage account in the Azure portal.
The time and cost of failing over
The time it takes for a customer-managed failover to complete after being initiated can vary, although it typically takes less than one hour.
Initiating a customer-managed unplanned failover automatically converts your storage account to locally redundant storage (LRS) within a new primary region, and deletes the storage account in the original primary region.
You can re-enable geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS) for the account, but re-replicating data to the new secondary region incurs a charge. Additionally, any archived blobs need to be rehydrated to an online tier before the account can be reconfigured for geo-redundancy. This rehydration also incurs an extra charge. For more information about pricing, see the Azure pricing pages for Azure Storage and bandwidth.
After you re-enable GRS for your storage account, Azure begins replicating the data in your account to the new secondary region. The amount of time it takes for replication to complete depends on several factors. These factors include:
- The number and size of the objects in the storage account. Replicating many small objects can take longer than replicating fewer and larger objects.
- The available resources for background replication, such as CPU, memory, disk, and WAN capacity. Live traffic takes priority over geo replication.
- The number of snapshots per blob, if applicable.
- The data partitioning strategy, if your storage account contains tables. The replication process can't scale beyond the number of partition keys that you use.
Supported storage account types
All geo-redundant offerings support Azure-managed failover. In addition, some account types support customer-managed account failover, as shown in the following table:
Type of failover | GRS/RA-GRS | GZRS/RA-GZRS |
---|---|---|
Customer-managed (unplanned) failover | General-purpose v2 accounts, General-purpose v1 accounts, Legacy Blob Storage accounts | General-purpose v2 accounts |
Azure-managed failover | All account types | General-purpose v2 accounts |
Classic storage accounts
Important
Customer-managed failover is only supported for storage accounts deployed using the Azure Resource Manager (ARM) deployment model. The Azure Service Manager (ASM) deployment model, also known as the classic model, isn't supported. To make classic storage accounts eligible for customer-managed account failover, they must first be migrated to the ARM model. Your storage account must be accessible to perform the upgrade, so the primary region can't currently be in a failed state.
During a disaster that affects the primary region, Azure will manage the failover for classic storage accounts. For more information, see Azure-managed failover.
Hierarchical namespace (HNS)
Customer-managed account failover is not yet supported in accounts that have a hierarchical namespace enabled (Azure Data Lake Storage Gen2). To learn more, see Blob storage features available in Azure Data Lake Storage Gen2.
Unsupported features and services
The following features and services aren't supported for customer-managed failover:
- Azure File Sync doesn't support customer-managed account failover. Storage accounts used as cloud endpoints for Azure File Sync shouldn't be failed over. Failover disrupts file sync and might cause unexpected data loss for newly tiered files. For more information, see Best practices for disaster recovery with Azure File Sync.
- A storage account containing premium block blobs can't be failed over. Storage accounts that support premium block blobs don't currently support geo-redundancy.
- Customer-managed failover isn't supported for either the source or the destination account in an object replication policy.
- To fail over an account with SSH File Transfer Protocol (SFTP) enabled, you must first disable SFTP for the account. If you want to resume using SFTP after the failover is complete, simply re-enable it.
- Network File System (NFS) 3.0 (NFSv3) isn't supported for storage account failover. You can't create a storage account configured for geo-redundancy with NFSv3 enabled.
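The eligibility rules in this section can be collected into a pre-failover check. The following sketch is an assumption-laden illustration: the `AccountProfile` type, its field names, and the account-kind labels are hypothetical and don't correspond to any Azure API. In practice you would populate these values from your own inventory of storage account settings.

```python
from dataclasses import dataclass
from typing import List

# Account kinds that support customer-managed failover for each redundancy
# option, transcribed from the table in "Supported storage account types".
# The string labels are hypothetical, not Azure API values.
_ALL_KINDS = {"general-purpose v2", "general-purpose v1", "legacy Blob Storage"}
_V2_ONLY = {"general-purpose v2"}
SUPPORTED_KINDS = {
    "GRS": _ALL_KINDS,
    "RA-GRS": _ALL_KINDS,
    "GZRS": _V2_ONLY,
    "RA-GZRS": _V2_ONLY,
}


@dataclass
class AccountProfile:
    """Hypothetical summary of the settings that affect failover."""
    kind: str
    redundancy: str
    classic: bool = False
    hierarchical_namespace: bool = False
    azure_file_sync_endpoint: bool = False
    object_replication: bool = False
    sftp_enabled: bool = False


def failover_blockers(account: AccountProfile) -> List[str]:
    """List the reasons a customer-managed failover can't proceed."""
    blockers = []
    kinds = SUPPORTED_KINDS.get(account.redundancy)
    if kinds is None:
        # Covers LRS and ZRS, and therefore premium block blob and NFSv3
        # accounts, which can't be configured for geo-redundancy.
        blockers.append("account is not geo-redundant")
    elif account.kind not in kinds:
        blockers.append(f"{account.kind} is not supported with {account.redundancy}")
    if account.classic:
        blockers.append("classic accounts must first be migrated to Azure Resource Manager")
    if account.hierarchical_namespace:
        blockers.append("hierarchical namespace is enabled")
    if account.azure_file_sync_endpoint:
        blockers.append("account is a cloud endpoint for Azure File Sync")
    if account.object_replication:
        blockers.append("account is part of an object replication policy")
    if account.sftp_enabled:
        blockers.append("SFTP must be disabled before failover")
    return blockers
```

An empty list means none of the restrictions described in this article apply. It doesn't guarantee that the failover will succeed.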
Failover isn't for account migration
Storage account failovers are a temporary solution used to develop and test your disaster recovery (DR) plans, or to recover from a service outage. Failover shouldn't be used as part of your data migration strategy. For information about how to migrate your storage accounts, see Azure Storage migration overview.
Storage accounts containing archived blobs
Storage accounts containing archived blobs support account failover. However, after a customer-managed failover is complete, all archived blobs must be rehydrated to an online tier before the account can be configured for geo-redundancy.
Storage resource provider
Microsoft provides two REST APIs for working with Azure Storage resources. These APIs form the basis of all actions you can perform against Azure Storage. The Azure Storage REST API enables you to work with data in your storage account, including blob, queue, file, and table data. The Azure Storage resource provider REST API enables you to manage the storage account and related resources.
After a failover is complete, clients can once again read and write Azure Storage data in the new primary region. However, the Azure Storage resource provider doesn't fail over, so resource management operations must still take place in the primary region. If the primary region is unavailable, you can't perform management operations on the storage account.
Because the Azure Storage resource provider doesn't fail over, the Location property will return the original primary location after the failover is complete.
Azure virtual machines
Azure virtual machines (VMs) don't fail over as part of a storage account failover. Any VMs that failed over to a secondary region in response to an outage need to be recreated after the failover completes. Account failover can potentially result in the loss of data stored in a temporary disk when the virtual machine (VM) is shut down. Azure recommends following the high availability and disaster recovery guidance specific to virtual machines in Azure.
Azure unmanaged disks
Unmanaged disks are stored as page blobs in Azure Storage. When a VM is running in Azure, any unmanaged disks attached to the VM are leased. An account failover can't proceed when there's a lease on a blob. Before a failover can be initiated on an account containing unmanaged disks attached to Azure VMs, the VMs must be shut down. For this reason, Azure's recommended best practices include converting any unmanaged disks to managed disks.
To perform a failover on an account containing unmanaged disks, follow these steps:
1. Before you begin, note the names of any unmanaged disks, their logical unit numbers (LUN), and the VM to which they're attached. Doing so will make it easier to reattach the disks after the failover.
2. Shut down the VM.
3. Delete the VM, but retain the virtual hard disk (VHD) files for the unmanaged disks. Note the time at which you deleted the VM.
4. Wait until the Last Sync Time updates, and ensure that it's later than the time at which you deleted the VM. This step ensures that the secondary endpoint is fully updated with the VHD files when the failover occurs, and that the VM functions properly in the new primary region.
5. Initiate the account failover.
6. Wait until the account failover is complete and the secondary region becomes the new primary region.
7. Create a VM in the new primary region and reattach the VHDs.
8. Start the new VM.
Keep in mind that any data stored in a temporary disk is lost when the VM is shut down.
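The wait for the Last Sync Time to pass the VM deletion time can be automated with a simple polling loop. The following is a sketch under stated assumptions: `get_last_sync_time` is a hypothetical callable that you supply (for example, a wrapper around whichever tool you use to query the property), and the polling interval and timeout defaults are arbitrary.

```python
import time
from datetime import datetime
from typing import Callable


def wait_for_sync(get_last_sync_time: Callable[[], datetime],
                  vm_deleted_at: datetime,
                  poll_seconds: float = 60.0,
                  timeout_seconds: float = 3600.0,
                  sleep: Callable[[float], None] = time.sleep) -> datetime:
    """Block until the Last Sync Time is later than the VM deletion time.

    `get_last_sync_time` is a hypothetical callable returning the current
    Last Sync Time for the account. Returns the qualifying value, or raises
    TimeoutError if it doesn't advance far enough within the timeout.
    """
    waited = 0.0
    while True:
        last_sync = get_last_sync_time()
        if last_sync > vm_deleted_at:
            # Everything written before the VM was deleted, including the
            # final state of the VHD files, is now on the secondary.
            return last_sync
        if waited >= timeout_seconds:
            raise TimeoutError("Last Sync Time did not pass the VM deletion time")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Only initiate the account failover after this function returns.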
Copying data as a failover alternative
As previously discussed, you can maintain high availability by configuring applications to use a storage account configured for read access to a secondary region. However, if you prefer not to fail over during an outage within the primary region, you can manually copy your data as an alternative. Tools such as AzCopy and Azure PowerShell enable you to copy data from your storage account in the affected region to another storage account in an unaffected region. After the copy operation is complete, you can reconfigure your applications to use the storage account in the unaffected region for both read and write availability.
Design for high availability
It's important to design your application for high availability from the start. Refer to these Azure resources for guidance when designing your application and planning for disaster recovery:
- Designing resilient applications for Azure: An overview of the key concepts for architecting highly available applications in Azure.
- Resiliency checklist: A checklist for verifying that your application implements the best design practices for high availability.
- Use geo-redundancy to design highly available applications: Design guidance for building applications to take advantage of geo-redundant storage.
- Tutorial: Build a highly available application with Blob storage: A tutorial that shows how to build a highly available application that automatically switches between endpoints as failures and recoveries are simulated.
Refer to these best practices to maintain high availability for your Azure Storage data:
- Disks: Use Azure Backup to back up the VM disks used by your Azure virtual machines. Also consider using Azure Site Recovery to protect your VMs from a regional disaster.
- Block blobs: Turn on soft delete to protect against object-level deletions and overwrites, or copy block blobs to another storage account in a different region using AzCopy, Azure PowerShell, or the Azure Data Movement library.
- Files: Use Azure Backup to back up your file shares. Also enable soft delete to protect against accidental file share deletions. For geo-redundancy when GRS isn't available, use AzCopy or Azure PowerShell to copy your files to another storage account in a different region.
- Tables: Use AzCopy to export table data to another storage account in a different region.
Track outages
Customers can subscribe to the Azure Service Health Dashboard to track the health and status of Azure Storage and other Azure services.
Azure also recommends that you design your application to prepare for the possibility of write failures. Your application should expose write failures in a way that alerts you to the possibility of an outage in the primary region.
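One way to expose write failures is to count consecutive failures and raise an alert when a threshold is crossed. The sketch below is illustrative only: the `WriteFailureMonitor` class, the threshold, and the use of `OSError` (which includes `ConnectionError`) as the failure signal are all assumptions, and `do_write` stands in for whatever write call your application makes.

```python
from typing import Callable


class WriteFailureMonitor:
    """Track consecutive write failures and alert at a threshold."""

    def __init__(self, threshold: int, alert: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.alert = alert  # hypothetical hook, e.g. a pager or log call
        self.consecutive_failures = 0

    def write(self, do_write: Callable[[], None]) -> bool:
        """Run one write. Return True on success, False on failure."""
        try:
            do_write()
        except OSError as exc:
            self.consecutive_failures += 1
            # Alert once, when the threshold is first reached, to avoid
            # flooding operators during a sustained outage.
            if self.consecutive_failures == self.threshold:
                self.alert(
                    f"{self.consecutive_failures} consecutive write failures; "
                    f"possible primary region outage (last error: {exc})"
                )
            return False
        # Any successful write resets the counter.
        self.consecutive_failures = 0
        return True
```

An alert from a monitor like this is a prompt to check the service health status and the Last Sync Time before deciding whether to initiate a failover.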