High availability for Azure SQL Managed Instance

04/26/2024

Applies to: Azure SQL Managed Instance

This article describes high availability in Azure SQL Managed Instance.

Overview

The goal of the high availability architecture in Azure SQL Managed Instance is to minimize the impact on customer workloads from customer-initiated management operations that result in brief downtime, from service maintenance operations, and from unplanned outages. For more information about the SLAs for the different service tiers, review the SLA for Azure SQL Managed Instance.

High availability protects you from failures that affect the:

Rack where nodes powering your service are running
Node itself
Application layer

To minimize the impact of regional or larger outages, you can use one of the techniques covered in the overview of business continuity.

SQL Managed Instance runs on the latest stable version of the SQL Server Database Engine on the Windows operating system with all applicable patches. SQL Managed Instance automatically handles critical servicing tasks, such as patching, backups, and Windows and SQL engine upgrades, as well as unplanned events such as underlying hardware, software, or network failures. When an instance is patched or fails over, the downtime generally isn't impactful if your app employs retry logic. SQL Managed Instance can quickly recover even in the most critical circumstances, ensuring that your data is always available. Most users don't notice that upgrades are performed continuously.
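The retry logic mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: TransientError is a hypothetical stand-in for whatever exception your database driver raises when a connection drops during a failover, and the attempt count and delays are assumed values that you would tune for your workload.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver error raised during a failover."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0,
                   max_delay=30.0, sleep=time.sleep):
    """Call operation(); on TransientError, back off and try again.

    The defaults are illustrative assumptions, not documented values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff capped at max_delay, with random jitter so
            # that many clients don't reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

In a real application, operation would open a new connection and rerun the whole transaction, because a session that was interrupted by a failover can't be resumed.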

The high availability solution is designed to ensure that committed data is never lost due to failures, that maintenance operations don't affect your workload, and that the instance won't be a single point of failure in your software architecture.

There are two different high availability architectural models based on the service tier:

The remote storage model is based on a separation of compute and storage in the General Purpose service tier that relies on the high availability and reliability of the remote storage and the high availability of compute clusters managed by Azure Service Fabric. This high availability model targets budget-oriented business applications that can tolerate some performance degradation during maintenance activities.
The local storage model is based on a cluster of database engine processes that rely on a quorum of available database engine nodes in the Business Critical service tier that have local storage. This local storage model targets mission-critical applications that have a high transaction rate and require high IO performance. The high availability architecture guarantees minimal performance impact on your workload during maintenance activities.

Locally redundant availability

Locally redundant availability is based on placing your compute nodes and data within a single datacenter in the primary region, and protects your data in the event of a local failure, such as a small-scale network or power failure. If a large-scale disaster such as fire or flooding occurs within a region, all replicas of a storage account or data on the compute nodes might be lost or unrecoverable. As such, to further protect your data when using the locally redundant availability option, consider using a more resilient storage option for your database backups.

General Purpose service tier

The General Purpose service tier uses the remote storage availability architecture. The following figure shows four different nodes with the separated compute and storage layers.

Diagram showing separation of compute and storage.

The remote storage availability model includes two layers:

A stateless compute layer that runs the database engine process and contains only transient and cached data, such as the tempdb and model databases on the attached SSD, and the plan cache, buffer pool, and columnstore pool in memory. This stateless node is operated by Azure Service Fabric, which initializes the database engine, monitors the health of the node, and performs failover to another node if necessary.
A stateful data layer with the database files (.mdf and .ldf) stored in Azure Blob Storage. Azure Blob Storage has built-in data availability and redundancy features. Locally redundant availability is based on storing your data on locally redundant storage (LRS), which copies your data three times within a single datacenter in the primary region. It guarantees that every record in the log file or page in the data file is preserved even if the database engine process crashes.

Whenever the database engine or the operating system is upgraded, or a failure is detected, Azure Service Fabric moves the stateless database engine process to another stateless compute node with sufficient free capacity. Data in Azure Blob Storage isn't affected by the move, and the data and log files are attached to the newly initialized database engine process. This process guarantees high availability, but a heavy workload might experience some performance degradation during the transition because the new database engine process starts with a cold cache.

Business Critical service tier

The Business Critical service tier uses the local storage availability model, which integrates compute resources (database engine process) and storage (locally attached SSD) on a single node. High availability is achieved by replicating both compute and storage to additional nodes.

Diagram of a cluster of database engine nodes.

The underlying database files (.mdf/.ldf) are placed on attached SSD storage to provide very low latency IO for your workload. High availability is implemented using a technology similar to SQL Server Always On availability groups. The cluster includes a single primary replica that is accessible for read-write customer workloads, and up to three secondary replicas (compute and storage) that contain copies of data. The primary replica constantly pushes changes to the secondary replicas sequentially to ensure that data is persisted on a sufficient number of secondary replicas before committing each transaction. This process guarantees that, if the primary replica or a readable secondary replica becomes unavailable for any reason, a fully synchronized replica is always available to fail over to. Failover is initiated by Azure Service Fabric. Once a secondary replica becomes the new primary replica, another secondary replica is created to ensure the cluster has a sufficient number of replicas to maintain quorum. Once failover completes, Azure SQL connections are automatically redirected to the new primary replica (or to a readable secondary replica, based on the connection string).
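The commit rule described above, where a transaction commits only after enough secondary replicas have persisted the change, can be modeled with a toy sketch. This is a simplified illustration and not the service's actual replication protocol: the function name, the callables that stand in for replicas, and the required_acks parameter are all hypothetical, and the article doesn't state how many acknowledgments count as sufficient.

```python
def commit_transaction(log_record, secondaries, required_acks):
    """Push log_record to each secondary in turn and report whether enough
    of them persisted it for the primary to commit.

    secondaries is a list of callables that return True when the replica has
    persisted the record. required_acks is an assumed parameter; the source
    only says "a sufficient number".
    """
    acks = 0
    for persist in secondaries:
        if persist(log_record):
            acks += 1
        if acks >= required_acks:
            return True  # enough copies exist; safe to commit
    return False  # too few replicas acknowledged; don't commit


# Usage: three secondaries, one of them unavailable.
healthy = lambda record: True
unavailable = lambda record: False
committed = commit_transaction("log record", [healthy, unavailable, healthy],
                               required_acks=2)
```

The point of the model is that every committed record already exists on more than one node, which is why a synchronized replica is available when the primary fails.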

As an extra benefit, the local storage availability model includes the ability to redirect read-only Azure SQL connections to one of the secondary replicas. This feature is called Read Scale-Out. It provides 100% additional compute capacity at no extra charge to off-load read-only operations, such as analytical workloads, from the primary replica.
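As a sketch of how a client might opt in to Read Scale-Out, the helper below builds an ODBC connection string that sets ApplicationIntent=ReadOnly, the connection string keyword that requests a read-only replica. The function name, server name, and database name are hypothetical placeholders, authentication settings are deliberately omitted, and the driver name assumes that ODBC Driver 18 for SQL Server is installed.

```python
def build_connection_string(server, database, read_only=False):
    """Build an ODBC connection string, optionally requesting read-only intent.

    Authentication keywords are intentionally left out of this sketch.
    """
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",  # assumed installed driver
        f"Server={server}",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Ask the service to route this session to a readable secondary.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Query to run after connecting, to check which kind of replica answered:
# it returns READ_ONLY on a readable secondary and READ_WRITE on the primary.
CHECK_UPDATEABILITY_SQL = "SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability');"

# Hypothetical server and database names, for illustration only.
reporting = build_connection_string(
    "myinstance.example.database.windows.net", "SalesDb", read_only=True)
```

Keeping the read-only intent in a separate connection string lets an application send reporting queries to a secondary while its transactional code keeps using the primary.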

Test application fault resiliency

High availability is a fundamental part of the SQL Managed Instance platform that works transparently for your database application. However, we recognize that you might want to test how the automatic failover operations initiated during planned or unplanned events would impact an application before you deploy it to production. You can manually trigger a failover by calling a special API to restart a managed instance. In addition to testing how failover impacts existing database sessions, you can also verify if it changes the end-to-end performance due to changes in network latency. Because the restart operation is intrusive and a large number of them could stress the platform, only one failover call is allowed every 15 minutes for each managed instance.

A failover can be initiated using PowerShell, REST API, or Azure CLI:

PowerShell	REST API	Azure CLI
Invoke-AzSqlInstanceFailover	SQL Managed Instance - Failover	az sql mi failover can be used to invoke a REST API call from Azure CLI
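
For scripted resiliency tests, the sketch below wraps the az sql mi failover command and adds a client-side guard for the limit of one failover call every 15 minutes per managed instance. It's an illustration under assumptions: the FailoverGuard class and its method names are hypothetical, the guard only knows about calls made through the same object, and it assumes that the Azure CLI is installed and signed in.

```python
import subprocess
import time

FAILOVER_COOLDOWN_SECONDS = 15 * 60  # one failover call per 15 minutes


def build_failover_command(resource_group, instance_name):
    """Return the Azure CLI argument list that triggers a manual failover."""
    return ["az", "sql", "mi", "failover",
            "--resource-group", resource_group,
            "--name", instance_name]


class FailoverGuard:
    """Hypothetical helper that refuses calls made inside the cooldown."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_call = {}  # instance name -> time of last failover call

    def seconds_until_allowed(self, instance_name):
        last = self._last_call.get(instance_name)
        if last is None:
            return 0.0
        elapsed = self._clock() - last
        return max(0.0, FAILOVER_COOLDOWN_SECONDS - elapsed)

    def trigger(self, resource_group, instance_name, runner=subprocess.run):
        wait = self.seconds_until_allowed(instance_name)
        if wait > 0:
            raise RuntimeError(
                f"Failover for {instance_name} allowed in {wait:.0f} seconds")
        # check=True raises CalledProcessError if the CLI call fails.
        runner(build_failover_command(resource_group, instance_name),
               check=True)
        self._last_call[instance_name] = self._clock()
```

Calling FailoverGuard().trigger("my-resource-group", "my-instance") with your own (here hypothetical) names starts the failover. Combine it with retry logic in the application under test to observe how existing sessions recover.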

Conclusion

Azure SQL Managed Instance features a built-in high availability solution that is deeply integrated with the Azure platform. The service depends on Service Fabric to detect failures and recover from them, and on Azure Blob Storage to protect data. For the Business Critical service tier, SQL Managed Instance uses SQL Server Always On availability group technology for database replication and failover. The combination of these technologies enables applications to fully realize the benefits of a mixed storage model and supports the most demanding SLAs.

Next steps

Learn about Service Fabric
Learn about Azure Traffic Manager
Learn how to initiate a manual failover on SQL Managed Instance
For more options for high availability and disaster recovery, see Business Continuity