An Azure virtual network gateway is a component that provides secure connectivity between your Azure Virtual Network (VNet) and other networks, either your on-premises network or another VNet in Azure. There are two types of virtual network gateways: Azure ExpressRoute gateway, which uses private connections that don't traverse the public internet, and Azure VPN gateway, which uses encrypted tunnels over the internet. As an Azure service, a virtual network gateway provides a range of capabilities to support your reliability requirements.
When you use Azure, reliability is a shared responsibility. Azure provides a range of capabilities to support resiliency and recovery. You're responsible for understanding how those capabilities work within all of the services you use, and selecting the capabilities you need to meet your business objectives and uptime goals.
This article describes how to make a virtual network gateway resilient to a variety of potential outages and problems, including transient faults, availability zone outages, region outages, and planned service maintenance. It also highlights some key information about the virtual network gateway service level agreement (SLA).
Important
This article covers the reliability of ExpressRoute virtual network gateways, which are the Azure-based parts of the ExpressRoute system.
However, when you use ExpressRoute, it's critical that you design your entire network architecture, not just the gateway, to meet your resiliency requirements. Typically, you must use multiple sites (peering locations) and enable high availability and fast failover for your on-premises components. For more information, see Design and architect Azure ExpressRoute for resiliency.
Important
This article covers the reliability of virtual network gateways, which are the Azure-based parts of the Azure VPN Gateway service.
However, when you use VPNs, it's critical that you design your entire network architecture, not just the gateway, to meet your resiliency requirements. You're responsible for managing the reliability of your side of the VPN connection, including client devices for point-to-site configurations and remote VPN devices for site-to-site configurations. For more information about how to configure your infrastructure for high availability, see Design highly available gateway connectivity for cross-premises and VNet-to-VNet connections.
Production deployment recommendations
The Azure Well-Architected Framework provides recommendations across reliability, performance, security, cost, and operations. To understand how these areas influence each other and contribute to a reliable ExpressRoute solution, see Architecture best practices for Azure ExpressRoute in the Azure Well-Architected Framework.
To ensure high reliability for your production virtual network gateways, we recommend that you:
- Enable zone redundancy if your Azure VPN Gateway resources are in a supported region. Deploy VPN Gateway using supported SKUs (VpnGw1AZ or higher) to ensure access to zone redundancy features.
- Use Standard SKU public IP addresses.
- Configure active-active mode for higher availability, when supported by your remote VPN devices.
- Implement proper monitoring using Azure Monitor VPN Gateway metrics.
Reliability architecture overview
With ExpressRoute, you must deploy components in the on-premises environment, peering locations, and within Azure. These components include:
Circuits and connections: An ExpressRoute circuit consists of two connections through a single peering location to the Microsoft Enterprise Edge. By using two connections, you can achieve active-active connectivity. However, this configuration doesn't protect against site-level failures.
Customer premises equipment (CPE): Your edge routers and client devices. You need to ensure that your CPE is designed to be resilient to problems, and that it can quickly recover when problems happen in other parts of your ExpressRoute infrastructure.
Sites: Circuits are established through a site, which is a physical peering location. Sites are designed to be highly available and have built-in redundancy across all layers, but because they represent a single physical location, there is a possibility of sites having problems. To mitigate the risk of site outages, ExpressRoute offers different site resiliency options that vary in their level of protection.
Azure virtual network gateway: In Azure, you create a virtual network gateway that acts as the termination point for one or more ExpressRoute circuits within your Azure virtual network.
The following diagram shows two different ExpressRoute configurations, each with a single virtual network gateway, configured for different levels of resiliency across sites:
A VPN requires components to be deployed in both the on-premises environment and within Azure:
On-premises components: The components you deploy depend on whether you deploy a point-to-site or site-to-site configuration.
- Site-to-site configurations require an on-premises VPN device, which you're responsible for deploying, configuring, and managing.
- Point-to-site configurations require you to deploy a VPN client application in a remote device like a laptop or desktop, and import the user profile into the VPN client. Each point-to-site connection has its own user profile. You're responsible for deploying and configuring the client devices.
To learn more about the differences, see VPN Gateway topology and design.
Azure virtual network gateway: In Azure, you create a virtual network gateway, also called a VPN gateway, which acts as the termination point for VPN connections.
Local network gateway: A site-to-site VPN configuration also requires a local network gateway, which represents the remote VPN device. The local network gateway stores the following information:
- The public IP address of the on-premises VPN device to establish the IKE phase 1 and phase 2 connections
- The on-premises IP networks, for static routing
- The BGP IP address of the remote peer, for dynamic routing
The following diagram illustrates some key components in a VPN that connects from an on-premises environment to Azure:
Virtual network gateway
An ExpressRoute gateway contains two or more gateway virtual machines (VMs), which are the underlying VMs that your gateway uses to process ExpressRoute traffic.
A VPN virtual network gateway contains exactly two gateway virtual machines (VMs), which are the underlying VMs that your gateway uses to process VPN traffic.
You don't see or manage the gateway VMs directly. The platform automatically manages gateway VM creation, health monitoring, and the replacement of unhealthy gateway VMs. To achieve protection against server and server rack failures, Azure automatically distributes gateway VMs across multiple fault domains within a region. If a server rack fails, the Azure platform automatically migrates any gateway VM on the affected cluster to another cluster.
You configure the gateway SKU. Each SKU supports a different level of throughput, and a different number of circuits. When you use the ErGwScale SKU (preview), ExpressRoute automatically scales the gateway by adding more gateway VMs. For more information, see About ExpressRoute virtual network gateways.
A gateway runs in active-active mode by default, which supports high availability of your circuit. You can optionally switch to use active-passive mode, but this configuration increases the risk of a failure affecting your connectivity. For more information, see Active-active connections.
Ordinarily, traffic is routed through your virtual network gateway.
You configure the gateway SKU. Each SKU supports a different level of throughput, and a different number of VPN connections. For more information, see About gateway SKUs.
Depending on your high availability requirements, you can configure your gateway as active-standby, which means that one gateway VM processes traffic and the other is a standby gateway VM, or as active-active, which means that both gateway VMs process traffic. Active-active isn't always possible due to the asymmetric nature of connection flows. For more information, see Design highly available gateway connectivity for cross-premises and VNet-to-VNet connections.
You can protect against availability zone failures by distributing gateway VMs across multiple zones, providing automatic failover within the region, and maintaining connectivity during zone maintenance or outages. For more information, see Resilience to availability zone failures.
Resilience to transient faults
Transient faults are short, intermittent failures in components. They occur frequently in a distributed environment like the cloud, and they're a normal part of operations. Transient faults correct themselves after a short period of time. It's important that your applications can handle transient faults, usually by retrying affected requests.
All cloud-hosted applications should follow the Azure transient fault handling guidance when they communicate with any cloud-hosted APIs, databases, and other components. For more information, see Recommendations for handling transient faults.
For applications that connect through a virtual network gateway, implement retry logic with exponential backoff to handle potential transient connection problems. The stateful nature of virtual network gateways ensures that legitimate connections are maintained during brief network interruptions.
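The retry guidance above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `call_with_backoff`, its parameters, the default delays, and the choice of "full jitter" are all assumptions made for the example, and the set of exceptions that count as transient depends on your client library.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(), retrying transient failures with capped exponential backoff.

    All names and defaults here are hypothetical; tune them for your workload.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time between 0 and the ceiling so that
            # many clients don't retry in lockstep after the same interruption.
            sleep(random.uniform(0, ceiling))
```

Pass any zero-argument callable that performs the network request, for example `call_with_backoff(lambda: client.fetch(resource))`, where `client.fetch` stands in for your own call. Only retry operations that are safe to repeat.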
In a distributed networking environment, transient faults can occur at multiple layers, including:
- In your on-premises environment.
- In an edge site.
- In the internet.
- Within Azure.
ExpressRoute reduces the effect of transient faults by using redundant connection paths, fast fault detection, and automated failover. However, it's important that your applications and on-premises components are configured correctly to be resilient to a variety of issues. For comprehensive fault handling strategies, see Designing for high availability with ExpressRoute.
If the IP routing on the on-premises device is configured correctly, data traffic like TCP flows automatically transits through active IPsec tunnels in the event of a disconnection.
Transient faults can sometimes affect IPsec tunnels or TCP data flows. In the event of a disconnection, IKE (Internet Key Exchange) renegotiates the Security Associations (SAs) for both Phase 1 and Phase 2 to re-establish the IPsec tunnel.
Resilience to availability zone failures
Availability zones are physically separate groups of datacenters within each Azure region. When one zone fails, services can fail over to one of the remaining zones.
Virtual network gateways are automatically zone-redundant when they meet the requirements. Zone redundancy eliminates any single zone as a point of failure and provides the highest level of zone resiliency. Zone-redundant gateways provide automatic failover within the region, and maintain connectivity during zone maintenance or outages.
Zone-redundant ExpressRoute gateway VMs are automatically distributed across at least three availability zones.
The following diagram shows a zone-redundant virtual network gateway with three gateway VMs that are distributed across different availability zones:
Note
There's no availability zone configuration for circuits or connections. These resources are located in network edge facilities, which aren't designed to use availability zones.
In Azure VPN Gateway, zone redundancy means that the gateway VMs are automatically distributed across multiple availability zones.
The following diagram shows a zone-redundant virtual network gateway with two gateway VMs that are distributed across different availability zones:
Note
There's no availability zone configuration for local network gateways, because they're automatically zone-resilient.
When you use a supported SKU, any newly created gateway is automatically zone-redundant. Zone redundancy is recommended for all production workloads.
Requirements
- Region support: Zone-redundant virtual network gateways are available in all regions that support availability zones.
SKU: For a virtual network gateway to be zone-redundant, it must use a SKU that supports zone redundancy. The following table shows which SKUs support zone redundancy:
| SKU name | Supports availability zones |
| --- | --- |
| Standard | No |
| HighPerformance | No |
| UltraPerformance | No |
| ErGw1Az | Yes |
| ErGw2Az | Yes |
| ErGw3Az | Yes |
| ErGwScale | Yes |
SKU: For a virtual network gateway to be zone-redundant, it must use a SKU that supports zone redundancy. All tiers of Azure VPN Gateway support zone redundancy except the Basic SKU, which is only for development environments. For more information about SKU options, see About Gateway SKUs.
Public IP addresses: You must also use standard SKU public IP addresses and configure them to be zone-redundant.
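The requirements above can be expressed as a simple pre-deployment check. The following sketch is illustrative only: the function `zone_redundancy_gaps` and its parameters are hypothetical, not part of any Azure SDK. The SKU data is copied from the requirements in this section, and an unrecognized ExpressRoute SKU is conservatively treated as unsupported.

```python
# ExpressRoute gateway SKUs and their zone redundancy support, as listed in this section.
EXPRESSROUTE_ZONE_SUPPORT = {
    "Standard": False,
    "HighPerformance": False,
    "UltraPerformance": False,
    "ErGw1Az": True,
    "ErGw2Az": True,
    "ErGw3Az": True,
    "ErGwScale": True,
}


def zone_redundancy_gaps(gateway_type, sku, public_ip_sku,
                         public_ip_zone_redundant, region_has_zones):
    """Return a list of unmet zone redundancy requirements (empty if none).

    Hypothetical helper: inputs are plain values that you supply yourself.
    """
    gaps = []
    if not region_has_zones:
        gaps.append("region doesn't support availability zones")
    if gateway_type == "ExpressRoute":
        # Unknown SKUs are treated as unsupported (an assumption of this sketch).
        if not EXPRESSROUTE_ZONE_SUPPORT.get(sku, False):
            gaps.append(f"ExpressRoute SKU {sku} doesn't support zone redundancy")
    elif gateway_type == "Vpn":
        # Every VPN Gateway SKU except Basic supports zone redundancy.
        if sku == "Basic":
            gaps.append("VPN Gateway Basic SKU doesn't support zone redundancy")
    else:
        raise ValueError(f"unknown gateway type: {gateway_type}")
    if public_ip_sku.lower() != "standard":
        gaps.append("public IP address must use the Standard SKU")
    if not public_ip_zone_redundant:
        gaps.append("public IP address must be zone-redundant")
    return gaps
```

For example, `zone_redundancy_gaps("Vpn", "Basic", "Standard", True, True)` reports only the SKU problem, which makes it clear which single change is needed.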
Cost
Zone-redundant gateways for ExpressRoute require specific SKUs, which can have higher hourly rates compared to standard gateway SKUs due to their enhanced capabilities and performance characteristics. For pricing information, see Azure ExpressRoute pricing.
There's no extra cost for a gateway deployed across multiple availability zones, as long as you use a supported SKU. For pricing information, see VPN Gateway pricing.
Configure availability zone support
This section explains how to configure zone redundancy for your virtual network gateways.
- Create a new virtual network gateway with availability zone support. Any new virtual network gateways you create are automatically zone-redundant, if they meet the requirements listed above. For detailed configuration steps, see Create a zone-redundant virtual network gateway in availability zones.
- Change the availability zone configuration of an existing virtual network gateway. Virtual network gateways that you already created might not be zone-redundant. You can migrate a nonzonal gateway to a zone-redundant gateway with minimal downtime. For more information, see Migrate ExpressRoute gateways to availability zone-enabled SKUs.
- Change the availability zone configuration of an existing virtual network gateway. Virtual network gateways that you already created might not be zone-redundant. You can migrate a nonzonal gateway to a zone-redundant gateway with minimal downtime. For more information, see About SKU consolidation & migration.
Behavior when all zones are healthy
The following section describes what to expect when your virtual network gateway is configured for zone redundancy and all availability zones are operational.
Traffic routing between zones: Traffic from your on-premises environment is distributed among gateway VMs in all of the zones that your gateway uses. This active-active configuration ensures optimal performance and load distribution under normal operating conditions.
However, if you use FastPath for optimized performance, traffic from your on-premises environment bypasses the gateway, which improves throughput and reduces latency.
Data replication between zones: No data replication occurs between zones because the virtual network gateway doesn't store persistent customer data.
Traffic routing between zones: Zone redundancy doesn't affect how traffic is routed. Traffic is routed between the gateway VMs of your gateway based on the configuration of your clients. If your gateway uses an active-active configuration with two public IP addresses, both gateway VMs might receive traffic. In an active-standby configuration, traffic is routed to a single primary gateway VM that Azure selects.
Data replication between zones: Azure VPN Gateway doesn't need to synchronize connection state across availability zones. In active-active mode, the gateway VM that processes the VPN connection is responsible for managing the connection's state.
- Gateway VM management: The platform automatically selects the zones for your gateway VMs, and manages placement across the zones. Health monitoring ensures that only healthy gateway VMs receive traffic.
Behavior during a zone failure
The following section describes what to expect when your virtual network gateway is configured for zone redundancy and there's an availability zone outage.
- Detection and response: The Azure platform detects and responds to a failure in an availability zone. You don't need to initiate a zone failover.
Notification: Azure doesn't automatically notify you when a zone is down. However:
- You can use Azure Resource Health to monitor the health of an individual resource, and you can set up Resource Health alerts to notify you of problems.
- You can use Azure Service Health to understand the overall health of the service, including any zone failures, and you can set up Service Health alerts to notify you of problems.
Active requests: Any active requests connected through gateway VMs in the failing zone are terminated. Client applications should retry the requests by following the guidance for how to handle transient faults.
Expected data loss: Zone failures aren't expected to cause data loss because virtual network gateways don't store persistent customer data.
Expected downtime: During zone outages, connections might experience brief interruptions that typically last up to one minute as traffic is redistributed. Client applications should retry the requests by following the guidance for how to handle transient faults.
Traffic rerouting: The platform automatically distributes traffic to gateway VMs in healthy zones.
FastPath-enabled connections maintain optimized routing throughout the failover process, ensuring minimal effect on application performance.
- Traffic rerouting: Traffic automatically reroutes to the other gateway VM, which is in a different availability zone.
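Because interruptions during a zone failure typically last up to one minute, a client's retry schedule should keep trying for at least that long. The following sketch estimates how many retries a capped exponential schedule needs to span a given window. The function name and the default delays are assumptions for illustration, and the calculation ignores jitter and the time each attempt itself takes.

```python
def retries_to_cover(window_seconds, base_delay=1.0, max_delay=30.0):
    """Return (retry_count, delays) for a capped exponential schedule whose
    waits add up to at least window_seconds.

    Hypothetical helper: delays are base, 2*base, 4*base, ... capped at max_delay.
    """
    if window_seconds <= 0 or base_delay <= 0 or max_delay <= 0:
        raise ValueError("window_seconds, base_delay, and max_delay must be positive")
    delays = []
    total = 0.0
    while total < window_seconds:
        # Double the delay for each retry already scheduled, up to the cap.
        delay = min(max_delay, base_delay * 2 ** len(delays))
        delays.append(delay)
        total += delay
    return len(delays), delays
```

With the defaults, `retries_to_cover(60)` yields six retries with waits of 1, 2, 4, 8, 16, and 30 seconds. If you add random jitter to the delays, the waits are shorter on average, so allow more retries than this estimate.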
Zone recovery
When the affected availability zone recovers, Azure automatically restores any gateway VMs in the recovered zone, and returns to normal traffic distribution across all zones that the gateway uses.
Test for zone failures
The Azure platform manages traffic routing, failover, and failback for zone-redundant virtual network gateways. This feature is fully managed, so you don't need to initiate or validate availability zone failure processes.
Resilience to region-wide failures
A virtual network gateway is a single-region resource. If the region becomes unavailable, your gateway is also unavailable.
Note
You can use the Premium ExpressRoute SKU when you have Azure resources that are spread across multiple regions. However, the Premium SKU doesn't affect how your gateway is configured, and it's still deployed into one region. For more information, see What is Azure ExpressRoute?.
Custom multi-region solutions for resiliency
You can create independent connectivity paths to your Azure environment by using one or more of the following approaches:
- Create multiple ExpressRoute circuits, which connect to gateways in different Azure regions.
- Use a site-to-site VPN as a backup for private peering traffic.
- Use Internet connectivity as a backup for Microsoft peering traffic.
For detailed guidance, see Designing for disaster recovery with ExpressRoute private peering.
You can deploy separate VPN gateways in two or more different regions. However, each gateway is attached to a different virtual network, and the gateways operate independently. There's no interaction between them, and no replication of configuration or state. You're also responsible for configuring your clients and remote devices to connect to the correct VPN, or to switch between VPNs as required.
Resilience to service maintenance
Azure performs regular maintenance on virtual network gateways to ensure optimal performance and security. During these maintenance windows, some service disruptions can occur, but Azure designs these activities to minimize the effect on your connectivity.
During planned maintenance on virtual network gateways, Azure updates the gateway VMs one at a time, never simultaneously. This process ensures that at least one gateway VM is always active during maintenance, which minimizes the impact on your active connections.
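The sequential behavior described above can be modeled with a toy example. This is not how Azure implements maintenance. It's a hypothetical sketch, with made-up VM names, that shows why taking one gateway VM out of service at a time always leaves at least one VM available.

```python
def rolling_maintenance(vm_names):
    """Yield (vm_under_maintenance, vms_in_service) for each maintenance step.

    Hypothetical model: exactly one gateway VM is out of service per step.
    """
    if len(vm_names) < 2:
        raise ValueError("need at least two gateway VMs to keep one in service")
    for vm in vm_names:
        # Every VM except the one being updated keeps processing traffic.
        in_service = [name for name in vm_names if name != vm]
        yield vm, in_service


for vm, in_service in rolling_maintenance(["gateway-vm-0", "gateway-vm-1"]):
    print(f"updating {vm}; still serving traffic: {in_service}")
```

Connections that were handled by the VM under maintenance still need to be re-established on another VM, which is why client-side retry logic remains important.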
You can configure gateway maintenance windows to align with your operational requirements, reducing the likelihood of unexpected disruptions.
For more information, see Configure customer-controlled maintenance for ExpressRoute gateways.
Service level agreement
The service-level agreement (SLA) for Azure services describes the expected availability of each service and the conditions that your solution must meet to achieve that availability expectation. For more information, see SLAs for online services.
ExpressRoute provides a strong availability SLA that guarantees high uptime for your connections. Different availability SLAs apply if you deploy across multiple peering locations (sites), if you use ExpressRoute Metro, or if you have a single-site configuration.
All VPN Gateway SKUs other than Basic are eligible for a higher availability SLA. The Basic SKU provides a lower availability SLA and limited capabilities, and should only be used for testing and development. For more information, see Gateway SKUs - Production vs. Dev-Test workloads.