Designing for high availability with ExpressRoute

ExpressRoute is designed for high availability to provide carrier-grade private network connectivity to Microsoft resources. In other words, there is no single point of failure in the ExpressRoute path within the Microsoft network. To maximize availability, the customer and service provider segments of your ExpressRoute circuit should also be architected for high availability. This article first looks at network architecture considerations for building robust network connectivity using ExpressRoute, and then at the fine-tuning features that help improve the high availability of your ExpressRoute circuit.


The concepts described in this article apply equally whether an ExpressRoute circuit is created under Virtual WAN or outside of it.

Architecture considerations

The following figure illustrates the recommended way to connect using an ExpressRoute circuit to maximize its availability.


For high availability, it's essential to maintain the redundancy of the ExpressRoute circuit throughout the end-to-end network. In other words, you need to maintain redundancy within your on-premises network and must not compromise redundancy within your service provider network. At a minimum, maintaining redundancy means avoiding single points of failure in the network. Redundant power and cooling for the network devices further improve high availability.
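One way to make the "no single point of failure" check concrete is to list the components on the primary and secondary paths and look for anything they share. The following is a minimal sketch under that assumption; the device names are hypothetical and the model ignores everything except which components each path traverses.

```python
def shared_components(primary_path, secondary_path):
    """Return the components that appear on both paths, in primary-path order.

    Any component returned here is a single point of failure for the pair
    of connections, because losing it takes down both paths at once.
    """
    secondary = set(secondary_path)
    return [component for component in primary_path if component in secondary]


# Hypothetical device names, for illustration only.
redundant = shared_components(
    ["cpe-1", "provider-edge-1", "msee-1"],
    ["cpe-2", "provider-edge-2", "msee-2"],
)
compromised = shared_components(
    ["cpe-1", "provider-edge-1", "msee-1"],
    ["cpe-1", "provider-edge-2", "msee-2"],
)

print(redundant)    # []
print(compromised)  # ['cpe-1']
```

An empty result means the two paths are disjoint in this simplified model. A non-empty result names the devices whose failure would interrupt both connections.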

First mile physical layer design considerations

If you terminate both the primary and secondary connections of an ExpressRoute circuit on the same customer premises equipment (CPE), you're compromising high availability within your on-premises network. Additionally, if you configure both the primary and secondary connections via the same port of a CPE (either by terminating the two connections under different subinterfaces or by merging the two connections within the partner network), you're forcing the partner to compromise high availability on their network segment as well. This compromise is illustrated in the following figure.


On the other hand, if you terminate the primary and secondary connections of an ExpressRoute circuit in different geographical locations, you could be compromising the network performance of the connectivity. If traffic is actively load balanced across primary and secondary connections that terminate in different geographical locations, a potentially substantial difference in network latency between the two paths would result in suboptimal network performance.

For geo-redundant design considerations, see Designing for disaster recovery with ExpressRoute.

Active-active connections

The Microsoft network is configured to operate the primary and secondary connections of ExpressRoute circuits in active-active mode. However, through your route advertisements, you can force the redundant connections of an ExpressRoute circuit to operate in active-passive mode. Advertising more specific routes and BGP AS path prepending are the common techniques used to make one path preferred over the other.
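To see why these two techniques steer traffic onto one connection, consider a deliberately simplified route selection: longest prefix match first, then shortest AS path. The sketch below is an illustration only, not the full BGP decision process, and the prefixes, AS number 65010, and link names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # prefix advertised from on-premises
    as_path: tuple   # AS path as received
    link: str        # connection the advertisement arrived on


def preferred_links(routes, destination):
    """Links chosen for a destination: longest prefix, then shortest AS path."""
    dest = ipaddress.ip_address(destination)
    matching = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not matching:
        return []
    longest = max(ipaddress.ip_network(r.prefix).prefixlen for r in matching)
    candidates = [
        r for r in matching
        if ipaddress.ip_network(r.prefix).prefixlen == longest
    ]
    shortest = min(len(r.as_path) for r in candidates)
    return sorted({r.link for r in candidates if len(r.as_path) == shortest})


# Identical advertisements on both connections: active-active.
active_active = [
    Route("10.1.0.0/16", (65010,), "primary"),
    Route("10.1.0.0/16", (65010,), "secondary"),
]

# AS path prepending on the secondary: the primary is preferred.
prepended = [
    Route("10.1.0.0/16", (65010,), "primary"),
    Route("10.1.0.0/16", (65010, 65010, 65010), "secondary"),
]

# More specific routes on the primary: the primary is preferred.
more_specific = [
    Route("10.1.0.0/17", (65010,), "primary"),
    Route("10.1.128.0/17", (65010,), "primary"),
    Route("10.1.0.0/16", (65010,), "secondary"),
]

print(preferred_links(active_active, "10.1.1.1"))   # ['primary', 'secondary']
print(preferred_links(prepended, "10.1.1.1"))       # ['primary']
print(preferred_links(more_specific, "10.1.1.1"))   # ['primary']
```

In both active-passive variants the secondary advertisement is still present, so removing the primary routes from the list leaves the secondary as the only candidate.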

To improve high availability, it's recommended to operate both connections of an ExpressRoute circuit in active-active mode. If you let the connections operate in active-active mode, the Microsoft network load balances the traffic across the connections on a per-flow basis.

Running the primary and secondary connections of an ExpressRoute circuit in active-passive mode carries the risk of both connections failing following a failure in the active path. The common causes of failure on switchover are a lack of active management of the passive connection and the passive connection advertising stale routes.

By contrast, running the primary and secondary connections of an ExpressRoute circuit in active-active mode results in only about half the flows failing and being rerouted following an ExpressRoute connection failure. Thus, active-active mode significantly helps improve the mean time to recover (MTTR).
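The "about half the flows" figure follows from per-flow load balancing: each flow is pinned to one connection, and flows spread roughly evenly across the two. The sketch below assumes a hash over the flow's 5-tuple picks the connection. The actual hashing used in the Microsoft network is not described in this article, so the function and the sample flows are hypothetical.

```python
import hashlib

LINKS = ("primary", "secondary")


def pick_link(flow, links=LINKS):
    """Pin a flow to one link by hashing its 5-tuple (assumed scheme)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]


# 1,000 hypothetical flows: (src IP, src port, dst IP, dst port, protocol).
flows = [
    (f"10.0.0.{i % 250 + 1}", 40000 + i, "203.0.113.10", 443, "tcp")
    for i in range(1000)
]

on_primary = sum(1 for flow in flows if pick_link(flow) == "primary")
affected_active_active = on_primary / len(flows)  # share hit if primary fails
affected_active_passive = 1.0                     # every flow is on the active path

print(f"active-active: {affected_active_active:.0%} of flows affected")
print(f"active-passive: {affected_active_passive:.0%} of flows affected")
```

Because the hash depends only on the flow's own fields, packets of one flow always take the same connection, and only the flows pinned to the failed connection need to be rerouted.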

NAT for Microsoft peering

Microsoft peering is designed for communication between public endpoints. On-premises private endpoints are therefore commonly network address translated (NATed) to public IP addresses on the customer or partner network before they communicate over Microsoft peering. Assuming you use both the primary and secondary connections in active-active mode, where and how you NAT affects how quickly you recover following a failure in one of the ExpressRoute connections. Two different NAT options are illustrated in the following figure:


In option 1, NAT is applied after the traffic is split between the primary and secondary connections of the ExpressRoute circuit. To meet the stateful requirements of NAT, independent NAT pools are used on the primary and secondary devices so that return traffic arrives at the same edge device through which the flow egressed.

In option 2, a common NAT pool is used before the traffic is split between the primary and secondary connections of the ExpressRoute circuit. Note that placing a common NAT pool before the traffic split does not mean introducing a single point of failure and thereby compromising high availability.

With option 1, following an ExpressRoute connection failure, the corresponding NAT pool can no longer be reached. Therefore, all the broken flows have to be re-established, either by TCP or by the application layer, after the corresponding window timeout. If either NAT pool is used to front-end any on-premises servers and the corresponding connectivity fails, those servers cannot be reached from Azure until the connectivity is fixed.

With option 2, by contrast, the NAT remains reachable even after a primary or secondary connection failure. Therefore, the network layer itself can reroute the packets, which helps faster recovery following the failure.
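The difference between the two options can be sketched as a reachability question: after one connection fails, is there still a connection over which return traffic can reach the flow's NAT address? The model below is a simplification that assumes each independent pool in option 1 is reachable only over its own connection, while the common pool in option 2 is reachable over both. All names are hypothetical.

```python
from dataclasses import dataclass

LINKS = {"primary", "secondary"}


@dataclass(frozen=True)
class Flow:
    flow_id: int
    egress_link: str  # "primary" or "secondary"


def pool_reachable_via(nat_option, egress_link):
    """Links over which return traffic can reach the flow's NAT address."""
    if nat_option == 1:
        # Independent pools: each pool is tied to its own connection.
        return {egress_link}
    if nat_option == 2:
        # Common pool applied before the split: reachable over either link.
        return set(LINKS)
    raise ValueError(f"unknown NAT option: {nat_option}")


def flows_to_reestablish(flows, nat_option, failed_link):
    """IDs of flows whose NAT address is unreachable after the failure."""
    broken = []
    for flow in flows:
        remaining = pool_reachable_via(nat_option, flow.egress_link) - {failed_link}
        if not remaining:
            broken.append(flow.flow_id)
    return broken


# Six hypothetical flows, alternating between the two connections.
flows = [Flow(i, "primary" if i % 2 == 0 else "secondary") for i in range(6)]

print(flows_to_reestablish(flows, nat_option=1, failed_link="primary"))  # [0, 2, 4]
print(flows_to_reestablish(flows, nat_option=2, failed_link="primary"))  # []
```

In this model, option 1 forces every flow on the failed connection to be re-established, whereas option 2 leaves no flow without a return path, so rerouting at the network layer is sufficient.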


If you use NAT option 1 (independent NAT pools for the primary and secondary ExpressRoute connections) and map a port of an IP address from one of the NAT pools to an on-premises server, the server will not be reachable via the ExpressRoute circuit when the corresponding connection fails.

Fine-tuning features for private peering

This section reviews optional features that help improve the high availability of your ExpressRoute circuit. Whether you need them depends on your Azure deployment and on how sensitive you are to MTTR. Specifically, it covers zone-aware deployment of ExpressRoute virtual network gateways and Bidirectional Forwarding Detection (BFD).

Availability Zone aware ExpressRoute virtual network gateways

An Availability Zone in an Azure region is a combination of a fault domain and an update domain. If you opt for a zone-redundant Azure IaaS deployment, you may also want to configure zone-redundant virtual network gateways that terminate ExpressRoute private peering. To learn more, see About zone-redundant virtual network gateways in Azure Availability Zones. To configure a zone-redundant virtual network gateway, see Create a zone-redundant virtual network gateway in Azure Availability Zones.

Improving failure detection time

ExpressRoute supports BFD over private peering. BFD reduces the time to detect a failure over the Layer 2 network between the Microsoft Enterprise Edge routers (MSEEs) and their BGP neighbors on the on-premises side from about 3 minutes (the default) to less than a second. Faster failure detection helps hasten failure recovery. To learn more, see Configure BFD over ExpressRoute.
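The two detection times can be reproduced with simple arithmetic. Without BFD, a silent failure is noticed when the BGP hold timer expires. With BFD, it is noticed after a small number of consecutive control packets are missed. The values below are assumptions chosen to match the figures above (a 180-second BGP hold time, a 300 ms BFD interval, and a detect multiplier of 3); check the timers actually negotiated on your own devices.

```python
def bgp_detection_seconds(hold_time_s=180):
    """Worst-case detection via BGP alone: the hold timer (assumed 180 s)."""
    return hold_time_s


def bfd_detection_seconds(interval_ms=300, multiplier=3):
    """Detection via BFD: negotiated interval times the detect multiplier.

    The 300 ms interval and multiplier of 3 are assumed values.
    """
    return interval_ms * multiplier / 1000


print(bgp_detection_seconds())  # 180
print(bfd_detection_seconds())  # 0.9
```

With these assumed timers, detection drops from 180 seconds to 0.9 seconds, which is consistent with the "about 3 minutes" and "less than a second" figures.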

Next steps

This article discussed how to design ExpressRoute circuit connectivity for high availability. An ExpressRoute circuit peering point is pinned to a geographical location and could therefore be affected by a catastrophic failure that impacts the entire location.

For design considerations on building geo-redundant network connectivity to the Microsoft backbone that can withstand catastrophic failures affecting an entire region, see Designing for disaster recovery with ExpressRoute private peering.