Disaster recovery using Azure DNS and Traffic Manager

Disaster recovery focuses on recovering from a severe loss of application functionality. To choose a disaster recovery solution, business and technology owners must first determine the level of functionality that is required during a disaster: unavailable, partially available with reduced functionality, available after a delay, or fully available. Most enterprise customers choose a multi-region architecture for resiliency against application-level or infrastructure-level failures. Customers can choose among several approaches to achieve failover and high availability through a redundant architecture. Here are some of the popular approaches:

  • Active/Passive with cold standby: In this failover solution, the VMs and other appliances in the standby region are not active until there is a need for failover. However, the production environment is replicated to a different region in the form of backups, VM images, or Resource Manager templates. This failover mechanism is cost-effective but takes longer to complete a full failover.

    Figure: Active/Passive with cold standby disaster recovery configuration

  • Active/Passive with pilot light: In this failover solution, the standby environment is set up with a minimal configuration. Only the services needed to support a minimal, critical set of applications are running. In its native form, this scenario can execute only minimal functionality, but it can scale up and spawn additional services to take the bulk of the production load if a failover occurs.

    Figure: Active/Passive with pilot light disaster recovery configuration

  • Active/Passive with warm standby: In this failover solution, the standby region is pre-warmed and ready to take the base load, autoscaling is turned on, and all the instances are up and running. This solution is not scaled to take the full production load, but it is functional and all services are running. It is an augmented version of the pilot light approach.

    Figure: Active/Passive with warm standby disaster recovery configuration

Planning your disaster recovery architecture

There are two technical aspects to setting up your disaster recovery architecture:

  • Using a deployment mechanism to replicate instances, data, and configurations between the primary and standby environments. This type of disaster recovery can be done natively via Azure Site Recovery or via Microsoft Azure partner appliances/services such as Veritas or NetApp.
  • Developing a solution to divert network/web traffic from the primary site to the standby site. This type of disaster recovery can be achieved via Azure DNS, Azure Traffic Manager (DNS), or third-party global load balancers.

This article is limited to approaches based on network and web traffic redirection. For instructions to set up Azure Site Recovery, see the Azure Site Recovery documentation. DNS is one of the most efficient mechanisms to divert network traffic because DNS is often global, external to the data center, and insulated from regional or availability zone (AZ) level failures. In Azure, two DNS services can provide DNS-based failover, each in its own way: Azure DNS (authoritative DNS) and Azure Traffic Manager (DNS-based smart traffic routing).

It is important to understand a few DNS concepts that are used extensively in the solutions discussed in this article:

  • DNS A record - A records are pointers that map a domain name to an IPv4 address.
  • CNAME or canonical name - This record type points to another DNS record. A CNAME does not respond with an IP address but with a pointer to the record that contains the IP address.
  • Weighted routing - You can associate a weight with each service endpoint and then distribute traffic based on the assigned weights. This routing method is one of the traffic routing methods available within Traffic Manager. For more information, see Weighted routing method.
  • Priority routing - Priority routing is based on health checks of endpoints. By default, Azure Traffic Manager sends all traffic to the highest-priority endpoint, and upon a failure or disaster, Traffic Manager routes the traffic to the secondary endpoint. For more information, see Priority routing method.
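
The two routing concepts above can be illustrated with a small Python sketch. It is only a model of the selection logic, not Traffic Manager's implementation: the `Endpoint` class, both function names, and the 80/20 weights are hypothetical, and endpoint health is set by hand rather than by real probes.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int         # lower number = higher priority (1 is the primary)
    weight: int           # relative share of traffic under weighted routing
    healthy: bool = True  # set by hand here; real health comes from probes


def pick_by_priority(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


def pick_by_weight(endpoints, rng):
    """Pick one healthy endpoint at random, in proportion to its weight."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return rng.choices(healthy, weights=[e.weight for e in healthy], k=1)[0]


# Hypothetical endpoints and weights, reusing the host names from this article.
endpoints = [
    Endpoint("prod.contoso.com", priority=1, weight=80),
    Endpoint("dr.contoso.com", priority=2, weight=20),
]
print(pick_by_priority(endpoints).name)   # primary while it is healthy
endpoints[0].healthy = False              # simulate a failure of the primary
print(pick_by_priority(endpoints).name)   # traffic moves to the secondary
```

Priority routing is the method used later in this article for automatic failover; weighted routing suits cases where both sites should serve traffic at the same time.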

Manual failover using Azure DNS

The Azure DNS manual failover solution for disaster recovery uses the standard DNS mechanism to fail over to the backup site. The manual option via Azure DNS works best when used in conjunction with the cold standby or the pilot light approach.

Figure: Manual failover using Azure DNS

The assumptions made for the solution are:

  • Both primary and secondary endpoints have static IPs that don't change often. Say the IP for the primary site is 100.168.124.44 and the IP for the secondary site is 100.168.124.43.
  • An Azure DNS zone exists for both the primary and secondary site. Say the endpoint for the primary site is prod.contoso.com and the endpoint for the backup site is dr.contoso.com. A DNS record for the main application, www.contoso.com, also exists.
  • The TTL is at or below the RTO SLA set in the organization. For example, if an enterprise sets the RTO of the application disaster response to 60 minutes, the TTL should be less than 60 minutes, and the lower the better.

You can set up Azure DNS for manual failover as follows:

  1. Create a DNS zone
  2. Create DNS zone records
  3. Update the CNAME record

Step 1: Create a DNS zone

Create a DNS zone (for example, contoso.com, the zone that hosts www.contoso.com) as shown below:

Figure: Create a DNS zone in Azure

Step 2: Create DNS zone records

Within this zone, create three records (for example, www.contoso.com, prod.contoso.com, and dr.contoso.com) as shown below.

Figure: Create DNS zone records in Azure

In this scenario, the site www.contoso.com has a TTL of 30 minutes, which is well below the stated RTO, and points to the production site prod.contoso.com. This is the configuration during normal business operations. The TTL of prod.contoso.com and dr.contoso.com is set to 300 seconds (5 minutes). To monitor or detect application-level or virtual infrastructure-level failures, you can use an Azure monitoring service such as Azure Monitor or Azure Application Insights, a partner monitoring solution such as Dynatrace, or even a home-grown solution.

Step 3: Update the CNAME record

Once a failure is detected, change the value of the www.contoso.com CNAME record to point to dr.contoso.com as shown below:

Figure: Update the CNAME record in Azure

Within 30 minutes, the period in which most resolvers refresh the cached record, queries to www.contoso.com are redirected to dr.contoso.com. You can also run the following Azure CLI command to change the CNAME value:

  az network dns record-set cname set-record \
  --resource-group 123 \
  --zone-name contoso.com \
  --record-set-name www \
  --cname dr.contoso.com

This step can be executed manually or via automation. It can be done manually via the Azure portal or the Azure CLI. The Azure SDK and API can be used to automate the CNAME update so that no manual intervention is required. Automation can be built with Azure Functions, within a third-party monitoring application, or even from on-premises.
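
As one possible way to script this step, the Python sketch below wraps the Azure CLI command shown earlier. It assumes the `az` CLI is installed and already signed in on the machine that runs it; the function names and the `dry_run` switch are hypothetical, and the resource group value `123` is simply the placeholder used in this article.

```python
import subprocess


def build_cname_failover_command(resource_group, zone_name, record_set_name, target):
    """Build the argument list for the CNAME update command shown in this article."""
    return [
        "az", "network", "dns", "record-set", "cname", "set-record",
        "--resource-group", resource_group,
        "--zone-name", zone_name,
        "--record-set-name", record_set_name,
        "--cname", target,
    ]


def fail_over(resource_group, zone_name, record_set_name, target, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    command = build_cname_failover_command(
        resource_group, zone_name, record_set_name, target
    )
    if not dry_run:
        # Assumption: `az` is on PATH and authenticated. check=True raises
        # CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


# Dry run: print the command that would point www.contoso.com at the DR site.
print(" ".join(fail_over("123", "contoso.com", "www", "dr.contoso.com")))
```

Keeping the default at `dry_run=True` is a deliberate choice in this sketch, so that a monitoring hook cannot flip production DNS until the operator opts in.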

How manual failover works using Azure DNS

Since the DNS server is outside the failover or disaster zone, it is insulated against any downtime. This lets you architect a simple, cost-effective failover scenario that works as long as the operator has network connectivity during the disaster and can make the flip. If the solution is scripted, ensure that the server or service running the script is insulated against the problem affecting the production environment. Also keep the TTL set on the zone low, so that no resolver around the world keeps the endpoint cached for long and customers can access the site within the RTO. For cold standby and pilot light, some prewarming and other administrative activity may be required, so allow enough time before making the flip.
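
To see how the TTL and the preparation time fit inside the RTO, the following Python sketch adds up a worst-case timeline. The additive model and every input except the 30-minute TTL and the 60-minute RTO are assumptions made for illustration; they are not figures from this article.

```python
def manual_failover_minutes(detection, standby_preparation, operator_action, cname_ttl):
    """Worst-case minutes until clients reach the standby site.

    Simplifying assumption: the phases happen one after another, and every
    resolver holds the old CNAME for the full TTL after the flip.
    """
    return detection + standby_preparation + operator_action + cname_ttl


def meets_rto(total_minutes, rto_minutes):
    """True when the estimated recovery time fits within the RTO."""
    return total_minutes <= rto_minutes


# Hypothetical inputs: 5 min to detect, 15 min to prewarm the standby,
# 5 min for the operator to update the record, plus the 30-minute TTL.
total = manual_failover_minutes(
    detection=5, standby_preparation=15, operator_action=5, cname_ttl=30
)
print(total, meets_rto(total, rto_minutes=60))
```

With these inputs the estimate is 55 minutes, which leaves little slack under a 60-minute RTO. That is why a lower TTL is preferable.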

Automatic failover using Azure Traffic Manager

When you have complex architectures and multiple sets of resources capable of performing the same function, you can configure Azure Traffic Manager (based on DNS) to check the health of your resources and route traffic from the unhealthy resource to the healthy resource. In the following example, both the primary region and the secondary region have a full deployment. This deployment includes the cloud services and a synchronized database.

Figure: Automatic failover using Azure Traffic Manager

However, only the primary region actively handles network requests from users. The secondary region becomes active only when the primary region experiences a service disruption. In that case, all new network requests are routed to the secondary region. Because the database backup is near instantaneous, both load balancers have IPs that can be health checked, and the instances are always up and running, this topology allows a low RTO and failover without any manual intervention. The secondary failover region must be ready to go live immediately after the primary region fails. This scenario is ideal for Azure Traffic Manager, which has built-in probes for various types of health checks, including HTTP, HTTPS, and TCP. Azure Traffic Manager also has a rule engine that can be configured to fail over when a failure occurs, as described below.

Consider the following solution using Traffic Manager:

  • The customer has a Region 1 endpoint, prod.contoso.com, with the static IP 100.168.124.44, and a Region 2 endpoint, dr.contoso.com, with the static IP 100.168.124.43.
  • Each of these environments is fronted by a public-facing resource such as a load balancer. The load balancer can be configured to have a DNS-based endpoint or a fully qualified domain name (FQDN) as shown above.
  • All the instances in Region 2 are in near real-time replication with Region 1. Furthermore, the machine images are up to date, and all software and configuration data is patched and in line with Region 1.
  • Autoscaling is preconfigured.

The steps to configure failover with Azure Traffic Manager are as follows:

  1. Create a new Azure Traffic Manager profile
  2. Create endpoints within the Traffic Manager profile
  3. Set up health check and failover configuration

Step 1: Create a new Azure Traffic Manager profile

Create a new Azure Traffic Manager profile with the name contoso123 and select Priority as the routing method. If you have an existing resource group that you want to associate the profile with, select it; otherwise, create a new resource group.

Figure: Create a Traffic Manager profile

Step 2: Create endpoints within the Traffic Manager profile

In this step, you create endpoints that point to the production and disaster recovery sites. Here, choose External endpoint as the Type; if the resource is hosted in Azure, you can choose Azure endpoint instead. If you choose Azure endpoint, select a Target resource that is either an App Service or a Public IP allocated by Azure. The priority is set to 1 because this is the primary service, in Region 1. Similarly, create the disaster recovery endpoint within Traffic Manager.

Figure: Create disaster recovery endpoints

Step 3: Set up health check and failover configuration

In this step, you set the DNS TTL to 10 seconds, which is honored by most internet-facing recursive resolvers. With this configuration, resolvers that honor the TTL will not cache the information for more than 10 seconds. For the endpoint monitor settings, the path is currently set to / (root), but you can customize the endpoint settings to evaluate another path, for example, prod.contoso.com/index. The example below shows HTTPS as the probing protocol. However, you can choose HTTP or TCP as well. The choice of protocol depends on the end application.

The probing interval is set to 10 seconds, which enables fast probing, and the retry count is set to 3. As a result, Traffic Manager fails over to the second endpoint if three consecutive intervals register a failure. The following formula defines the total time for an automated failover:

    Time for failover = TTL + Retry * Probing interval

In this case, the value is 10 + 3 * 10 = 40 seconds (maximum). If the retry count is set to 1 and the TTL is set to 10 seconds, the time for failover is 10 + 1 * 10 = 20 seconds. Set the retry count to a value greater than 1 to reduce the chance of failovers caused by false positives or minor network blips.
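
The failover-time formula above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function and parameter names are illustrative and are not Traffic Manager settings.

```python
def failover_seconds(ttl, retries, probe_interval):
    """Maximum automated failover time: TTL + Retry * Probing interval."""
    return ttl + retries * probe_interval


# The two configurations discussed in this step.
print(failover_seconds(ttl=10, retries=3, probe_interval=10))  # 40
print(failover_seconds(ttl=10, retries=1, probe_interval=10))  # 20
```

Trying other values shows the trade-off directly: fewer retries shorten the failover but make a spurious failover more likely.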

Figure: Set up health check and failover configuration

How automatic failover works using Traffic Manager

During a disaster, the primary endpoint is probed and its status changes to Degraded, while the disaster recovery site remains Online. By default, Traffic Manager sends all traffic to the primary (highest-priority) endpoint. If the primary endpoint appears degraded, Traffic Manager routes the traffic to the second endpoint as long as it remains healthy. You can configure more endpoints within Traffic Manager to serve as additional failover endpoints or to share the load between endpoints.

Next steps