Best practices for business continuity and disaster recovery in Azure Kubernetes Service (AKS)

As you manage clusters in Azure Kubernetes Service (AKS), application uptime becomes important. By default, AKS provides high availability by using multiple nodes in a Virtual Machine Scale Set (VMSS). But these multiple nodes don't protect your system from a region failure. To maximize your uptime, plan ahead to maintain business continuity and prepare for disaster recovery.

This article focuses on how to plan for business continuity and disaster recovery in AKS. You learn how to:

  • Plan for AKS clusters in multiple regions.
  • Route traffic across multiple clusters by using Azure Traffic Manager.
  • Use geo-replication for your container image registries.
  • Plan for application state across multiple clusters.
  • Replicate storage across multiple regions.

Plan for multiregion deployment

Best practice: When you deploy multiple AKS clusters, choose regions where AKS is available, and use paired regions.

An AKS cluster is deployed into a single region. To protect your system from region failure, deploy your application into multiple AKS clusters across different regions. When you plan where to deploy your AKS clusters, consider:

  • AKS region availability: Choose regions close to your users. AKS continually expands into new regions.
  • Azure paired regions: For your geographic area, choose two regions that are paired with each other. Paired regions coordinate platform updates and prioritize recovery efforts where needed.
  • Service availability: Decide whether your paired regions should be hot/hot, hot/warm, or hot/cold. Do you want to run both regions at the same time, with one region ready to start serving traffic? Or do you want one region to have time to get ready to serve traffic?

AKS region availability and paired regions are a joint consideration. Deploy your AKS clusters into paired regions that are designed to manage region disaster recovery together. For example, AKS is available in China East 2 and China North 2. These regions are paired. Choose these two regions when you're creating an AKS BC/DR strategy.

When you deploy your application, add another step to your CI/CD pipeline to deploy to these multiple AKS clusters. If you don't update your deployment pipelines, applications might be deployed into only one of your regions and AKS clusters. Customer traffic that's directed to a secondary region then won't receive the latest code updates.
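
The extra pipeline step can be sketched as a loop over cluster contexts. This is a minimal sketch, not a definitive pipeline: the context names, the manifest path, and the `build_deploy_commands` helper are hypothetical, and it assumes `kubectl` is installed with one kubeconfig context per regional AKS cluster.

```python
from typing import List

# Hypothetical kubeconfig context names, one per regional AKS cluster.
CLUSTER_CONTEXTS = ["aks-primary", "aks-secondary"]


def build_deploy_commands(contexts: List[str], manifest: str) -> List[List[str]]:
    """Return one `kubectl apply` command per cluster context."""
    if not contexts:
        raise ValueError("at least one cluster context is required")
    return [
        ["kubectl", "--context", ctx, "apply", "-f", manifest]
        for ctx in contexts
    ]


if __name__ == "__main__":
    # Print the commands; a real pipeline step would execute each one,
    # for example with subprocess.run(cmd, check=True), so that a failed
    # rollout in any region fails the whole step.
    for cmd in build_deploy_commands(CLUSTER_CONTEXTS, "app.yaml"):
        print(" ".join(cmd))
```

Generating the commands from a single list of contexts keeps every region on the same manifest, so a cluster can't silently fall behind when a new region is added.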

Use Azure Traffic Manager to route traffic

Best practice: Azure Traffic Manager can direct customers to their closest AKS cluster and application instance. For the best performance and redundancy, direct all application traffic through Traffic Manager before it goes to your AKS cluster.

If you have multiple AKS clusters in different regions, use Traffic Manager to control how traffic flows to the applications that run in each cluster. Azure Traffic Manager is a DNS-based traffic load balancer that can distribute network traffic across regions. Use Traffic Manager to route users based on cluster response time or based on geography.

[Figure: AKS with Traffic Manager]

Customers who have a single AKS cluster typically connect to the service IP or DNS name of a given application. In a multicluster deployment, customers should connect to a Traffic Manager DNS name that points to the services on each AKS cluster. Define these services by using Traffic Manager endpoints. Each endpoint is the service load balancer IP. Use this configuration to direct network traffic from the Traffic Manager endpoint in one region to the endpoint in a different region.

[Figure: Geographic routing through Traffic Manager]

Traffic Manager performs DNS lookups and returns a user's most appropriate endpoint. Nested profiles can prioritize a primary location. For example, users should generally connect to their closest geographic region. If that region has a problem, Traffic Manager instead directs the users to a secondary region. This approach ensures that customers can connect to an application instance even if their closest geographic region is unavailable.
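
The failover behavior described above can be modeled in a few lines. This is a simplified illustration under stated assumptions, not Traffic Manager's actual algorithm: the `Endpoint` type, the `pick_endpoint` function, the latency figures, and the IP addresses (taken from documentation-reserved ranges) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str          # hypothetical label for a regional AKS service
    target_ip: str     # service load balancer IP (documentation-range example)
    latency_ms: float  # latency as seen from the user's location
    healthy: bool      # result of the most recent health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the lowest-latency healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms, default=None)


if __name__ == "__main__":
    endpoints = [
        Endpoint("closest-region", "203.0.113.10", 20.0, healthy=False),
        Endpoint("secondary-region", "198.51.100.20", 45.0, healthy=True),
    ]
    chosen = pick_endpoint(endpoints)
    print(chosen.name if chosen else "no endpoint available")
```

Because the closest region is marked unhealthy in this example, the secondary region is selected, which mirrors the failover described in the text.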

For information on how to set up endpoints and routing, see Configure the geographic traffic routing method by using Traffic Manager.

Interconnect regions with global virtual network peering

If the clusters need to talk to each other, you can connect their virtual networks through virtual network peering. This technology interconnects virtual networks and provides high bandwidth across Azure's backbone network, even across different geographic regions.

To peer the virtual networks where AKS clusters run, the clusters must use the standard Load Balancer so that Kubernetes services are reachable across the virtual network peering.
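
One further check is worth making before you peer two cluster networks. It isn't stated above, but virtual network peering generally requires that the two address spaces don't overlap. The following minimal sketch uses Python's standard `ipaddress` module; the CIDR ranges and the `address_spaces_overlap` helper are hypothetical examples.

```python
import ipaddress


def address_spaces_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR ranges share any addresses."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


if __name__ == "__main__":
    # Hypothetical address spaces for the two regional virtual networks.
    primary_vnet = "10.1.0.0/16"
    secondary_vnet = "10.2.0.0/16"
    if address_spaces_overlap(primary_vnet, secondary_vnet):
        print("Address spaces overlap; choose different ranges before peering.")
    else:
        print("Address spaces are disjoint.")
```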

Enable geo-replication for container images

Best practice: Store your container images in Azure Container Registry and geo-replicate the registry to each AKS region.

To deploy and run your applications in AKS, you need a way to store and pull the container images. Container Registry integrates with AKS, so it can securely store your container images or Helm charts. Container Registry supports multimaster geo-replication to automatically replicate your images to Azure regions around the world.

To improve performance and availability, use Container Registry geo-replication to create a registry in each region where you have an AKS cluster. Each AKS cluster then pulls container images from the local container registry in the same region:

[Figure: Container Registry geo-replication for container images]

When you use Container Registry geo-replication to pull images from the same region, the results are:

  • Faster: You pull images from high-speed, low-latency network connections within the same Azure region.
  • More reliable: If a region is unavailable, your AKS cluster pulls the images from an available container registry.
  • Cheaper: There's no network egress charge between datacenters.

Geo-replication is a feature of Premium SKU container registries. For information on how to configure geo-replication, see Container Registry geo-replication.
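
To keep replicas aligned with your clusters, you can compare the regions that host AKS clusters with the regions that already hold a registry replica. The sketch below only builds `az acr replication create` commands (using that command's `--registry` and `--location` options) and doesn't run them. The registry name, the region lists, and both helper functions are hypothetical.

```python
from typing import Iterable, List


def missing_replications(aks_regions: Iterable[str],
                         replica_regions: Iterable[str]) -> List[str]:
    """Return AKS regions that have no registry replica yet, sorted by name."""
    have = {r.lower() for r in replica_regions}
    return sorted({r.lower() for r in aks_regions} - have)


def replication_commands(registry: str, regions: Iterable[str]) -> List[List[str]]:
    """Build one `az acr replication create` command per region."""
    return [
        ["az", "acr", "replication", "create",
         "--registry", registry, "--location", region]
        for region in regions
    ]


if __name__ == "__main__":
    # Assumption: the registry's home region already counts as a replica.
    aks_regions = ["chinaeast2", "chinanorth2"]
    replica_regions = ["chinaeast2"]
    todo = missing_replications(aks_regions, replica_regions)
    for cmd in replication_commands("myregistry", todo):
        print(" ".join(cmd))
```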

Remove service state from inside containers

Best practice: Where possible, don't store service state inside the container. Instead, use an Azure platform as a service (PaaS) that supports multiregion replication.

Service state refers to the in-memory or on-disk data that a service requires to function. State includes the data structures and member variables that the service reads and writes. Depending on how the service is architected, the state might also include files or other resources that are stored on the disk. For example, the state might include the files a database uses to store data and transaction logs.

State can be either externalized or colocated with the code that manipulates the state. Typically, you externalize state by using a database or other data store that runs on different machines over the network or that runs out of process on the same machine.

Containers and microservices are most resilient when the processes that run inside them don't retain state. Because applications almost always contain some state, use a PaaS solution such as Azure Cosmos DB, Azure Database for PostgreSQL, Azure Database for MySQL, or Azure SQL Database.
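
Externalizing state can be sketched as a handler that keeps nothing between calls and receives its data store as a parameter. This is an illustrative sketch only: the `StateStore` interface, the `InMemoryStore` stand-in, and the `record_visit` handler are hypothetical, and a real service would replace the stand-in with a client for a managed database.

```python
from typing import Dict, Protocol


class StateStore(Protocol):
    """Minimal interface that the handler expects from an external store."""

    def get(self, key: str) -> int: ...

    def put(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Stand-in for an external database, used here only for demonstration."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        return self._data.get(key, 0)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


def record_visit(store: StateStore, user_id: str) -> int:
    """Stateless handler: all state lives in the store, none in the process.

    Note: this read-modify-write is not atomic; a real store would need an
    atomic increment or a transaction.
    """
    count = store.get(user_id) + 1
    store.put(user_id, count)
    return count


if __name__ == "__main__":
    shared = InMemoryStore()
    # Two calls stand in for two container replicas sharing one store.
    print(record_visit(shared, "user-1"))
    print(record_visit(shared, "user-1"))
```

Because the handler holds no data of its own, any replica in any region can serve the next request as long as it can reach the store.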

To build portable applications, see the following guidelines:

Create a storage migration plan

Best practice: If you use Azure Storage, prepare and test how to migrate your storage from the primary region to the backup region.

Your applications might use Azure Storage for their data. Because your applications are spread across multiple AKS clusters in different regions, you need to keep the storage synchronized. Here are two common ways to replicate storage:

  • Infrastructure-based asynchronous replication
  • Application-based asynchronous replication

Infrastructure-based asynchronous replication

Your applications might require persistent storage even after a pod is deleted. In Kubernetes, you can use persistent volumes to persist data storage. Persistent volumes are mounted to a node VM and then exposed to the pods. Persistent volumes follow pods even if the pods are moved to a different node inside the same cluster.
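
As an illustration of how a pod requests such storage, the sketch below builds a PersistentVolumeClaim manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The claim name, the size, and the `managed-csi` storage class are assumptions; substitute a storage class that exists in your cluster.

```python
import json
from typing import Any, Dict


def pvc_manifest(name: str, size_gib: int, storage_class: str) -> Dict[str, Any]:
    """Build a PersistentVolumeClaim manifest for a single-node read/write volume."""
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim name and size; "managed-csi" is an assumed class name.
    print(json.dumps(pvc_manifest("app-data", 5, "managed-csi"), indent=2))
```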

The replication strategy you use depends on your storage solution. Common storage solutions such as Gluster, Ceph, Rook, and Portworx provide their own guidance about disaster recovery and replication.

The typical strategy is to provide a common storage point where applications can write their data. This data is then replicated across regions and accessed locally.

[Figure: Infrastructure-based asynchronous replication]

If you use Azure Managed Disks, you can choose replication and DR solutions such as these:

Application-based asynchronous replication

Kubernetes currently provides no native implementation for application-based asynchronous replication. Because containers and Kubernetes are loosely coupled, any traditional application or language approach should work. Typically, the applications themselves replicate the storage requests, which are then written to each cluster's underlying data storage.

[Figure: Application-based asynchronous replication]
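
One way to picture application-based asynchronous replication is a writer that commits to its local store immediately and queues the same request for the other clusters. This is a minimal in-process sketch, not a production design: the `ReplicatingWriter` class is hypothetical, plain dictionaries stand in for each cluster's data storage, and retries, ordering guarantees, and conflict handling are omitted.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ReplicatingWriter:
    """Writes locally right away and replicates to remote stores later."""

    def __init__(self, local: Dict[str, str], remotes: List[Dict[str, str]]) -> None:
        self.local = local
        self.remotes = remotes
        self.pending: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        """Commit to the local store and queue the request for replication."""
        self.local[key] = value
        self.pending.append((key, value))

    def flush(self) -> int:
        """Apply queued writes to every remote store; return how many were sent."""
        sent = 0
        while self.pending:
            key, value = self.pending.popleft()
            for remote in self.remotes:
                remote[key] = value
            sent += 1
        return sent


if __name__ == "__main__":
    primary: Dict[str, str] = {}
    secondary: Dict[str, str] = {}
    writer = ReplicatingWriter(primary, [secondary])
    writer.write("order-1", "created")
    print("before flush:", secondary)
    writer.flush()
    print("after flush:", secondary)
```

Until `flush` runs, the secondary store lags behind the primary, which is the trade-off that asynchronous replication accepts in exchange for fast local writes.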

Next steps

This article focuses on business continuity and disaster recovery considerations for AKS clusters. For more information about cluster operations in AKS, see these articles about best practices: