IoT Hub high availability and disaster recovery

As a first step towards implementing a resilient IoT solution, architects, developers, and business owners must define the uptime goals for the solutions they're building. These goals are driven primarily by the specific business objectives of each scenario. In this context, the article Azure Business Continuity Technical Guidance describes a general framework to help you think about business continuity and disaster recovery. The Disaster recovery and high availability for Azure applications paper provides architecture guidance on strategies for Azure applications to achieve High Availability (HA) and Disaster Recovery (DR).

This article discusses the HA and DR features offered specifically by the IoT Hub service. The broad areas discussed in this article are:

  • Intra-region HA
  • Cross-region DR
  • Achieving cross-region HA

Depending on the uptime goals you define for your IoT solutions, determine which of the options outlined below best suits your business objectives. Incorporating any of these HA/DR alternatives into your IoT solution requires a careful evaluation of the trade-offs between the:

  • Level of resiliency you require
  • Implementation and maintenance complexity
  • Cost of goods sold (COGS) impact

Intra-region HA

The IoT Hub service provides intra-region HA by implementing redundancies in almost all layers of the service. The SLA published by the IoT Hub service is achieved by making use of these redundancies. Developers of an IoT solution don't need to do any additional work to take advantage of these HA features. Although IoT Hub offers a reasonably high uptime guarantee, transient failures can still be expected, as with any distributed computing platform. If you're just getting started with migrating your solutions to the cloud from an on-premises solution, your focus needs to shift from optimizing "mean time between failures" to optimizing "mean time to recover". In other words, transient failures are to be considered normal when the cloud is part of the solution. Appropriate retry policies must be built into the components that interact with a cloud application to deal with transient failures.
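As an illustration of the retry policies mentioned above, here is a minimal sketch of exponential backoff with jitter. The `TransientError` class, the `call_with_retry` helper, and the default delays are hypothetical names and values chosen for this example; they are not part of any IoT Hub SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Double the delay ceiling on each attempt, capped at max_delay,
            # and wait a random time below it (full jitter).
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The random jitter spreads out reconnect attempts so that many devices recovering from the same transient failure don't all retry at the same instant.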

Cross-region DR

In rare situations, a datacenter can experience extended outages due to power failures or other failures involving physical assets. Such events are rare, but when they occur, the intra-region HA capability described above may not help. IoT Hub provides multiple solutions for recovering from such extended outages.

The recovery options available to customers in such a situation are "Microsoft-initiated failover" and "manual failover". The fundamental difference between the two is that Microsoft initiates the former and the user initiates the latter. Also, manual failover provides a lower recovery time objective (RTO) than the Microsoft-initiated failover option. The specific RTOs offered with each option are discussed in the sections below. When either of these options is exercised to fail over an IoT hub from its primary region, the hub becomes fully functional in the corresponding Azure geo-paired region.

Both failover options offer the following recovery point objectives (RPOs):

Data type | Recovery point objective (RPO)
Identity registry | 0-5 minutes of data loss
Device twin data | 0-5 minutes of data loss
Cloud-to-device messages [1] | 0-5 minutes of data loss
Parent [1] and device jobs | 0-5 minutes of data loss
Device-to-cloud messages | All unread messages are lost
Operations monitoring messages | All unread messages are lost
Cloud-to-device feedback messages | All unread messages are lost

[1] Cloud-to-device messages and parent jobs are not recovered as part of manual failover.

Once the failover operation for the IoT hub completes, all operations from the device and back-end applications are expected to continue working without manual intervention. This means that your device-to-cloud messages should continue to work, and the entire device registry is intact. Events emitted via Event Grid can be consumed through the same subscriptions configured earlier, as long as those Event Grid subscriptions remain available.

Note

  • The Event Hub-compatible name and endpoint of the IoT Hub built-in Events endpoint change after failover. When receiving telemetry messages from the built-in endpoint using either the event hub client or the event processor host, use the IoT hub connection string to establish the connection. This ensures that your back-end applications continue to work after failover without manual intervention. If you use the Event Hub-compatible name and endpoint directly in your back-end application, you must fetch the new Event Hub-compatible name and endpoint after failover and reconfigure your application to continue operations.

  • When routing to storage, we recommend listing the blobs or files and then iterating over them, to ensure that all blobs or files are read without making any assumptions about partitions. The partition range could change during a Microsoft-initiated failover or manual failover. You can use the List Blobs API to enumerate the list of blobs, or the List ADLS Gen2 API for the list of files.
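The list-then-iterate recommendation for storage routing could be sketched as follows. This is an illustration rather than a reference implementation: `read_all_blobs` is a hypothetical helper, and it assumes a container client object that exposes `list_blobs()` and `download_blob(name).readall()`, as the `ContainerClient` class in the `azure-storage-blob` Python package does.

```python
def read_all_blobs(container_client):
    """Yield (name, content) for every blob, without assuming a partition layout.

    container_client is assumed to behave like azure.storage.blob.ContainerClient:
    list_blobs() yields items with a .name attribute, and download_blob(name)
    returns a downloader whose readall() returns the blob's bytes.
    """
    for blob in container_client.list_blobs():
        # Enumerate whatever exists instead of computing names from an
        # expected partition range, which may change after a failover.
        content = container_client.download_blob(blob.name).readall()
        yield blob.name, content
```

With the real SDK, the client would typically be created with something like `ContainerClient.from_connection_string(conn_str, container_name)`, where both arguments are placeholders for your own storage account values.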

Microsoft-initiated failover

Microsoft exercises Microsoft-initiated failover in rare situations to fail over all the IoT hubs from an affected region to the corresponding geo-paired region. This process is a default option (users cannot opt out) and requires no intervention from the user. Microsoft reserves the right to determine when this option is exercised. This mechanism doesn't ask for user consent before the user's hub is failed over. Microsoft-initiated failover has a recovery time objective (RTO) of 2-26 hours.

The RTO is large because Microsoft must perform the failover operation on behalf of all the affected customers in that region. If you are running a less critical IoT solution that can sustain a downtime of roughly a day, you can rely on this option to satisfy the overall disaster recovery goals for your IoT solution. The total time for runtime operations to become fully operational once this process is triggered is described in the "Time to recover" section.

Manual failover

If the RTO that Microsoft-initiated failover provides doesn't satisfy your business uptime goals, consider using manual failover to trigger the failover process yourself. The RTO for this option can range from 10 minutes to a couple of hours. The RTO is currently a function of the number of devices registered against the IoT hub instance being failed over. You can expect the RTO for a hub hosting approximately 100,000 devices to be around 15 minutes. The total time for runtime operations to become fully operational once this process is triggered is described in the "Time to recover" section.

The manual failover option is always available, regardless of whether the primary region is experiencing downtime. Therefore, this option can also be used to perform planned failovers. One example use of planned failovers is to perform periodic failover drills. Be aware, though, that a planned failover operation causes downtime for the hub for the period defined by the RTO for this option, and also causes data loss as defined by the RPO table above. Consider setting up a test IoT hub instance to exercise the planned failover option periodically, to gain confidence in your ability to get your end-to-end solutions up and running when a real disaster happens.

Important

  • Test drills should not be performed on IoT hubs that are being used in your production environments.

  • Manual failover should not be used as a mechanism to permanently migrate your hub between the Azure geo-paired regions. Doing so would increase the latency of operations performed against the hub from devices homed in the old primary region.

Failback

You can fail back to the old primary region by triggering the failover action another time. If the original failover operation was performed to recover from an extended outage in the original primary region, we recommend that the hub be failed back to the original location once that location has recovered from the outage.

Important

  • Users are only allowed to perform 2 successful failover and 2 successful failback operations per day.

  • Back-to-back failover/failback operations are not allowed. You must wait for 1 hour between these operations.
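If you script failover drills, a small pre-check along these lines might help avoid requests that would violate the limits above. This is only a local sketch: `can_trigger` and the history format are hypothetical, and "per day" is interpreted here as a rolling 24-hour window, which is an assumption rather than something the article specifies.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(hours=1)  # required wait between consecutive operations
MAX_PER_DAY = 2               # successful operations of each kind per day


def can_trigger(kind, history, now):
    """Return True if a 'failover' or 'failback' may be triggered at `now`.

    history is a list of (kind, datetime) tuples for past successful operations.
    """
    if history:
        last_time = max(when for _, when in history)
        if now - last_time < MIN_GAP:
            return False  # back-to-back operations are not allowed
    # Assumption: "per day" is treated as the preceding 24 hours.
    same_kind_recent = [when for k, when in history
                        if k == kind and now - when < timedelta(days=1)]
    return len(same_kind_recent) < MAX_PER_DAY
```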

Time to recover

While the FQDN (and therefore the connection string) of the IoT hub instance remains the same after failover, the underlying IP address changes. Therefore, the overall time for runtime operations against your IoT hub instance to become fully operational after the failover process is triggered can be expressed using the following function.

Time to recover = RTO [10 min - 2 hours for manual failover | 2 - 26 hours for Microsoft-initiated failover] + DNS propagation delay + Time taken by the client application to refresh any cached IoT Hub IP address.
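To make the formula concrete, the following sketch computes a best-case and worst-case recovery time. The RTO ranges come from this article; the function name and the DNS and cache-refresh inputs are placeholders that you would replace with values measured in your own environment.

```python
from datetime import timedelta

# RTO ranges stated in this article, as (best case, worst case).
RTO = {
    "manual": (timedelta(minutes=10), timedelta(hours=2)),
    "microsoft-initiated": (timedelta(hours=2), timedelta(hours=26)),
}


def time_to_recover(option, dns_propagation, client_cache_refresh):
    """Return (best, worst) total recovery time for the given failover option."""
    best_rto, worst_rto = RTO[option]
    overhead = dns_propagation + client_cache_refresh
    return best_rto + overhead, worst_rto + overhead
```

For example, with an assumed 5-minute DNS propagation delay and a 1-minute client cache refresh, `time_to_recover("manual", timedelta(minutes=5), timedelta(minutes=1))` gives a range of 16 minutes to 2 hours 6 minutes.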

Important

The IoT SDKs do not cache the IP address of the IoT hub. We recommend that user code interfacing with the SDKs also not cache the IP address of the IoT hub.
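For user code that opens its own connections, one way to follow this recommendation is to resolve the hub's hostname on every connection attempt instead of storing the result. The sketch below is illustrative: `resolve_hub_address` is a hypothetical helper, the default port 8883 assumes MQTT, and the `resolver` parameter exists only so the lookup can be replaced in tests.

```python
import socket


def resolve_hub_address(hostname, port=8883, resolver=socket.getaddrinfo):
    """Look up the hub's current IP address; call this on every (re)connect."""
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address.
    return infos[0][4][0]
```

Because the function keeps no state between calls, a reconnect after failover picks up the new address as soon as DNS returns it.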

Achieve cross-region HA

If the RTO that either the Microsoft-initiated failover or manual failover option provides doesn't satisfy your business uptime goals, consider implementing a per-device automatic cross-region failover mechanism. A complete treatment of deployment topologies in IoT solutions is outside the scope of this article. This article discusses only the regional failover deployment model, for the purpose of high availability and disaster recovery.

In a regional failover model, the solution back end runs primarily in one datacenter location. A secondary IoT hub and back end are deployed in another datacenter location. If the IoT hub in the primary region suffers an outage, or the network connectivity from the device to the primary region is interrupted, devices use a secondary service endpoint. You can improve the solution availability by implementing a cross-region failover model instead of staying within a single region.
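The device-side behavior in this model can be sketched as an ordered list of endpoints that the device tries in turn. This is a simplified illustration: `connect_with_fallback` and the `connect` callable are hypothetical, and a real device would combine this with retries and with whatever mechanism the solution uses to announce the active endpoint.

```python
def connect_with_fallback(endpoints, connect):
    """Try each endpoint in order; return (endpoint, connection) for the first success.

    endpoints: hostnames ordered by preference, for example [primary, secondary].
    connect:   caller-supplied function that opens a connection to one endpoint
               and raises ConnectionError if the endpoint is unreachable.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, exc))  # remember the failure, try the next one
    raise ConnectionError(f"all endpoints failed: {errors}")
```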

At a high level, to implement a regional failover model with IoT Hub, you need to take the following steps:

  • A secondary IoT hub and device routing logic: If service in your primary region is disrupted, devices must start connecting to your secondary region. Because most of the services involved are state-aware, solution administrators commonly trigger the inter-region failover process themselves. The best way to communicate the new endpoint to devices, while maintaining control of the process, is to have them regularly check a concierge service for the current active endpoint. The concierge service can be a web application that is replicated and kept reachable using DNS-redirection techniques (for example, using Azure Traffic Manager).

    Note

    The IoT Hub service is not a supported endpoint type in Azure Traffic Manager. We recommend integrating the proposed concierge service with Azure Traffic Manager by having it implement the endpoint health probe API.

  • Identity registry replication: To be usable, the secondary IoT hub must contain all device identities that can connect to the solution. The solution should keep geo-replicated backups of device identities, and upload them to the secondary IoT hub before switching the active endpoint for the devices. The device identity export functionality of IoT Hub is useful in this context. For more information, see IoT Hub developer guide - identity registry.

  • Merging logic: When the primary region becomes available again, all the state and data that have been created in the secondary site must be migrated back to the primary region. This state and data mostly relate to device identities and application metadata, which must be merged with the primary IoT hub and any other application-specific stores in the primary region.

To simplify this step, use idempotent operations. Idempotent operations minimize the side effects of the eventually consistent distribution of events, and of duplicate or out-of-order delivery of events. In addition, design the application logic to tolerate potential inconsistencies or slightly out-of-date state. This situation can occur because of the additional time it takes for the system to heal, based on recovery point objectives (RPO).
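As one possible shape for an idempotent merge, the sketch below folds records created in the secondary region into the primary store and keeps the record with the higher version. The record layout and the `version` field are hypothetical; the point is only that applying the same merge twice, or receiving duplicate records, leaves the result unchanged.

```python
def merge_device_records(primary, secondary):
    """Merge secondary-region records into the primary store; return the result.

    Both stores map device_id -> record, where each record is a dict with an
    integer "version" (a hypothetical field). The higher version wins.
    """
    merged = dict(primary)
    for device_id, record in secondary.items():
        current = merged.get(device_id)
        if current is None or record["version"] > current["version"]:
            merged[device_id] = record
    return merged
```

Because the outcome depends only on the versions and not on the order or number of times records arrive, replaying the merge after a partial failure is safe.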

Choose the right HA/DR option

Here's a summary of the HA/DR options presented in this article. Use it as a frame of reference to choose the option that works for your solution.

HA/DR option | RTO | RPO | Requires manual intervention? | Implementation complexity | Additional cost impact
Microsoft-initiated failover | 2 - 26 hours | See the RPO table above | No | None | None
Manual failover | 10 min - 2 hours | See the RPO table above | Yes | Very low. You only need to trigger this operation from the portal. | None
Cross-region HA | < 1 min | Depends on the replication frequency of your custom HA solution | No | High | More than 1x the cost of one IoT hub
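The RTO column of this summary can be turned into a simple first-pass selection rule, sketched below. It is deliberately narrow: `simplest_option` is a hypothetical helper that only compares worst-case RTO values from the table against the downtime you can tolerate, and it ignores RPO, cost, and the DNS and client-cache delays described in the "Time to recover" section.

```python
from datetime import timedelta

# Worst-case RTO per option, taken from the summary table and ordered from
# least to most implementation complexity. "< 1 min" is treated as 1 minute.
OPTIONS = [
    ("Microsoft-initiated failover", timedelta(hours=26)),
    ("Manual failover", timedelta(hours=2)),
    ("Cross-region HA", timedelta(minutes=1)),
]


def simplest_option(max_tolerated_downtime):
    """Return the least complex option whose worst-case RTO fits, or None."""
    for name, worst_case_rto in OPTIONS:
        if worst_case_rto <= max_tolerated_downtime:
            return name
    return None
```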

Next steps