Disaster recovery in Azure Service Fabric

A critical part of delivering high availability is ensuring that services can survive all different types of failures. This is especially important for failures that are unplanned and outside of your control. This article describes some common failure modes that could be disasters if not modeled and managed correctly. It also discusses mitigations and actions to take if a disaster happens anyway. The goal is to limit or eliminate the risk of downtime or data loss when failures, planned or otherwise, occur.

Avoiding disaster

Service Fabric's primary goal is to help you model both your environment and your services in such a way that common failure types are not disasters.

In general, there are two types of disaster/failure scenarios:

  1. Hardware or software faults
  2. Operational faults

Hardware and software faults

Hardware and software faults are unpredictable. The easiest way to survive faults is running more copies of the service spanned across hardware or software fault boundaries. For example, if your service is running only on one particular machine, then the failure of that one machine is a disaster for that service. The simple way to avoid this disaster is to ensure that the service is actually running on multiple machines. Testing is also necessary to ensure the failure of one machine doesn't disrupt the running service. Capacity planning ensures a replacement instance can be created elsewhere and that the reduction in capacity doesn't overload the remaining services. The same pattern works regardless of what failure you're trying to avoid. For example, if you're concerned about the failure of a SAN, you run across multiple SANs. If you're concerned about the loss of a rack of servers, you run across multiple racks. If you're worried about the loss of datacenters, your service should run across multiple Azure regions or datacenters.

When running in this type of spanned mode, you're still subject to some types of simultaneous failures, but single and even multiple failures of a particular type (for example, a single VM or network link failing) are automatically handled, and so are no longer a "disaster". Service Fabric provides many mechanisms for expanding the cluster and handles bringing failed nodes and services back. Service Fabric also allows running many instances of your services to keep these types of unplanned failures from turning into real disasters.

There may be reasons why running a deployment large enough to span failures is not feasible. For example, it may take more hardware resources than you're willing to pay for relative to the chance of failure. When dealing with distributed applications, additional communication hops or state replication costs across geographic distances may cause unacceptable latency. Where this line is drawn differs for each application. For software faults specifically, the fault could be in the service that you are trying to scale. In this case more copies don't prevent the disaster, since the failure condition is correlated across all the instances.

Operational faults

Even if your service is spanned across the globe with many redundancies, it can still experience disastrous events. For example, someone might accidentally reconfigure the DNS name for the service, or delete it outright. As an example, let's say you had a stateful Service Fabric service, and someone deleted that service accidentally. Unless there's some other mitigation, that service and all of the state it had are now gone. These types of operational disasters ("oops") require different mitigations and steps for recovery than regular unplanned failures.

The best ways to avoid these types of operational faults are to:

  1. restrict operational access to the environment
  2. strictly audit dangerous operations
  3. impose automation, prevent manual or out-of-band changes, and validate specific changes against the actual environment before enacting them
  4. ensure that destructive operations are "soft". Soft operations don't take effect immediately or can be undone within some time window

Service Fabric provides some mechanisms to prevent operational faults, such as role-based access control for cluster operations. However, most of these operational faults require organizational efforts and other systems. Service Fabric does provide some mechanisms for surviving operational faults, most notably backup and restore for stateful services.

Managing failures

The goal of Service Fabric is almost always automatic management of failures. However, to handle some types of failures, services must have additional code. Other types of failures should not be automatically addressed, for safety and business continuity reasons.

Handling single failures

Single machines can fail for all sorts of reasons. Some of these are hardware causes, like power supply and networking hardware failures. Other failures are in software, including failures of the actual operating system and of the service itself. Service Fabric automatically detects these types of failures, including cases where a machine becomes isolated from other machines due to network issues.

Regardless of the type of service, running a single instance results in downtime for that service if that single copy of the code fails for any reason.

To handle any single failure, the simplest thing you can do is ensure that your services run on more than one node by default. For stateless services, this can be accomplished by setting InstanceCount greater than 1. For stateful services, the minimum recommendation is always a TargetReplicaSetSize and MinReplicaSetSize of at least 3. Running more copies of your service code ensures that your service can handle any single failure automatically.
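As a minimal sketch of applying these settings when creating services through the System.Fabric APIs (the cluster endpoint, application, and service names here are hypothetical, and the application type is assumed to already be registered in the cluster):

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

class ServiceSetup
{
    static async Task Main()
    {
        // Connect to the cluster's client endpoint (hypothetical address).
        var fabricClient = new FabricClient("localhost:19000");

        // Stateless service: more than one instance, so a single node
        // failure never takes the whole service down.
        var stateless = new StatelessServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),
            ServiceName = new Uri("fabric:/MyApp/MyStatelessService"),
            ServiceTypeName = "MyStatelessServiceType",
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription(),
            InstanceCount = 3 // or -1 to run on every valid node
        };
        await fabricClient.ServiceManagementClient.CreateServiceAsync(stateless);

        // Stateful service: at least three replicas, so losing any single
        // replica (including the Primary) does not cost quorum.
        var stateful = new StatefulServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),
            ServiceName = new Uri("fabric:/MyApp/MyStatefulService"),
            ServiceTypeName = "MyStatefulServiceType",
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription(),
            HasPersistedState = true,
            TargetReplicaSetSize = 3,
            MinReplicaSetSize = 3
        };
        await fabricClient.ServiceManagementClient.CreateServiceAsync(stateful);
    }
}
```

The same values can equally be set declaratively in the application manifest; the API form is shown only to make the parameters concrete.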

Handling coordinated failures

Coordinated failures can happen in a cluster due to planned or unplanned infrastructure failures and changes, or planned software changes. Service Fabric models the infrastructure zones that experience coordinated failures as fault domains. Areas that will experience coordinated software changes are modeled as upgrade domains. More information about fault and upgrade domains is in the documentation describing cluster topology and definition.

By default, Service Fabric considers fault and upgrade domains when planning where your services should run, and tries to ensure that your services run across several fault and upgrade domains so that your services remain available if planned or unplanned changes happen.

For example, let's say that failure of a power source causes a rack of machines to fail simultaneously. With multiple copies of the service running, the loss of many machines in a fault domain turns into just another example of a single failure for a given service. This is why managing fault domains is critical to ensuring high availability of your services. When running Service Fabric in Azure, fault domains are managed automatically. In other environments, they may not be. If you're building your own clusters on premises, be sure to map and plan your fault domain layout correctly.

Upgrade domains are useful for modeling areas where software is going to be upgraded at the same time. Because of this, upgrade domains also often define the boundaries where software is taken down during planned upgrades. Upgrades of both Service Fabric and your services follow the same model. For more on rolling upgrades, upgrade domains, and the Service Fabric health model that helps prevent unintended changes from impacting the cluster and your service, see the documentation on cluster and application upgrades.

You can visualize the layout of your cluster using the cluster map provided in Service Fabric Explorer:

[Image: Nodes spread across fault domains in Service Fabric Explorer]

Note

Modeling areas of failure, rolling upgrades, running many instances of your service code and state, placement rules to ensure your services run across fault and upgrade domains, and built-in health monitoring are just some of the features that Service Fabric provides in order to keep normal operational issues and failures from turning into disasters.

Handling simultaneous hardware or software failures

Above we talked about single failures. As you can see, single failures are easy to handle for both stateless and stateful services just by keeping more copies of the code (and state) running across fault and upgrade domains. Multiple simultaneous random failures can also happen, and these are more likely to lead to an actual disaster.

Random failures leading to service failures

Let's say that the service had an InstanceCount of 5, and several nodes running those instances all failed at the same time. Service Fabric responds by automatically creating replacement instances on other nodes. It will continue creating replacements until the service is back to its desired instance count. As another example, let's say there was a stateless service with an InstanceCount of -1, meaning it runs on all valid nodes in the cluster. Let's say that some of those instances were to fail. In this case, Service Fabric notices that the service is not in its desired state and tries to create the instances on the nodes where they are missing.

For stateful services, the situation depends on whether the service has persisted state or not. It also depends on how many replicas the service had and how many failed. Determining whether a disaster occurred for a stateful service, and managing it, follows three stages:

  1. Determining if there has been quorum loss or not

    • A quorum loss is any time a majority of the replicas of a stateful service are down at the same time, including the Primary.
  2. Determining if the quorum loss is permanent or not

    • Most of the time, failures are transient. Processes are restarted, nodes are restarted, VMs are relaunched, and network partitions heal. Sometimes, though, failures are permanent.
      • For services without persisted state, a failure of a quorum or more of the replicas results immediately in permanent quorum loss. When Service Fabric detects quorum loss in a stateful non-persistent service, it immediately proceeds to step 3 by declaring (potential) data loss. Proceeding to data loss makes sense because Service Fabric knows that there's no point in waiting for the replicas to come back; even if they were recovered, they would be empty.
      • For stateful persistent services, a failure of a quorum or more of the replicas causes Service Fabric to start waiting for the replicas to come back and restore quorum. This results in a service outage for any writes to the affected partition (or "replica set") of the service. However, reads may still be possible with reduced consistency guarantees. The default amount of time that Service Fabric waits for quorum to be restored is infinite, since proceeding is a (potential) data loss event and carries other risks. Overriding the default QuorumLossWaitDuration value is possible but is not recommended. Instead, at this point, all efforts should be made to restore the down replicas. This requires bringing the nodes that are down back up, and ensuring that they can remount the drives where they stored the local persistent state. If the quorum loss is caused by process failure, Service Fabric automatically tries to recreate the processes and restart the replicas inside them. If this fails, Service Fabric reports health errors. If these can be resolved, the replicas usually come back. Sometimes, though, the replicas can't be brought back. For example, the drives may all have failed, or the machines may have been physically destroyed somehow. In these cases, there is now a permanent quorum loss event. To tell Service Fabric to stop waiting for the down replicas to come back, a cluster administrator must determine which partitions of which services are affected and call the Repair-ServiceFabricPartition -PartitionId cmdlet or the System.Fabric.FabricClient.ClusterManagementClient.RecoverPartitionAsync(Guid partitionId) API (see the sketch after this list). This API allows specifying the ID of the partition to move out of quorum loss and into potential data loss.

    Note

    It is never safe to use this API other than in a targeted way against specific partitions.

  3. Determining if there has been actual data loss, and restoring from backups

    • When Service Fabric calls the OnDataLossAsync method, it is always because of suspected data loss. Service Fabric ensures that this call is delivered to the best remaining replica, which is whichever replica has made the most progress. The reason we always say suspected data loss is that it is possible the remaining replica actually has all the same state as the Primary did when it went down. However, without that state to compare it to, there's no good way for Service Fabric or operators to know for sure. At this point, Service Fabric also knows the other replicas are not coming back; that was the decision made when we stopped waiting for the quorum loss to resolve itself. The best course of action for the service is usually to freeze and wait for specific administrative intervention. So what does a typical implementation of the OnDataLossAsync method do? (A sketch follows after this list.)
    • First, log that OnDataLossAsync has been triggered, and fire off any necessary administrative alerts.
    • Usually at this point, the service pauses and waits for further decisions and manual actions to be taken. This is because even if backups are available, they may need to be prepared. For example, if two different services coordinate information, those backups may need to be modified to ensure that, once the restore happens, the information those two services care about is consistent.
    • Often there is also some other telemetry or exhaust from the service. This metadata may be contained in other services or in logs. This information can be used to determine if there were any calls received and processed at the primary that were not present in the backup or replicated to this particular replica. These may need to be replayed or added to the backup before restoration is feasible.
    • Next, compare the remaining replica's state to that contained in any backups that are available. If you're using the Service Fabric reliable collections, there are tools and processes available for doing so. The goal is to see if the state within the replica is sufficient, and also what the backup may be missing.
    • Once the comparison is done, and if necessary the restore is completed, the service code should return true if any state changes were made. If the replica determined that it was the best available copy of the state and made no changes, return false. True indicates that any other remaining replicas may now be inconsistent with this one; they will be dropped and rebuilt from this replica. False indicates that no state changes were made, so the other replicas can keep what they have.
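As referenced in step 2 above, a minimal sketch of declaring permanent quorum loss for one specific partition might look like the following (the endpoint and partition ID are hypothetical placeholders; per the note above, only invoke this against specific partitions you have confirmed cannot recover):

```csharp
using System;
using System.Fabric;
using System.Threading.Tasks;

class QuorumLossRepair
{
    static async Task Main()
    {
        var fabricClient = new FabricClient("localhost:19000");

        // Hypothetical ID of a partition confirmed to be in permanent quorum loss.
        var partitionId = new Guid("00000000-0000-0000-0000-000000000000");

        // Tell Service Fabric to stop waiting for the down replicas: this moves
        // the partition out of quorum loss and into (potential) data loss.
        await fabricClient.ClusterManagementClient.RecoverPartitionAsync(partitionId);
    }
}
```

The PowerShell equivalent is Repair-ServiceFabricPartition -PartitionId with the same partition ID.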
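And as a sketch of the OnDataLossAsync flow from step 3, for a service built on the Reliable Services StatefulService base class (ServiceEventSource is the logger generated by the Visual Studio service template; OperatorWorkflow and BackupDecision are hypothetical stand-ins for your own alerting and restore-decision logic):

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

public class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context)
        : base(context) { }

    protected override async Task<bool> OnDataLossAsync(
        RestoreContext restoreCtx, CancellationToken cancellationToken)
    {
        // Step 1: log that suspected data loss occurred and alert operators.
        ServiceEventSource.Current.ServiceMessage(
            this.Context, "OnDataLossAsync triggered; alerting operators.");

        // Steps 2-3: pause for an operator decision; backups may first need
        // to be prepared or amended from telemetry/exhaust before restoring.
        // (Hypothetical helper representing that manual workflow.)
        BackupDecision decision = await OperatorWorkflow.AwaitDecisionAsync(cancellationToken);

        if (decision.RestoreFromBackup)
        {
            // Steps 4-5: restore the chosen backup into this replica.
            // Returning true drops the other remaining replicas and
            // rebuilds them from this one.
            var restoreDescription = new RestoreDescription(decision.BackupFolderPath);
            await restoreCtx.RestoreAsync(restoreDescription, cancellationToken);
            return true;
        }

        // This replica was judged the best available copy of the state and
        // was not changed, so the other replicas keep what they have.
        return false;
    }
}
```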

It is critically important that service authors practice potential data loss and failure scenarios before services are ever deployed in production. To protect against the possibility of data loss, it is important to periodically back up the state of any of your stateful services to a geo-redundant store. You must also ensure that you have the ability to restore it. Since backups of many different services are taken at different times, you need to ensure that after a restore your services have a consistent view of each other. For example, consider a situation where one service generates a number and stores it, then sends it to another service that also stores it. After a restore, you might discover that the second service has the number but the first does not, because its backup didn't include that operation.
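A minimal sketch of taking such a backup from a Reliable Services stateful service (GeoRedundantStore is a hypothetical helper standing in for your own upload logic to geo-redundant storage):

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

public class BackedUpService : StatefulService
{
    public BackedUpService(StatefulServiceContext context)
        : base(context) { }

    // Call this periodically (for example, from a timer in RunAsync) to
    // take a full backup and push it off-cluster.
    public async Task TakeFullBackupAsync()
    {
        var description = new BackupDescription(BackupOption.Full, PostBackupCallbackAsync);
        await this.BackupAsync(description);
    }

    private async Task<bool> PostBackupCallbackAsync(
        BackupInfo backupInfo, CancellationToken cancellationToken)
    {
        // backupInfo.Directory is the locally produced backup folder.
        // Hypothetical helper: copy it to geo-redundant storage.
        await GeoRedundantStore.UploadAsync(backupInfo.Directory, cancellationToken);
        return true; // report that the backup was persisted successfully
    }
}
```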

If you find that the remaining replicas are insufficient to continue from in a data loss scenario, and you can't reconstruct service state from telemetry or exhaust, the frequency of your backups determines your best possible recovery point objective (RPO). Service Fabric provides many tools for testing various failure scenarios, including permanent quorum and data loss requiring restoration from a backup. These scenarios are included as part of Service Fabric's testability tools, managed by the Fault Analysis Service. More info on those tools and patterns is available in the testability documentation.
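As a sketch of driving one of these drills through the Fault Analysis Service from C# (the endpoint and service name are hypothetical):

```csharp
using System;
using System.Fabric;
using System.Threading.Tasks;

class DataLossDrill
{
    static async Task Main()
    {
        var fabricClient = new FabricClient("localhost:19000");

        // Pick a random partition of a (hypothetical) stateful service.
        var selector = PartitionSelector.RandomOf(new Uri("fabric:/MyApp/MyStatefulService"));

        // The operation ID is used to poll the Fault Analysis Service for progress.
        var operationId = Guid.NewGuid();

        // Induce full data loss on that partition, driving the service
        // through its OnDataLossAsync path.
        await fabricClient.TestManager.StartPartitionDataLossAsync(
            operationId, selector, DataLossMode.FullDataLoss);

        var progress = await fabricClient.TestManager.GetPartitionDataLossProgressAsync(operationId);
        Console.WriteLine($"Data loss drill state: {progress.State}");
    }
}
```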

Note

System services can also suffer quorum loss, with the impact being specific to the service in question. For instance, quorum loss in the Naming service impacts name resolution, whereas quorum loss in the Failover Manager service blocks new service creation and failovers. While the Service Fabric system services follow the same pattern as your services for state management, it is not recommended that you attempt to move them out of quorum loss and into potential data loss. Instead, seek support to determine a solution that is targeted to your specific situation. Usually it is preferable to simply wait until the down replicas return.

Availability of the Service Fabric cluster

Generally speaking, the Service Fabric cluster itself is a highly distributed environment with no single points of failure. A failure of any one node will not cause availability or reliability issues for the cluster, primarily because the Service Fabric system services follow the same guidelines provided earlier: they always run with three or more replicas by default, and those system services that are stateless run on all nodes. The underlying Service Fabric networking and failure detection layers are fully distributed. Most system services can be rebuilt from metadata in the cluster, or know how to resynchronize their state from other places. The availability of the cluster can become compromised if system services get into quorum loss situations like those described above. In these cases, you may not be able to perform certain operations on the cluster, such as starting an upgrade or deploying new services, but the cluster itself is still up. Services that are already running will remain running in these conditions, unless they require writes to the system services to continue functioning. For example, if the Failover Manager is in quorum loss, all services will continue to run, but any services that fail will not be able to automatically restart, since this requires the involvement of the Failover Manager.

Failures of a datacenter or Azure region

In rare cases, a physical datacenter can become temporarily unavailable due to loss of power or network connectivity. In these cases, your Service Fabric clusters and services in that datacenter or Azure region will be unavailable. However, your data is preserved. For clusters running in Azure, you can view updates on outages on the Azure status page. In the highly unlikely event that a physical datacenter is partially or fully destroyed, any Service Fabric clusters hosted there, or the services inside them, could be lost. This includes any state not backed up outside of that datacenter or region.

There are two different strategies for surviving the permanent or sustained failure of a single datacenter or region.

  1. Run separate Service Fabric clusters in multiple such regions, and utilize some mechanism for failover and fail-back between these environments. This sort of multi-cluster active-active or active-passive model requires additional management and operations code. It also requires coordination of backups from the services in one datacenter or region so that they are available in other datacenters or regions when one fails.
  2. Run a single Service Fabric cluster that spans multiple datacenters or regions. The minimum supported configuration for this is three datacenters or regions; the recommended number is five. This requires a more complex cluster topology. However, the benefit of this model is that failure of one datacenter or region is converted from a disaster into a normal failure. These failures can be handled by the mechanisms that work for clusters within a single region. Fault domains, upgrade domains, and Service Fabric's placement rules ensure workloads are distributed so that they tolerate normal failures. For more information on policies that can help operate services in this type of cluster, read up on placement policies; a sketch follows after this list.
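As a sketch of the second approach, a placement policy can require that replicas of a partition be distributed across fault domains, so that losing one datacenter or region costs at most one replica (the endpoint and names here are hypothetical):

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

class MultiRegionPlacement
{
    static async Task Main()
    {
        var fabricClient = new FabricClient("localhost:19000");

        var description = new StatefulServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),
            ServiceName = new Uri("fabric:/MyApp/MySpanningService"),
            ServiceTypeName = "MyStatefulServiceType",
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription(),
            HasPersistedState = true,
            TargetReplicaSetSize = 5, // one replica per datacenter/region
            MinReplicaSetSize = 3
        };

        // Require that no two replicas of a partition share a fault domain.
        description.PlacementPolicies.Add(
            new ServicePlacementRequireDomainDistributionPolicyDescription());

        await fabricClient.ServiceManagementClient.CreateServiceAsync(description);
    }
}
```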

Random failures leading to cluster failures

Service Fabric has the concept of seed nodes. These are nodes that maintain the availability of the underlying cluster. They help ensure the cluster remains up by establishing leases with other nodes and serving as tiebreakers during certain kinds of network failures. If random failures remove a majority of the seed nodes in the cluster and they are not brought back, the cluster federation ring collapses because seed node quorum has been lost, and the cluster fails. In Azure, the Service Fabric resource provider manages Service Fabric cluster configurations and, by default, distributes seed nodes across the primary node type's fault and upgrade domains. If the primary node type is marked as Silver or Gold durability, then when you remove a seed node (either by scaling in your primary node type or by manually removing one), the cluster attempts to promote another non-seed node from the primary node type's available capacity; this fails if you have less available capacity than your cluster reliability level requires for your primary node type.

In both standalone Service Fabric clusters and Azure, the "Primary Node Type" is the one that runs the seeds. When defining a primary node type, Service Fabric automatically takes advantage of the number of nodes provided by creating up to nine seed nodes and seven replicas of each of the system services. If a set of random failures takes out a majority of those system service replicas simultaneously, the system services enter quorum loss, as described above. If a majority of the seed nodes are lost, the cluster shuts down soon after.

Next steps