Disaster recovery in Azure Service Fabric

A critical part of delivering high availability is ensuring that services can survive all different types of failures. This is especially important for failures that are unplanned and outside your control.

This article describes some common failure modes that might be disasters if not modeled and managed correctly. It also discusses mitigations and actions to take if a disaster happens anyway. The goal is to limit or eliminate the risk of downtime or data loss when failures, planned or otherwise, occur.

Avoiding disaster

The main goal of Azure Service Fabric is to help you model both your environment and your services in such a way that common failure types are not disasters.

In general, there are two types of disaster/failure scenarios:

  • Hardware and software faults
  • Operational faults

Hardware and software faults

Hardware and software faults are unpredictable. The easiest way to survive faults is to run more copies of the service across hardware or software fault boundaries.

For example, if your service is running on only one machine, the failure of that one machine is a disaster for that service. The simple way to avoid this disaster is to ensure that the service is running on multiple machines. Testing is also necessary to ensure that the failure of one machine doesn't disrupt the running service. Capacity planning ensures that a replacement instance can be created elsewhere and that the reduction in capacity doesn't overload the remaining services.

The same pattern works regardless of which failure you're trying to avoid. For example, if you're concerned about the failure of a SAN, you run across multiple SANs. If you're concerned about the loss of a rack of servers, you run across multiple racks. If you're worried about the loss of datacenters, your service should run across multiple Azure regions, or across your own datacenters.

When a service spans multiple physical instances (machines, racks, datacenters, regions), you're still subject to some types of simultaneous failures. But single and even multiple failures of a particular type (for example, a single virtual machine or network link failing) are automatically handled and so are no longer a "disaster."

Service Fabric provides mechanisms for expanding the cluster and handles bringing failed nodes and services back. Service Fabric also allows running many instances of your services to prevent unplanned failures from turning into real disasters.

There might be reasons why running a deployment large enough to span failures isn't feasible. For example, it might take more hardware resources than you're willing to pay for relative to the chance of failure. When you're dealing with distributed applications, additional communication hops or state replication costs across geographic distances might cause unacceptable latency. Where this line is drawn differs for each application.

For software faults specifically, the fault might be in the service that you're trying to scale. In this case, more copies don't prevent the disaster, because the failure condition is correlated across all the instances.

Operational faults

Even if your service is spanned across the globe with many redundancies, it can still experience disastrous events. For example, someone might accidentally reconfigure the DNS name for the service, or delete it outright.

As an example, let's say you had a stateful Service Fabric service, and someone deleted that service accidentally. Unless there's some other mitigation, that service and all of the state that it had are now gone. These types of operational disasters ("oops") require different mitigations and steps for recovery than regular unplanned failures.

The best ways to avoid these types of operational faults are to:

  • Restrict operational access to the environment.
  • Strictly audit dangerous operations.
  • Impose automation, prevent manual or out-of-band changes, and validate specific changes against the environment before enacting them.
  • Ensure that destructive operations are "soft." Soft operations don't take effect immediately or can be undone within a time window.

Service Fabric provides some mechanisms to prevent operational faults, such as role-based access control for cluster operations. However, most of these operational faults require organizational efforts and other systems. Service Fabric does provide mechanisms for surviving operational faults, most notably backup and restore for stateful services.

Managing failures

The goal of Service Fabric is automatic management of failures. But to handle some types of failures, services must have additional code. Other types of failures should not be automatically addressed, for safety and business continuity reasons.

Handling single failures

Single machines can fail for all sorts of reasons. Sometimes the causes are hardware related, like power supply or network hardware failures. Other failures are in software. These include failures of the operating system and of the service itself. Service Fabric automatically detects these types of failures, including cases where the machine becomes isolated from other machines because of network problems.

Regardless of the type of service, running a single instance results in downtime for that service if that single copy of the code fails for any reason.

To handle any single failure, the simplest thing you can do is ensure that your services run on more than one node by default. For stateless services, make sure that InstanceCount is greater than 1. For stateful services, the minimum recommendation is that TargetReplicaSetSize and MinReplicaSetSize are both set to 3. Running more copies of your service code ensures that your service can handle any single failure automatically.
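
As a rough illustration of these settings, the following sketch creates one stateless service with more than one instance and one stateful service with a replica set size of 3, using the System.Fabric client APIs. It assumes a locally reachable cluster and an already-deployed application; the application, service, and type names are hypothetical.

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

class RedundantServiceSetup
{
    static async Task Main()
    {
        // Connects to the local cluster; supply your cluster endpoint and credentials otherwise.
        var fabricClient = new FabricClient();

        // Stateless service: more than one instance so a single node failure isn't an outage.
        var stateless = new StatelessServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),                 // hypothetical names
            ServiceName = new Uri("fabric:/MyApp/MyStatelessService"),
            ServiceTypeName = "MyStatelessServiceType",
            InstanceCount = 3,                                          // greater than 1, per the guidance above
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription()
        };
        await fabricClient.ServiceManager.CreateServiceAsync(stateless);

        // Stateful service: at least three replicas so a single replica failure keeps quorum.
        var stateful = new StatefulServiceDescription
        {
            ApplicationName = new Uri("fabric:/MyApp"),
            ServiceName = new Uri("fabric:/MyApp/MyStatefulService"),
            ServiceTypeName = "MyStatefulServiceType",
            HasPersistedState = true,
            TargetReplicaSetSize = 3,
            MinReplicaSetSize = 3,
            PartitionSchemeDescription = new SingletonPartitionSchemeDescription()
        };
        await fabricClient.ServiceManager.CreateServiceAsync(stateful);
    }
}
```

The same values can also be set declaratively in the application manifest or through the equivalent PowerShell cmdlets; the C# form is shown here only to keep all examples in one language.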

Handling coordinated failures

Coordinated failures in a cluster can be due to either planned or unplanned infrastructure failures and changes, or to planned software changes. Service Fabric models infrastructure zones that experience coordinated failures as fault domains. Areas that will experience coordinated software changes are modeled as upgrade domains. For more information about fault domains, upgrade domains, and cluster topology, see Describe a Service Fabric cluster by using Cluster Resource Manager.

By default, Service Fabric considers fault and upgrade domains when planning where your services should run. It tries to ensure that your services run across several fault and upgrade domains so that if planned or unplanned changes happen, your services remain available.

For example, let's say that failure of a power source causes all the machines on a rack to fail simultaneously. With multiple copies of the service running, the loss of many machines in a fault-domain failure turns into just another example of a single failure for the service. This is why managing fault and upgrade domains is critical to ensuring high availability of your services.

When you're running Service Fabric in Azure, fault domains and upgrade domains are managed automatically. In other environments, they might not be. If you're building your own clusters on-premises, be sure to map and plan your fault domain layout correctly.

Upgrade domains are useful for modeling areas where software will be upgraded at the same time. Because of this, upgrade domains also often define the boundaries where software is taken down during planned upgrades. Upgrades of both Service Fabric and your services follow the same model. For more information on rolling upgrades, upgrade domains, and the Service Fabric health model that helps prevent unintended changes from affecting the cluster and your service, see:

You can visualize the layout of your cluster by using the cluster map provided in Service Fabric Explorer:

(Figure: Nodes spread across fault domains in Service Fabric Explorer)

Note

Modeling areas of failure, rolling upgrades, running many instances of your service code and state, placement rules to ensure that your services run across fault and upgrade domains, and built-in health monitoring are just some of the features that Service Fabric provides to keep normal operational issues and failures from turning into disasters.

Handling simultaneous hardware or software failures

We've been talking about single failures. As you can see, they're easy to handle for both stateless and stateful services, just by keeping more copies of the code (and state) running across fault and upgrade domains.

Multiple simultaneous random failures can also happen. These are more likely to lead to downtime or an actual disaster.

Stateless services

The instance count for a stateless service indicates the desired number of instances that need to be running. When any (or all) of the instances fail, Service Fabric responds by automatically creating replacement instances on other nodes. Service Fabric continues to create replacements until the service is back to its desired instance count.

For example, assume that the stateless service has an InstanceCount value of -1. This value means that one instance should be running on each node in the cluster. If some of those instances fail, Service Fabric will detect that the service is not in its desired state and will try to create the instances on the nodes where they're missing.
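
To illustrate, here is a hedged sketch of updating an existing stateless service so that it runs one instance on every node. The service name is a placeholder, and the sketch assumes the same System.Fabric client APIs as the earlier example.

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

class RunOnEveryNode
{
    static async Task Main()
    {
        var fabricClient = new FabricClient();

        // An InstanceCount of -1 means "one instance on every node in the cluster".
        var update = new StatelessServiceUpdateDescription
        {
            InstanceCount = -1
        };

        await fabricClient.ServiceManager.UpdateServiceAsync(
            new Uri("fabric:/MyApp/MyStatelessService"), // placeholder service name
            update);
    }
}
```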

Stateful services

There are two types of stateful services:

  • Stateful with persisted state.
  • Stateful with non-persisted state. (State is stored in memory.)

Recovery from failure of a stateful service depends on the type of the stateful service, how many replicas the service had, and how many replicas failed.

In a stateful service, incoming data is replicated between replicas (the primary and any active secondaries). If a majority of the replicas receive the data, the data is considered quorum committed. (For five replicas, three will be a quorum.) This means that at any point, there will be at least a quorum of replicas with the latest data. If replicas fail (say two out of five), we can use the quorum value to calculate whether we can recover. (Because the remaining three out of five replicas are still up, it's guaranteed that at least one replica will have the complete data.)

When a quorum of replicas fail, the partition is declared to be in a quorum loss state. Say a partition has five replicas, which means that at least three are guaranteed to have complete data. If a quorum (three out of five) of replicas fail, Service Fabric can't determine whether the remaining replicas (two out of five) have enough data to restore the partition. In cases where Service Fabric detects quorum loss, its default behavior is to prevent additional writes to the partition, declare quorum loss, and wait for a quorum of replicas to be restored.
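
As a quick illustration of the arithmetic above, a write quorum is a simple majority of the replica set, so for a replica set of size n the quorum is floor(n/2) + 1, and losing that many replicas at once puts the partition into quorum loss. A minimal sketch:

```csharp
static class QuorumMath
{
    // A write quorum is a simple majority of the replica set: floor(n/2) + 1.
    public static int QuorumSize(int replicaSetSize) => (replicaSetSize / 2) + 1;

    // The partition is in quorum loss when a majority of its replicas are down at the same time.
    public static bool IsQuorumLoss(int replicaSetSize, int failedReplicas) =>
        failedReplicas >= QuorumSize(replicaSetSize);
}

// QuorumMath.QuorumSize(5)      == 3     : three of five replicas must acknowledge a write.
// QuorumMath.IsQuorumLoss(5, 2) == false : three replicas remain, so at least one has all quorum-committed data.
// QuorumMath.IsQuorumLoss(5, 3) == true  : only two remain; Service Fabric declares quorum loss.
```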

Determining whether a disaster occurred for a stateful service and then managing it follows three stages:

  1. Determining if there has been quorum loss or not.

    Quorum loss is declared when a majority of the replicas of a stateful service are down at the same time.

  2. Determining if the quorum loss is permanent or not.

    Most of the time, failures are transient. Processes are restarted, nodes are restarted, virtual machines are relaunched, and network partitions heal. Sometimes, though, failures are permanent. Whether failures are permanent or not depends on whether the stateful service persists its state or whether it keeps it only in memory:

    • For services without persisted state, a failure of a quorum or more of replicas results immediately in permanent quorum loss. When Service Fabric detects quorum loss in a stateful non-persistent service, it immediately proceeds to step 3 by declaring (potential) data loss. Proceeding to data loss makes sense because Service Fabric knows that there's no point in waiting for the replicas to come back. Even if they recover, the data will be lost because of the non-persisted nature of the service.
    • For stateful persistent services, a failure of a quorum or more of replicas causes Service Fabric to wait for the replicas to come back and restore the quorum. This results in a service outage for any writes to the affected partition (or "replica set") of the service. However, reads might still be possible, with reduced consistency guarantees. The default amount of time that Service Fabric waits for the quorum to be restored is infinite, because proceeding is a (potential) data-loss event and carries other risks. This means that Service Fabric will not proceed to the next step unless an administrator takes action to declare data loss.
  3. Determining if data is lost, and restoring from backups.

    If quorum loss has been declared (either automatically or through administrative action), Service Fabric and the services move on to determining whether data was actually lost. At this point, Service Fabric also knows that the other replicas aren't coming back. That was the decision made when we stopped waiting for the quorum loss to resolve itself. The best course of action for the service is usually to freeze and wait for specific administrative intervention.

    When Service Fabric calls the OnDataLossAsync method, it's always because of suspected data loss. Service Fabric ensures that this call is delivered to the best remaining replica. This is whichever replica has made the most progress.

    The reason we always say suspected data loss is that it's possible that the remaining replica has all the same state as the primary did when quorum was lost. However, without that state to compare it to, there's no good way for Service Fabric or operators to know for sure.

    So what does a typical implementation of the OnDataLossAsync method do? (A minimal sketch follows this list.)

    1. The implementation logs that OnDataLossAsync has been triggered, and it fires off any necessary administrative alerts.

    2. Usually, the implementation pauses and waits for further decisions and manual actions to be taken. This is because even if backups are available, they might need to be prepared.

      For example, if two different services coordinate information, those backups might need to be modified to ensure that, after the restore happens, the information that those two services care about is consistent.

    3. Often there's some other telemetry or exhaust from the service. This metadata might be contained in other services or in logs. This information can be used as needed to determine whether there were any calls received and processed at the primary that were not present in the backup or replicated to this particular replica. These calls might need to be replayed or added to the backup before restoration is feasible.

    4. The implementation compares the remaining replica's state to that contained in any available backups. If you're using Service Fabric reliable collections, there are tools and processes available for doing so. The goal is to see whether the state within the replica is sufficient, and to see what the backup might be missing.

    5. After the comparison is done, and after the restore is completed (if necessary), the service code should return true if any state changes were made. If the replica determined that it was the best available copy of the state and made no changes, the code returns false.

      A value of true indicates that any other remaining replicas might now be inconsistent with this one. They will be dropped and rebuilt from this replica. A value of false indicates that no state changes were made, so the other replicas can keep what they have.
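
To make the flow above concrete, here is a minimal, hedged sketch of an OnDataLossAsync override on a Reliable Services stateful service. The alerting and operator-decision steps are placeholders (the helper method is hypothetical), and the restore path assumes a backup folder that an operator has already prepared and staged on the node.

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

public class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context) : base(context) { }

    // Called by Service Fabric on the best remaining replica when data loss is suspected.
    protected override async Task<bool> OnDataLossAsync(
        RestoreContext restoreCtx, CancellationToken cancellationToken)
    {
        // Step 1: log that OnDataLossAsync fired and raise administrative alerts (telemetry omitted here).

        // Steps 2-4: pause for an operator decision and for a prepared, consistent backup.
        // WaitForOperatorApprovedBackupAsync is a hypothetical helper standing in for that workflow.
        string backupFolder = await WaitForOperatorApprovedBackupAsync(cancellationToken);
        if (backupFolder == null)
        {
            // This replica was judged the best available copy of the state; no changes were made.
            return false;
        }

        // Step 5: restore from the staged backup. Returning true tells Service Fabric that state
        // changed, so any other remaining replicas are dropped and rebuilt from this one.
        var restoreDescription = new RestoreDescription(backupFolder, RestorePolicy.Force);
        await restoreCtx.RestoreAsync(restoreDescription, cancellationToken);
        return true;
    }

    private Task<string> WaitForOperatorApprovedBackupAsync(CancellationToken cancellationToken)
    {
        // Placeholder: a real implementation would block on an out-of-band administrative decision.
        return Task.FromResult<string>(null);
    }
}
```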

It's critically important that service authors practice potential data-loss and failure scenarios before services are deployed in production. To protect against the possibility of data loss, it's important to periodically back up the state of any of your stateful services to a geo-redundant store.
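
As one possible shape for such periodic backups, the following hedged sketch uses the Reliable Services BackupAsync API. The upload step to a geo-redundant store (for example, storage in another region) is a hypothetical helper, not a Service Fabric API.

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

public class BackupEnabledService : StatefulService
{
    public BackupEnabledService(StatefulServiceContext context) : base(context) { }

    // Take a full backup and hand the resulting local folder to the callback below.
    public async Task TakePeriodicBackupAsync(CancellationToken cancellationToken)
    {
        var backupDescription = new BackupDescription(BackupOption.Full, BackupCallbackAsync);
        await this.BackupAsync(backupDescription);
    }

    private async Task<bool> BackupCallbackAsync(BackupInfo backupInfo, CancellationToken cancellationToken)
    {
        // Copy backupInfo.Directory off the node to a geo-redundant store.
        // UploadFolderAsync is a hypothetical helper standing in for that upload.
        await UploadFolderAsync(backupInfo.Directory, cancellationToken);
        return true; // The backup was handled; local backup files can be cleaned up.
    }

    private Task UploadFolderAsync(string folder, CancellationToken cancellationToken)
    {
        // Placeholder for uploading the folder contents to remote storage.
        return Task.CompletedTask;
    }
}
```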

You must also ensure that you have the ability to restore the state. Because backups of many different services are taken at different times, you need to ensure that after a restore, your services have a consistent view of each other.

For example, consider a situation where one service generates a number and stores it, and then sends it to another service that also stores it. After a restore, you might discover that the second service has the number but the first does not, because its backup didn't include that operation.

If you find out that the remaining replicas are insufficient to continue in a data-loss scenario, and you can't reconstruct the service state from telemetry or exhaust, the frequency of your backups determines your best possible recovery point objective (RPO). Service Fabric provides many tools for testing various failure scenarios, including permanent quorum loss and data loss that requires restoration from a backup. These scenarios are included as a part of the testability tools in Service Fabric, managed by the Fault Analysis Service. For more information on those tools and patterns, see Introduction to the Fault Analysis Service.

Note

System services can also suffer quorum loss. The impact is specific to the service in question. For instance, quorum loss in the naming service affects name resolution, whereas quorum loss in the Failover Manager service blocks new service creation and failovers.

The Service Fabric system services follow the same pattern as your services for state management, but we don't recommend that you try to move them out of quorum loss and into potential data loss. Instead, we recommend that you seek support to find a solution that's targeted to your situation. It's usually preferable to simply wait until the down replicas return.

Troubleshooting quorum loss

Replicas might be down intermittently because of a transient failure. Wait for some time as Service Fabric tries to bring them up. If replicas have been down for more than an expected duration, follow these troubleshooting actions:

  • Replicas might be crashing. Check replica-level health reports and your application logs (a query sketch follows this list). Collect crash dumps and take the necessary actions to recover.
  • The replica process might have become unresponsive. Inspect your application logs to verify this. Collect process dumps and then stop the unresponsive process. Service Fabric will create a replacement process and will try to bring the replica back.
  • Nodes that host the replicas might be down. Restart the underlying virtual machine to bring the nodes up.
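
The following hedged sketch shows one way to start that investigation from code, listing the replicas of a partition and their reported status and health with the System.Fabric query and health clients. The partition ID is a placeholder.

```csharp
using System;
using System.Fabric;
using System.Fabric.Health;
using System.Threading.Tasks;

class ReplicaInspection
{
    static async Task Main()
    {
        var fabricClient = new FabricClient();
        var partitionId = Guid.Parse("00000000-0000-0000-0000-000000000000"); // placeholder

        // List the replicas of the partition and their reported status (Up, Down, and so on).
        var replicas = await fabricClient.QueryManager.GetReplicaListAsync(partitionId);
        foreach (var replica in replicas)
        {
            Console.WriteLine($"Replica {replica.Id}: status {replica.ReplicaStatus}, health {replica.HealthState}");
        }

        // Pull the aggregated health of the partition, which rolls up replica-level health reports.
        PartitionHealth health = await fabricClient.HealthManager.GetPartitionHealthAsync(partitionId);
        Console.WriteLine($"Partition health: {health.AggregatedHealthState}");
    }
}
```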

Sometimes, it might not be possible to recover replicas. For example, the drives have failed or the machines physically aren't responding. In these cases, Service Fabric needs to be told not to wait for replica recovery.

Do not use these methods if the potential data loss involved in bringing the service online is unacceptable. In that case, all efforts should be made toward recovering the physical machines.

The following actions might result in data loss. Check before you follow them.

Note

It's never safe to use these methods other than in a targeted way against specific partitions.

  • Use the Repair-ServiceFabricPartition -PartitionId cmdlet or the System.Fabric.FabricClient.ClusterManagementClient.RecoverPartitionAsync(Guid partitionId) API. This API allows specifying the ID of the partition to move out of quorum loss and into potential data loss. (A hedged example follows this list.)
  • If your cluster encounters frequent failures that cause services to go into a quorum-loss state and potential data loss is acceptable, specifying an appropriate QuorumLossWaitDuration value can help your service automatically recover. Service Fabric will wait for the provided QuorumLossWaitDuration value (the default is infinite) before performing recovery. We don't recommend this method, because it can cause unexpected data loss.
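
As a hedged illustration only (these calls can cause data loss, as noted above), the following sketch shows the C# form of both options: forcing a specific partition out of quorum loss with RecoverPartitionAsync, and setting QuorumLossWaitDuration on a stateful service so that recovery happens automatically after a bounded wait. The partition ID and service name are placeholders.

```csharp
using System;
using System.Fabric;
using System.Fabric.Description;
using System.Threading.Tasks;

class QuorumLossRecovery
{
    static async Task Main()
    {
        var fabricClient = new FabricClient();

        // Option 1: force a single, specific partition out of quorum loss.
        // This can cause data loss; use it only after attempts to recover the replicas have failed.
        var partitionId = Guid.Parse("00000000-0000-0000-0000-000000000000"); // placeholder
        await fabricClient.ClusterManager.RecoverPartitionAsync(partitionId);

        // Option 2: bound how long Service Fabric waits in quorum loss before recovering on its own.
        // Not recommended in general, because it can cause unexpected data loss.
        var update = new StatefulServiceUpdateDescription
        {
            QuorumLossWaitDuration = TimeSpan.FromMinutes(30)
        };
        await fabricClient.ServiceManager.UpdateServiceAsync(
            new Uri("fabric:/MyApp/MyStatefulService"), // placeholder service name
            update);
    }
}
```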

Availability of the Service Fabric cluster

In general, the Service Fabric cluster is a highly distributed environment with no single points of failure. A failure of any one node will not cause availability or reliability issues for the cluster, primarily because the Service Fabric system services follow the same guidelines provided earlier. That is, they always run with three or more replicas by default, and system services that are stateless run on all nodes.

The underlying Service Fabric networking and failure detection layers are fully distributed. Most system services can be rebuilt from metadata in the cluster, or know how to resynchronize their state from other places. The availability of the cluster can become compromised if system services get into quorum-loss situations like those described earlier. In these cases, you might not be able to perform certain operations on the cluster (like starting an upgrade or deploying new services), but the cluster itself is still up.

Services on a running cluster will keep running in these conditions unless they require writes to the system services to continue functioning. For example, if Failover Manager is in quorum loss, all services will continue to run. But any services that fail won't be able to automatically restart, because this requires the involvement of Failover Manager.

Failures of a datacenter or an Azure region

In rare cases, a physical datacenter can become temporarily unavailable because of a loss of power or network connectivity. In these cases, your Service Fabric clusters and services in that datacenter or Azure region will be unavailable. However, your data is preserved.

For clusters running in Azure, you can view updates on outages on the Azure status page. In the highly unlikely event that a physical datacenter is partially or fully destroyed, any Service Fabric clusters hosted there, or the services inside them, might be lost. This loss includes any state not backed up outside that datacenter or region.

There are two different strategies for surviving the permanent or sustained failure of a single datacenter or region:

  • Run separate Service Fabric clusters in multiple such regions, and use some mechanism for failover and failback between these environments. This sort of multiple-cluster active/active or active/passive model requires additional management and operations code. This model also requires coordination of backups from the services in one datacenter or region so that they're available in other datacenters or regions when one fails.

  • Run a single Service Fabric cluster that spans multiple datacenters or regions. The minimum supported configuration for this strategy is three datacenters or regions. The recommended number of regions or datacenters is five.

    This model requires a more complex cluster topology. However, the benefit is that failure of one datacenter or region is converted from a disaster into a normal failure. These failures can be handled by the mechanisms that work for clusters within a single region. Fault domains, upgrade domains, and Service Fabric placement rules ensure that workloads are distributed so that they tolerate normal failures.

    For more information on policies that can help operate services in this type of cluster, see Placement policies for Service Fabric services.

Random failures that lead to cluster failures

Service Fabric has the concept of seed nodes. These are nodes that maintain the availability of the underlying cluster.

Seed nodes help to ensure that the cluster stays up by establishing leases with other nodes and serving as tiebreakers during certain kinds of failures. If random failures remove a majority of the seed nodes in the cluster and they're not brought back quickly, your cluster automatically shuts down. The cluster then fails.

In Azure, the Service Fabric resource provider manages Service Fabric cluster configurations. By default, the resource provider distributes seed nodes across fault and upgrade domains for the primary node type. If the primary node type is marked as Silver or Gold durability, when you remove a seed node (either by scaling in your primary node type or by manually removing it), the cluster will try to promote another non-seed node from the primary node type's available capacity. This attempt will fail if you have less available capacity than your cluster reliability level requires for your primary node type.

In both standalone Service Fabric clusters and Azure, the primary node type is the one that runs the seeds. When you're defining a primary node type, Service Fabric automatically takes advantage of the number of nodes provided by creating up to nine seed nodes and up to seven replicas of each system service. If a set of random failures takes out a majority of those replicas simultaneously, the system services will enter quorum loss. If a majority of the seed nodes are lost, the cluster will shut down soon after.

Next steps