Back up and restore Reliable Services and Reliable Actors

Azure Service Fabric is a high-availability platform that replicates state across multiple nodes to maintain that availability. Thus, even if one node in the cluster fails, the services continue to be available. While the built-in redundancy provided by the platform may be sufficient for some scenarios, in certain cases it is desirable for the service to back up data to an external store.

Note

It is critical to back up and restore your data (and to test that the process works as expected) so that you can recover from data loss scenarios.

Note

Azure recommends using Periodic backup and restore to configure data backup of Reliable Stateful services and Reliable Actors.

For example, a service may want to back up data to protect against the following scenarios:

  • Permanent loss of an entire Service Fabric cluster.
  • Permanent loss of a majority of the replicas of a service partition.
  • Administrative errors in which the state is accidentally deleted or corrupted. For example, this may happen if an administrator with sufficient privilege erroneously deletes the service.
  • Bugs in the service that cause data corruption. For example, this may happen when a service code upgrade starts writing faulty data to a Reliable Collection. In such a case, both the code and the data may have to be reverted to an earlier state.
  • Offline data processing. It might be convenient to process data offline for business intelligence, separately from the service that generates the data.

The Backup/Restore feature allows services built on the Reliable Services API to create and restore backups. The backup APIs provided by the platform allow the state of a service partition to be backed up without blocking read or write operations. The restore APIs allow the state of a service partition to be restored from a chosen backup.

Types of backup

There are two backup options: Full and Incremental. A full backup contains all the data required to recreate the state of the replica: the checkpoints and all log records. Because it has both the checkpoints and the log, a full backup can be restored by itself.

The problem with full backups arises when the checkpoints are large. For example, a replica that has 16 GB of state will have checkpoints that add up to approximately 16 GB. With a Recovery Point Objective of five minutes, the replica needs to be backed up every five minutes. Each backup then has to copy 16 GB of checkpoints in addition to 50 MB of logs (configurable using CheckpointThresholdInMB).

Full backup example.

The solution to this problem is incremental backups, where the backup contains only the log records that changed since the last backup.
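To put rough numbers on the difference, the sketch below works through the arithmetic for the 16 GB example. It assumes, purely for illustration, that every backup carries the full 50 MB of log and that an incremental backup is no larger than that. `BackupCostEstimate` and its members are hypothetical names, not part of Service Fabric.

```csharp
using System;

// Hypothetical helper for back-of-the-envelope sizing; not a Service Fabric API.
public static class BackupCostEstimate
{
    // Total megabytes copied in one window when every backup is a full backup.
    public static long FullOnlyMb(long checkpointMb, long logMb, int backupsPerWindow)
        => backupsPerWindow * (checkpointMb + logMb);

    // Total megabytes copied when the first backup is full and the rest are incremental.
    // Assumption: each incremental backup copies at most logMb of log records.
    public static long FullThenIncrementalMb(long checkpointMb, long logMb, int backupsPerWindow)
        => backupsPerWindow <= 0
            ? 0
            : (checkpointMb + logMb) + (backupsPerWindow - 1) * logMb;
}
```

With a 16 GB checkpoint (16,384 MB), 50 MB of log, and 12 backups per hour (one every five minutes), `FullOnlyMb` gives 197,208 MB per hour, while `FullThenIncrementalMb` gives 16,984 MB under these assumptions.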

Incremental backup example.

Because incremental backups contain only the changes since the last backup (they do not include the checkpoints), they tend to be faster, but they cannot be restored on their own. To restore an incremental backup, the entire backup chain is required. A backup chain starts with a full backup and is followed by a number of contiguous incremental backups.
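The contiguity requirement can be illustrated with a small check. This is a minimal sketch over a hypothetical model in which each backup records its kind and its position in the chain; the metadata that Service Fabric actually stores is different, and `BackupEntry`, `BackupKind`, and `BackupChain` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of one backup in a chain; Service Fabric's own metadata differs.
public enum BackupKind { Full, Incremental }

public sealed record BackupEntry(BackupKind Kind, int Sequence);

public static class BackupChain
{
    // A chain is restorable when it starts with a full backup (sequence 0)
    // and every following entry is an incremental backup with the next sequence number.
    public static bool IsRestorable(IReadOnlyList<BackupEntry> chain)
    {
        if (chain.Count == 0 || chain[0].Kind != BackupKind.Full || chain[0].Sequence != 0)
        {
            return false;
        }

        for (int i = 1; i < chain.Count; i++)
        {
            if (chain[i].Kind != BackupKind.Incremental || chain[i].Sequence != i)
            {
                return false;
            }
        }

        return true;
    }
}
```

A chain made of a full backup plus incrementals 1 and 2 passes this check. A chain with the full backup and incrementals 1 and 3 does not, because incremental 2 is missing.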

Back up Reliable Services

The service author has full control over when backups are made and where they are stored.

To start a backup, the service needs to invoke the inherited member function BackupAsync.
Backups can be made only from primary replicas, and they require write status to be granted.

As shown below, BackupAsync takes a BackupDescription object, in which you can specify a full or incremental backup as well as a callback function, Func<BackupInfo, CancellationToken, Task<bool>>, that is invoked when the backup folder has been created locally and is ready to be moved to external storage.


BackupDescription myBackupDescription = new BackupDescription(BackupOption.Incremental,this.BackupCallbackAsync);

await this.BackupAsync(myBackupDescription);

A request to take an incremental backup can fail with FabricMissingFullBackupException. This exception indicates that one of the following is true:

  • The replica has never taken a full backup since it became primary.
  • Some of the log records written since the last backup have been truncated.
  • The replica has passed the MaxAccumulatedBackupLogSizeInMB limit.

You can increase the likelihood of being able to take incremental backups by configuring MinLogSizeInMB or TruncationThresholdFactor. Increasing these values increases the disk usage per replica. For more information, see Reliable Services Configuration.
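One common way to handle this exception is to fall back to a full backup, which starts a new backup chain. The following is a minimal sketch of that pattern, not a prescribed implementation: the class name `MyStatefulService` and the method `TakeBackupAsync` are hypothetical, the callback body is a placeholder, and the code compiles only against the Service Fabric SDK packages.

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

// Hypothetical service used only to illustrate the fallback pattern.
internal sealed class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context)
        : base(context)
    {
    }

    // Tries an incremental backup first and falls back to a full backup
    // when the runtime reports that no usable full backup exists.
    private async Task TakeBackupAsync()
    {
        try
        {
            await this.BackupAsync(
                new BackupDescription(BackupOption.Incremental, this.BackupCallbackAsync));
        }
        catch (FabricMissingFullBackupException)
        {
            await this.BackupAsync(
                new BackupDescription(BackupOption.Full, this.BackupCallbackAsync));
        }
    }

    // Placeholder callback: a real service would move backupInfo.Directory
    // to external storage here and return false if that fails.
    private Task<bool> BackupCallbackAsync(BackupInfo backupInfo, CancellationToken cancellationToken)
    {
        return Task.FromResult(true);
    }
}
```

Whether to fall back automatically or to surface the failure is a design choice. Falling back keeps the Recovery Point Objective intact at the cost of an occasional large backup.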

BackupInfo provides information about the backup, including the location of the folder where the runtime saved the backup (BackupInfo.Directory). The callback function can move BackupInfo.Directory to an external store or another location. The function also returns a bool that indicates whether it successfully moved the backup folder to its target location.

The following code demonstrates how the BackupCallbackAsync method can be used to upload the backup to Azure Storage:

private async Task<bool> BackupCallbackAsync(BackupInfo backupInfo, CancellationToken cancellationToken)
{
    var backupId = Guid.NewGuid();

    await externalBackupStore.UploadBackupFolderAsync(backupInfo.Directory, backupId, cancellationToken);

    return true;
}

In the preceding example, ExternalBackupStore is a sample class that is used to interface with Azure Blob storage, and UploadBackupFolderAsync is the method that compresses the folder and places it in the Azure Blob store.

Note that:

  • There can be only one backup operation in flight per replica at any given time. Making more than one BackupAsync call at a time throws FabricBackupInProgressException, which limits in-flight backups to one.
  • If a replica fails over while a backup is in progress, the backup may not have completed. Once the failover finishes, it is the service's responsibility to restart the backup by invoking BackupAsync as necessary.

Restore Reliable Services

In general, the cases in which you might need to perform a restore operation fall into one of these categories:

  • The service partition lost data. For example, the disks for two out of three replicas of a partition (including the primary replica) are corrupted or wiped. The new primary may need to restore data from a backup.
  • The entire service is lost. For example, an administrator removes the entire service, so both the service and the data need to be restored.
  • The service replicated corrupt application data (for example, because of an application bug). In this case, the service has to be upgraded or reverted to remove the cause of the corruption, and non-corrupt data has to be restored.

While many approaches are possible, the following sections offer some examples of using RestoreAsync to recover from the scenarios above.

Partition data loss in Reliable Services

In this case, the runtime automatically detects the data loss and invokes the OnDataLossAsync API.

The service author needs to do the following to recover:

  • Override the virtual base class method OnDataLossAsync.
  • Find the latest backup in the external location that contains the service's backups.
  • Download the latest backup (and uncompress it into the backup folder if it was compressed).
  • The OnDataLossAsync method provides a RestoreContext. Call the RestoreAsync API on the provided RestoreContext.
  • Return true if the restoration succeeded.

The following is an example implementation of the OnDataLossAsync method:

protected override async Task<bool> OnDataLossAsync(RestoreContext restoreCtx, CancellationToken cancellationToken)
{
    var backupFolder = await this.externalBackupStore.DownloadLastBackupAsync(cancellationToken);

    var restoreDescription = new RestoreDescription(backupFolder);

    await restoreCtx.RestoreAsync(restoreDescription);

    return true;
}

The RestoreDescription passed to the RestoreContext.RestoreAsync call contains a member called BackupFolderPath. When restoring a single full backup, set BackupFolderPath to the local path of the folder that contains the full backup. When restoring a full backup and a number of incremental backups, set BackupFolderPath to the local path of the folder that contains both the full backup and all the incremental backups. The RestoreAsync call can throw FabricMissingFullBackupException if the BackupFolderPath provided does not contain a full backup. It can also throw ArgumentException if BackupFolderPath contains a broken chain of incremental backups, for example, if it contains the full backup, the first incremental backup, and the third incremental backup, but not the second.

Note

The RestorePolicy is set to Safe by default. This means that the RestoreAsync API fails with ArgumentException if it detects that the backup folder contains state that is older than or equal to the state contained in this replica. RestorePolicy.Force can be used to skip this safety check. The policy is specified as part of RestoreDescription.
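As a brief illustration of where the policy goes, the sketch below builds a RestoreDescription with RestorePolicy.Force and passes it to RestoreAsync. The helper class and method names are hypothetical, and the snippet compiles only against the Service Fabric SDK packages.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;

// Hypothetical helper; intended to be called from an OnDataLossAsync override.
public static class RestoreExamples
{
    // Restores from backupFolderPath even when the backup is not newer
    // than the state currently held by the replica.
    public static Task RestoreWithForceAsync(
        RestoreContext restoreContext,
        string backupFolderPath,
        CancellationToken cancellationToken)
    {
        var description = new RestoreDescription(backupFolderPath, RestorePolicy.Force);
        return restoreContext.RestoreAsync(description, cancellationToken);
    }
}
```

Because Force discards the safety check, it is best reserved for cases where rolling back to older state is the intent, such as recovering from corrupt application data.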

Deleted or lost service

If a service is removed, you must first re-create the service before the data can be restored. It is important to create the service with the same configuration, for example, the same partitioning scheme, so that the data can be restored seamlessly. Once the service is up, the API to restore data (OnDataLossAsync above) has to be invoked on every partition of the service. One way of achieving this is to use FabricClient.TestManagementClient.StartPartitionDataLossAsync on every partition.

From this point, the implementation is the same as in the scenario above. Each partition needs to restore the latest relevant backup from the external store. One caveat is that the partition ID may have changed, because the runtime creates partition IDs dynamically. The service therefore needs to store the appropriate partition information and the service name alongside each backup, so that it can identify the correct latest backup to restore for each partition.
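One way to make backups findable after the partition IDs change is to name them by values that survive service re-creation. The sketch below assumes a ranged (Int64) partitioning scheme and a folder-like external store; `BackupNaming`, `PrefixFor`, and the prefix layout are all hypothetical choices made for this example.

```csharp
using System;

// Hypothetical naming helper; the layout of the external store is an assumption.
public static class BackupNaming
{
    // Builds a storage prefix from values that survive service re-creation:
    // the service name and the partition's key range, not its partition ID.
    public static string PrefixFor(Uri serviceName, long lowKey, long highKey)
    {
        string service = serviceName.AbsolutePath.Trim('/').Replace('/', '_');
        return $"{service}/{lowKey}_{highKey}";
    }
}
```

Inside a service, the key range would typically come from the partition information exposed by the runtime (for ranged partitions, the LowKey and HighKey of Int64RangePartitionInformation). A service that uses named partitions could use the partition name in place of the key range.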

Note

Using FabricClient.ServiceManager.InvokeDataLossAsync on each partition to restore the entire service is not recommended, because it may corrupt your cluster state.

Replication of corrupt application data

If a newly deployed application upgrade has a bug, it may corrupt data. For example, an application upgrade may start to update every phone number record in a Reliable Dictionary with an invalid area code. In this case, the invalid phone numbers are replicated, because Service Fabric is not aware of the nature of the data being stored.

The first thing to do after you detect a bug that causes data corruption is to freeze the service at the application level and, if possible, upgrade to a version of the application code that does not have the bug. However, even after the service code is fixed, the data may still be corrupt and may need to be restored. In such cases, restoring the latest backup may not be sufficient, because the latest backups may also be corrupt. You have to find the last backup that was made before the data was corrupted.

If you are not sure which backups are corrupt, you can deploy a new Service Fabric cluster and restore the backups of the affected partitions, just as in the "Deleted or lost service" scenario above. For each partition, restore the backups starting from the most recent and working toward the oldest. Once you find a backup that does not have the corruption, move or delete all backups of this partition that are more recent than that backup. Repeat this process for each partition. Now, when OnDataLossAsync is called on the partition in the production cluster, the latest backup found in the external store is the one picked by this process.
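The per-partition search described above can be sketched as follows. This is an illustration only: `StoredBackup` is a hypothetical record, and the `isCorrupt` callback stands in for whatever verification is performed after restoring a backup into the test cluster.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record describing one stored backup of a partition.
public sealed record StoredBackup(string Name, DateTime TakenAtUtc);

public static class BackupTriage
{
    // Walks the backups from newest to oldest and returns the first one that
    // passes the caller's corruption check, plus the newer ones to move or delete.
    public static (StoredBackup? LastGood, List<StoredBackup> Newer) FindLastGood(
        IEnumerable<StoredBackup> backups,
        Func<StoredBackup, bool> isCorrupt)
    {
        var newer = new List<StoredBackup>();
        foreach (StoredBackup backup in backups.OrderByDescending(b => b.TakenAtUtc))
        {
            if (!isCorrupt(backup))
            {
                return (backup, newer);
            }

            newer.Add(backup);
        }

        // Every backup failed the check; nothing safe to restore was found.
        return (null, newer);
    }
}
```

The returned `Newer` list holds the backups that would be moved or deleted from the external store so that the last good backup becomes the latest one.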

The steps in the "Deleted or lost service" section can then be used to restore the state of the service to what it was before the buggy code corrupted it.

Note that:

  • When you restore, there is a chance that the backup being restored is older than the state of the partition before the data was lost. Because of this, you should restore only as a last resort, to recover as much data as possible.
  • The string that represents the backup folder path, and the paths of files inside the backup folder, can be longer than 255 characters, depending on the FabricDataRoot path and the length of the Application Type name. This can cause some .NET methods, such as Directory.Move, to throw PathTooLongException. One workaround is to call kernel32 APIs, such as CopyFile, directly.

Back up and restore Reliable Actors

The Reliable Actors framework is built on top of Reliable Services. The ActorService, which hosts the actors, is a stateful reliable service. Hence, all the backup and restore functionality available in Reliable Services is also available to Reliable Actors (except behaviors that are specific to a state provider). Because backups are taken per partition, the state of all actors in that partition is backed up; restoration likewise happens per partition. To perform backup and restore, the service owner should create a custom actor service class that derives from the ActorService class and then back up and restore in the same way as described for Reliable Services in the previous sections.

class MyCustomActorService : ActorService
{
    public MyCustomActorService(StatefulServiceContext context, ActorTypeInformation actorTypeInfo)
          : base(context, actorTypeInfo)
    {
    }

    //
    // Method overrides and other code.
    //
}

When you create a custom actor service class, you also need to register it when you register the actor.

ActorRuntime.RegisterActorAsync<MyActor>(
    (context, typeInfo) => new MyCustomActorService(context, typeInfo)).GetAwaiter().GetResult();

The default state provider for Reliable Actors is KvsActorStateProvider. Incremental backup is not enabled by default for KvsActorStateProvider. You can enable it by creating a KvsActorStateProvider with the appropriate setting in its constructor and then passing it to the ActorService constructor, as shown in the following code snippet:

class MyCustomActorService : ActorService
{
    public MyCustomActorService(StatefulServiceContext context, ActorTypeInformation actorTypeInfo)
          : base(context, actorTypeInfo, null, null, new KvsActorStateProvider(true)) // Enable incremental backup
    {
    }

    //
    // Method overrides and other code.
    //
}

After incremental backup has been enabled, taking an incremental backup can fail with FabricMissingFullBackupException for one of the following reasons, in which case you need to take a full backup before taking further incremental backups:

  • The replica has never taken a full backup since it became primary.
  • Some of the log records were truncated since the last backup was taken.

When incremental backup is enabled, KvsActorStateProvider does not use a circular buffer to manage its log records; instead, it truncates them periodically. If the user takes no backup for a period of 45 minutes, the system automatically truncates the log records. This interval can be configured by specifying logTruncationIntervalInMinutes in the KvsActorStateProvider constructor (in the same way as enabling incremental backup). The log records may also be truncated if the primary replica needs to build another replica by sending all of its data.

When restoring from a backup chain, as with Reliable Services, BackupFolderPath should contain subdirectories: one subdirectory containing the full backup and the other subdirectories containing the incremental backups. The restore API throws FabricException with an appropriate error message if the backup chain validation fails.

Note

KvsActorStateProvider currently ignores the RestorePolicy.Safe option. Support for this feature is planned for an upcoming release.

Testing backup and restore

It is important to ensure that critical data is being backed up and can be restored. You can verify this by invoking the Start-ServiceFabricPartitionDataLoss cmdlet in PowerShell, which induces data loss in a particular partition, to test whether the backup and restore functionality of your service works as expected. It is also possible to invoke data loss programmatically and restore from that event.
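For the programmatic route, a test harness might look roughly like the sketch below. It relies on the FabricClient.TestManagementClient data-loss APIs in the System.Fabric client library; the class name, the five-second polling interval, and the error handling are assumptions made for this example, and the signatures should be checked against the SDK version in use. It is intended for test clusters only.

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test helper; run it only against a test cluster.
public static class DataLossDrill
{
    // Induces full data loss on one partition and waits for the command to finish.
    public static async Task InduceAsync(
        Uri serviceName, Guid partitionId, CancellationToken cancellationToken)
    {
        using (var fabricClient = new FabricClient())
        {
            Guid operationId = Guid.NewGuid();
            PartitionSelector selector = PartitionSelector.PartitionIdOf(serviceName, partitionId);

            await fabricClient.TestManagementClient.StartPartitionDataLossAsync(
                operationId, selector, DataLossMode.FullDataLoss);

            // Poll until the data-loss command reports a terminal state.
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();

                PartitionDataLossProgress progress =
                    await fabricClient.TestManagementClient.GetPartitionDataLossProgressAsync(operationId);

                if (progress.State == TestCommandProgressState.Completed)
                {
                    return;
                }

                if (progress.State == TestCommandProgressState.Faulted)
                {
                    throw new InvalidOperationException("The data loss command faulted.");
                }

                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
    }
}
```

After the command completes, the test would verify that the service's OnDataLossAsync ran and that the restored state matches what was backed up.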

Note

You can find a sample implementation of backup and restore functionality in the Web Reference App on GitHub. See the Inventory.Service service for more details.

Under the hood: more details on backup and restore

Here are some more details on backup and restore.

Backup

The Reliable State Manager provides the ability to create consistent backups without blocking any read or write operations. To do so, it uses a checkpoint and log persistence mechanism. The Reliable State Manager takes fuzzy (lightweight) checkpoints at certain points to relieve pressure on the transactional log and improve recovery times. When BackupAsync is called, the Reliable State Manager instructs all Reliable objects to copy their latest checkpoint files to a local backup folder. Then, the Reliable State Manager copies all log records, from the "start pointer" to the latest log record, into the backup folder. Because all log records up to the latest one are included in the backup, and because the Reliable State Manager preserves write-ahead logging, it guarantees that all committed transactions (those for which CommitAsync has returned successfully) are included in the backup.

Any transaction that commits after BackupAsync has been called may or may not be in the backup. Once the platform has populated the local backup folder (that is, the runtime has completed the local backup), the service's backup callback is invoked. This callback is responsible for moving the backup folder to an external location such as Azure Storage.

Restore

The Reliable State Manager provides the ability to restore from a backup by using the RestoreAsync API.
The RestoreAsync method on RestoreContext can be called only inside the OnDataLossAsync method. The bool returned by OnDataLossAsync indicates whether the service restored its state from an external source. If OnDataLossAsync returns true, Service Fabric rebuilds all other replicas from this primary. Service Fabric ensures that replicas that will receive the OnDataLossAsync call first transition to the primary role but are not granted read status or write status. This implies that for StatefulService implementers, RunAsync is not called until OnDataLossAsync finishes successfully. OnDataLossAsync is then invoked on the new primary. Until a service completes this API successfully (by returning true or false) and finishes the relevant reconfiguration, the API keeps being called, one call at a time.

RestoreAsync first drops all existing state in the primary replica on which it was called. Then the Reliable State Manager creates all the Reliable objects that exist in the backup folder. Next, the Reliable objects are instructed to restore from their checkpoints in the backup folder. Finally, the Reliable State Manager recovers its own state from the log records in the backup folder and performs recovery. As part of the recovery process, operations starting from the "starting point" that have committed log records in the backup folder are replayed to the Reliable objects. This step ensures that the recovered state is consistent.

Next steps