How to set up change and deletion detection for blobs in Azure Cognitive Search indexing

After an initial search index is created, you might want subsequent indexer jobs to pick up only those documents that have been created or deleted since the initial run. For search content that originates from Azure Blob storage, change detection occurs automatically when you use a schedule to trigger indexing. By default, the service reindexes only the changed blobs, as determined by each blob's LastModified timestamp. Unlike other data sources supported by search indexers, blobs always have a timestamp, so you don't need to set up a change detection policy manually.

Although change detection is a given, deletion detection is not. To detect deleted documents, use a "soft delete" approach. If you delete the blobs outright, the corresponding documents are not removed from the search index.

There are two ways to implement the soft delete approach. Both are described below.

Native blob soft delete (preview)

Important

Support for native blob soft delete is in preview. Preview functionality is provided without a service level agreement and is not recommended for production workloads. For more information, see Supplemental Terms of Use for Microsoft Azure Previews. The REST API version 2020-06-30-Preview provides this feature. There is currently no portal or .NET SDK support.

Note

When you use the native blob soft delete policy, the document keys for the documents in your index must be either a blob property or blob metadata.

In this method, you use the native blob soft delete feature offered by Azure Blob storage. If native blob soft delete is enabled on your storage account, your data source has a native soft delete policy set, and the indexer finds a blob that has transitioned to a soft deleted state, the indexer removes that document from the index. The native blob soft delete policy is not supported when indexing blobs from Azure Data Lake Storage Gen2.

Use the following steps:

  1. Enable native soft delete for Azure Blob storage. We recommend setting the retention policy to a value that's much higher than your indexer interval schedule. That way, if there's an issue running the indexer or if you have a large number of documents to index, the indexer has plenty of time to eventually process the soft deleted blobs. Azure Cognitive Search indexers delete a document from the index only if they process the blob while it's in a soft deleted state.

  2. Configure a native blob soft deletion detection policy on the data source. An example follows these steps. Because this feature is in preview, you must use the preview REST API.

  3. Run the indexer or set the indexer to run on a schedule. When the indexer runs and processes the soft deleted blob, the document is removed from the index.

    PUT https://[service name].search.azure.cn/datasources/blob-datasource?api-version=2020-06-30-Preview
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : null },
        "dataDeletionDetectionPolicy" : {
            "@odata.type" :"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        }
    }
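
The same request can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the function name `build_datasource_request` and its parameters are hypothetical, and it only builds the request object without sending it. The URL, headers, and body mirror the REST example above.

```python
import json
import urllib.request


def build_datasource_request(service_url, admin_key, connection_string, container):
    """Build (but do not send) the PUT request that creates the data source.

    service_url is an assumption: the base URL of your search service,
    for example https://[service name].search.azure.cn
    """
    body = {
        "name": "blob-datasource",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container, "query": None},
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        },
    }
    url = f"{service_url}/datasources/blob-datasource?api-version=2020-06-30-Preview"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )
```

To submit the request, pass the returned object to `urllib.request.urlopen`.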
    

Reindexing un-deleted blobs (using native soft delete policies)

If you delete a blob from Azure Blob storage with native soft delete enabled on your storage account, the blob transitions to a soft deleted state, giving you the option to un-delete that blob within the retention period. If you reverse a deletion after the indexer has processed it, the indexer will not always index the restored blob. This is because the indexer determines which blobs to index based on each blob's LastModified timestamp. When a soft deleted blob is un-deleted, its LastModified timestamp is not updated, so if the indexer has already processed blobs with more recent LastModified timestamps, it won't reindex the un-deleted blob.
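
The effect can be illustrated with a simplified model. The sketch below assumes the indexer keeps a single "high-water mark" equal to the newest LastModified value it has processed; that bookkeeping, the `Blob` class, and `blobs_to_index` are illustrative assumptions, not the service's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Blob:
    name: str
    last_modified: datetime


def blobs_to_index(blobs, high_water_mark):
    """Return only the blobs modified after the high-water mark."""
    return [b for b in blobs if b.last_modified > high_water_mark]


t0 = datetime(2021, 1, 1, tzinfo=timezone.utc)
restored = Blob("report.pdf", t0)                   # un-deleted; timestamp unchanged
newer = Blob("notes.txt", t0 + timedelta(hours=2))  # already indexed in an earlier run

# The indexer has already processed notes.txt, so the mark sits at t0 + 2h.
high_water_mark = newer.last_modified

# The restored blob is older than the mark, so it is not picked up.
skipped = blobs_to_index([restored, newer], high_water_mark)

# Once LastModified moves past the mark, the blob is selected again.
restored.last_modified = t0 + timedelta(hours=3)
picked = blobs_to_index([restored, newer], high_water_mark)
```

In this model `skipped` is empty, and `picked` contains only the restored blob after its timestamp is refreshed.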

To make sure that an un-deleted blob is reindexed, you need to update the blob's LastModified timestamp. One way to do this is to resave the blob's metadata. You don't need to change the metadata; resaving it updates the blob's LastModified timestamp, which tells the indexer to reindex the blob.
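
A minimal Python sketch of resaving metadata follows. The helper name `touch_blob` is hypothetical. It assumes `blob_client` behaves like `BlobClient` from the `azure-storage-blob` package, which exposes `get_blob_properties()` and `set_blob_metadata()`; creating and authenticating the client is left out.

```python
def touch_blob(blob_client):
    """Resave a blob's existing metadata unchanged to refresh LastModified.

    blob_client is assumed to be an azure.storage.blob.BlobClient
    (or any object with the same two methods).
    """
    # Read the current metadata so nothing is lost when it is written back.
    metadata = dict(blob_client.get_blob_properties().metadata or {})
    # Writing the same metadata back is enough to update LastModified.
    blob_client.set_blob_metadata(metadata=metadata)
    return metadata
```

Call `touch_blob` after the blob has been un-deleted, then run the indexer again.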

Soft delete using custom metadata

In this method, you use a blob's metadata to indicate when a document should be removed from the search index. This method requires two separate actions: deleting the search document from the index, followed by deleting the blob in Azure Storage.

Use the following steps:

  1. Add a custom metadata key-value pair to the blob to indicate to Azure Cognitive Search that the blob is logically deleted.

  2. Configure a soft deletion column detection policy on the data source. An example follows these steps.

  3. Once the indexer has processed the blob and deleted the document from the index, you can delete the blob in Azure Blob storage.

For example, the following policy considers a blob to be deleted if it has a metadata property IsDeleted with the value true:

    PUT https://[service name].search.azure.cn/datasources/blob-datasource?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : null },
        "dataDeletionDetectionPolicy" : {
            "@odata.type" :"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName" : "IsDeleted",
            "softDeleteMarkerValue" : "true"
        }
    }
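
Step 1 above, adding the metadata marker to a blob, might look like the following Python sketch. The helper name `set_soft_delete_flag` is hypothetical, and `blob_client` is assumed to behave like `BlobClient` from the `azure-storage-blob` package. The key `IsDeleted` and the value `true` match the policy example above. Existing metadata is read first and written back, because setting blob metadata replaces the whole set.

```python
def set_soft_delete_flag(blob_client, deleted, key="IsDeleted"):
    """Set the soft-delete marker in a blob's metadata.

    blob_client is assumed to be an azure.storage.blob.BlobClient
    (or any object with the same two methods).
    """
    # Preserve whatever metadata the blob already carries.
    metadata = dict(blob_client.get_blob_properties().metadata or {})
    # Metadata values are strings; "true" matches softDeleteMarkerValue.
    metadata[key] = "true" if deleted else "false"
    blob_client.set_blob_metadata(metadata=metadata)
    return metadata
```

Calling the helper with `deleted=True` marks the blob for removal on the next indexer run. Calling it with `deleted=False` clears the marker.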

Reindexing un-deleted blobs (using custom metadata)

After an indexer processes a deleted blob and removes the corresponding search document from the index, it won't revisit that blob if you later restore it and the blob's LastModified timestamp is older than the last indexer run.

To reindex that document, change the blob's soft delete metadata value so that it no longer matches the "softDeleteMarkerValue" (for example, set IsDeleted to false for that blob), and then rerun the indexer.

Next steps