How to index encrypted blobs using blob indexers and skillsets in Azure Cognitive Search

This article shows you how to use Azure Cognitive Search to index documents that were previously encrypted within Azure Blob Storage using Azure Key Vault. Normally, an indexer cannot extract content from encrypted files because it doesn't have access to the encryption key. However, by using the DecryptBlobFile custom skill followed by the DocumentExtractionSkill, you can provide controlled access to the key so that the files are decrypted and their content is then extracted. This lets you index these documents without compromising the encryption status of the stored documents.

Starting with previously encrypted whole documents (unstructured text) such as PDF, HTML, DOCX, and PPTX in Azure Blob Storage, this guide uses Postman and the Search REST APIs to perform the following tasks:

  • Define a pipeline that decrypts the documents and extracts text from them.
  • Define an index to store the output.
  • Execute the pipeline to create and load the index.
  • Explore results using full text search and a rich query syntax.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

This example assumes that you have already uploaded your files to Azure Blob Storage and encrypted them in the process. If you need help with the initial upload and encryption of your files, see this tutorial.

1 - Create services and collect credentials

Set up the custom skill

This example uses the sample DecryptBlobFile project from the Azure Search Power Skills GitHub repository. In this section, you deploy the skill to an Azure Function so that it can be used in a skillset. A built-in deployment script creates an Azure Function resource whose name starts with psdbf-function-app- and loads the skill. You'll be prompted to provide a subscription and resource group. Be sure to choose the same subscription that your Azure Key Vault instance lives in.

Operationally, the DecryptBlobFile skill takes the URL and SAS token for each blob as inputs, and it outputs the downloaded, decrypted file using the file reference contract that Azure Cognitive Search expects. Recall that DecryptBlobFile needs the encryption key to perform the decryption. As part of setup, you'll also create an access policy that grants the DecryptBlobFile function access to the encryption key in Azure Key Vault.
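As a rough illustration of the contract described above, the following Python sketch builds a skillset body in which a custom Web API skill passes each blob's URL and SAS token to the function, and a DocumentExtractionSkill then reads the decrypted file. The function route (/api/decrypt-blob-file), the input and output names (blobUrl, sasToken, decrypted_file_data, extracted_content), and the placeholder URL and key are assumptions made for illustration; compare them with the skillset request in the Postman collection before relying on them.

```python
import json


def build_skillset(function_uri: str, function_code: str) -> dict:
    """Build a skillset body: decrypt each blob, then extract its text."""
    decrypt_skill = {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Downloads and decrypts the blob via DecryptBlobFile",
        # Assumed route and query parameter; confirm against your function.
        "uri": f"{function_uri}/api/decrypt-blob-file?code={function_code}",
        "httpMethod": "POST",
        "context": "/document",
        # Assumed input names: the blob URL and its SAS token.
        "inputs": [
            {"name": "blobUrl", "source": "/document/metadata_storage_path"},
            {"name": "sasToken", "source": "/document/metadata_storage_sas_token"},
        ],
        # Assumed output name: a file reference to the decrypted file.
        "outputs": [
            {"name": "decrypted_file_data", "targetName": "decrypted_file_data"}
        ],
    }
    extract_skill = {
        "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
        "parsingMode": "default",
        "dataToExtract": "contentAndMetadata",
        "context": "/document",
        # Reads the file reference produced by the decryption skill.
        "inputs": [
            {"name": "file_data", "source": "/document/decrypted_file_data"}
        ],
        "outputs": [{"name": "content", "targetName": "extracted_content"}],
    }
    return {
        "description": "Decrypt encrypted blobs, then extract their text",
        "skills": [decrypt_skill, extract_skill],
    }


# Hypothetical function URL and host key, for illustration only.
skillset = build_skillset("https://psdbf-function-app-example.invalid", "HOST_KEY")
print(json.dumps(skillset, indent=2))
```

The order of the two skills mirrors the pipeline: the extraction skill takes its input from the path that the decryption skill writes to, so it can only run on decrypted data.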

  1. Select the Deploy to Azure button on the DecryptBlobFile landing page, which opens the provided Resource Manager template in the Azure portal.

  2. Select the subscription where your Azure Key Vault instance exists (this guide will not work if you select a different subscription), and either select an existing resource group or create a new one (if you create a new one, you will also need to select a region to deploy to).

  3. Select Review + create, make sure you agree to the terms, and then select Create to deploy the Azure Function.

    ARM template in portal

  4. Wait for the deployment to finish.

  5. Navigate to your Azure Key Vault instance in the portal. Create an access policy in the Azure Key Vault that grants key access to the custom skill.

    1. Under Settings, select Access policies, and then select Add access policy.

      Key Vault add access policy

    2. Under Configure from template, select Azure Data Lake Storage or Azure Storage.

    3. For the principal, select the Azure Function instance that you deployed. You can search for it using the resource prefix that was used to create it in step 2, which has a default value of psdbf-function-app.

    4. Do not select anything for the authorized application option.

      Key Vault add access policy template

    5. Select Save on the Access policies page before navigating away. The access policy is not added until you save.

      Key Vault save access policy

  6. Navigate to the psdbf-function-app function in the portal, and note the following properties, which you will need later in the guide:

    1. The function URL, which can be found under Essentials on the main page for the function.

      Function URL

    2. The host key code, which can be found by navigating to App keys, selecting the default key to show it, and copying the value.

      Function host key code

Cognitive Services

AI enrichment and skillset execution are backed by Cognitive Services, including Text Analytics and Computer Vision for natural language and image processing. If your objective were to complete an actual prototype or project, you would at this point provision Cognitive Services (in the same region as Azure Cognitive Search) so that you can attach it to indexing operations.

For this exercise, however, you can skip resource provisioning because Azure Cognitive Search can connect to Cognitive Services behind the scenes and give you 20 free transactions per indexer run. After it processes 20 documents, the indexer fails unless a Cognitive Services key is attached to the skillset. For larger projects, plan on provisioning Cognitive Services at the pay-as-you-go S0 tier. For more information, see Attach Cognitive Services. Note that a Cognitive Services key is required to run a skillset over more than 20 documents even if none of your selected cognitive skills connect to Cognitive Services (as is the case with the provided skillset if no skills are added to it).
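If you do need to go past the 20-document limit, attaching a key amounts to adding a cognitiveServices section to the skillset body. The sketch below shows one way to do that in Python; the helper name attach_cognitive_services and the placeholder key are hypothetical, and the minimal skillset passed in stands in for the one from the Postman collection.

```python
import copy


def attach_cognitive_services(skillset: dict, key: str) -> dict:
    """Return a copy of the skillset body with a Cognitive Services key attached."""
    updated = copy.deepcopy(skillset)  # leave the caller's dict untouched
    updated["cognitiveServices"] = {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": "Billable Cognitive Services resource",
        "key": key,
    }
    return updated


# Minimal stand-in for the real skillset body; the key is a placeholder.
base_skillset = {"description": "Decrypt and extract", "skills": []}
billed_skillset = attach_cognitive_services(base_skillset, "COGNITIVE_SERVICES_KEY")
print(billed_skillset["cognitiveServices"]["@odata.type"])
```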

The last component is Azure Cognitive Search, which you can create in the portal. You can use the Free tier to complete this guide.

As with the Azure Function, take a moment to collect the admin key. Later, when you begin structuring requests, you will need to provide the endpoint and the admin api-key used to authenticate each request.

  1. Sign in to the Azure portal, and on your search service's Overview page, get the name of your search service. You can confirm the service name by reviewing the endpoint URL. If your endpoint URL is https://mydemo.search.azure.cn, your service name is mydemo.

  2. In Settings > Keys, get an admin key for full rights on the service. Two interchangeable admin keys are provided for business continuity in case you need to roll one over. You can use either the primary or the secondary key on requests for adding, modifying, and deleting objects.

    Get the service name and the admin and query keys

Every request sent to your service requires an api-key in the header. A valid key establishes trust, on a per-request basis, between the application sending the request and the service that handles it.
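To make the endpoint and header requirement concrete, here is a minimal Python sketch that derives the endpoint from a service name and builds the headers for a request. The helper names and the placeholder key are hypothetical; the mydemo service name and the search.azure.cn domain come from the example above.

```python
def search_endpoint(service_name: str) -> str:
    """Build the service endpoint from the search service name."""
    return f"https://{service_name}.search.azure.cn"


def auth_headers(admin_key: str) -> dict:
    """Headers sent with every request; the api-key authenticates the call."""
    return {"Content-Type": "application/json", "api-key": admin_key}


endpoint = search_endpoint("mydemo")
headers = auth_headers("ADMIN_KEY_PLACEHOLDER")  # never hard-code a real key
print(endpoint)
print(sorted(headers))
```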

2 - Set up Postman

Download and install Postman, then import and configure the request collection.

  1. Download the Postman collection source code.

  2. Select File > Import to import the source code into Postman.

  3. Select the Collections tab, and then select the ... (ellipsis) button.

  4. Select Edit.

    Postman app showing navigation

  5. In the Edit dialog box, select the Variables tab.

On the Variables tab, you can add values that Postman swaps in every time it encounters a specific variable inside double braces. For example, Postman replaces the symbol {{admin-key}} with the current value that you set for admin-key. Postman makes the substitution in URLs, headers, the request body, and so on.

For admin-key, use the Azure Cognitive Search admin api-key you noted earlier. Set search-service-name to the name of the Azure Cognitive Search service you are using. Set storage-connection-string by using the value on your storage account's Access Keys tab, and set storage-container-name to the name of the blob container on that storage account where the encrypted files are stored. Set function-uri to the Azure Function URL you noted earlier, and set function-code to the Azure Function host key code you noted earlier. You can leave the defaults for the other values.

Postman app variables tab

Variable                     Where to get it
admin-key                    On the Keys page of the Azure Cognitive Search service.
search-service-name          The name of the Azure Cognitive Search service. The URL is https://{{search-service-name}}.search.azure.cn.
storage-connection-string    In the storage account, on the Access Keys tab, select key1 > Connection string.
storage-container-name       The name of the blob container that has the encrypted files to be indexed.
function-uri                 In the Azure Function, under Essentials on the main page.
function-code                In the Azure Function, by navigating to App keys, selecting the default key to show it, and copying the value.
api-version                  Leave as 2020-06-30.
datasource-name              Leave as encrypted-blobs-ds.
index-name                   Leave as encrypted-blobs-idx.
skillset-name                Leave as encrypted-blobs-ss.
indexer-name                 Leave as encrypted-blobs-ixr.
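The double-brace substitution described above can be imitated in a few lines of Python, which may help when checking what a request will look like once the variables are filled in. This is only a sketch of the idea, not Postman's implementation; the substitute function is hypothetical, and it leaves any name it does not know untouched.

```python
import re


def substitute(template: str, variables: dict) -> str:
    """Replace each {{name}} in the template with its value from variables."""
    def lookup(match: re.Match) -> str:
        name = match.group(1).strip()
        # Unknown names are left as-is so that gaps stay visible.
        return str(variables.get(name, match.group(0)))

    return re.sub(r"\{\{([^{}]+)\}\}", lookup, template)


variables = {
    "search-service-name": "mydemo",
    "index-name": "encrypted-blobs-idx",
    "api-version": "2020-06-30",
}
url = substitute(
    "https://{{search-service-name}}.search.azure.cn"
    "/indexes/{{index-name}}?api-version={{api-version}}",
    variables,
)
print(url)
```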

Review the request collection in Postman

When you run this guide, you must issue four HTTP requests:

  • PUT request to create the index: This index holds the data that Azure Cognitive Search uses and returns.
  • POST request to create the datasource: This datasource connects your Azure Cognitive Search service to your storage account, and therefore to the encrypted blob files.
  • PUT request to create the skillset: The skillset specifies the custom skill definition for the Azure Function that decrypts the blob file data, and a DocumentExtractionSkill that extracts the text from each document after it has been decrypted.
  • PUT request to create the indexer: Running the indexer reads the data, applies the skillset, and stores the results. You must run this request last.
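The four requests listed above can be laid out as method and URL pairs. The Python sketch below only builds that plan and sends nothing; the plan_requests helper is hypothetical, the object names are the defaults from the variables table, and the request bodies are left to the Postman collection.

```python
def plan_requests(
    service_name: str,
    api_version: str = "2020-06-30",
    index_name: str = "encrypted-blobs-idx",
    skillset_name: str = "encrypted-blobs-ss",
    indexer_name: str = "encrypted-blobs-ixr",
) -> list:
    """Return (method, url) pairs in the order the guide issues them."""
    base = f"https://{service_name}.search.azure.cn"
    query = f"?api-version={api_version}"
    return [
        ("PUT", f"{base}/indexes/{index_name}{query}"),
        # POST goes to the collection; the datasource name travels in the body.
        ("POST", f"{base}/datasources{query}"),
        ("PUT", f"{base}/skillsets/{skillset_name}{query}"),
        # The indexer comes last because it references the other three objects.
        ("PUT", f"{base}/indexers/{indexer_name}{query}"),
    ]


requests_plan = plan_requests("mydemo")
for method, url in requests_plan:
    print(method, url)
```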

The source code contains a Postman collection that has the four requests, as well as some useful follow-up requests. To issue the requests, select the tab for each request in Postman and select Send.

3 - Monitor indexing

Indexing and enrichment begin as soon as you submit the Create Indexer request. Depending on how many documents are in your storage account, indexing can take a while. To find out whether the indexer is still running, use the Get Indexer Status request provided as part of the Postman collection and review the response to learn whether the indexer is running, or to view error and warning information.
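When reading the Get Indexer Status response, the fields of interest are the last run's status and its error and warning lists. The Python sketch below condenses a response into those parts. The summarize_status helper is hypothetical, and the sample response is hand-written for illustration rather than captured from a real service.

```python
def summarize_status(status: dict) -> dict:
    """Condense an indexer status response into the parts worth checking."""
    last = status.get("lastResult") or {}
    return {
        # A run that is still executing reports lastResult.status "inProgress".
        "running": last.get("status") == "inProgress",
        "last_status": last.get("status"),
        "errors": [e.get("errorMessage") for e in last.get("errors", [])],
        "warnings": [w.get("message") for w in last.get("warnings", [])],
    }


# Hand-written sample; field values are illustrative only.
sample_response = {
    "status": "running",
    "lastResult": {
        "status": "success",
        "itemsProcessed": 3,
        "itemsFailed": 0,
        "errors": [],
        "warnings": [
            {"key": "doc-1", "message": "Truncated extracted text to '32768' characters"}
        ],
    },
}
summary = summarize_status(sample_response)
print(summary)
```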

If you are using the Free tier, the following message is expected: "Could not extract content or metadata from your document. Truncated extracted text to '32768' characters". This message appears because blob indexing on the Free tier has a 32K limit on character extraction. You won't see this message for this data set on higher tiers.

After indexer execution is finished, you can run some queries to verify that the data has been successfully decrypted and indexed. Navigate to your Azure Cognitive Search service in the portal, and use Search explorer to run queries over the indexed data.
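The same verification can be done outside the portal with a query against the index's docs collection. The Python sketch below only builds the query URL and does not send it; the build_query_url helper is hypothetical, and sending the request would additionally need the api-key header described earlier.

```python
from urllib.parse import urlencode


def build_query_url(
    service_name: str,
    index_name: str,
    search_text: str,
    api_version: str = "2020-06-30",
) -> str:
    """Build a GET URL that searches the index and asks for a result count."""
    params = {
        "api-version": api_version,
        "search": search_text,
        "$count": "true",
    }
    # urlencode percent-encodes reserved characters such as "*" and "$".
    return (
        f"https://{service_name}.search.azure.cn"
        f"/indexes/{index_name}/docs?{urlencode(params)}"
    )


query_url = build_query_url("mydemo", "encrypted-blobs-idx", "*")
print(query_url)
```

Searching for * matches every document, so a non-zero count is a quick sign that decrypted content reached the index.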

Next steps

Now that you have successfully indexed encrypted files, you can iterate on this pipeline by adding more cognitive skills. This lets you enrich your data and gain additional insights into it.

If you are working with doubly encrypted data, you might want to investigate the index encryption features available in Azure Cognitive Search. Although the indexer needs decrypted data for indexing purposes, once the index exists, it can be encrypted using a customer-managed key. This ensures that your data is always encrypted at rest. For more information, see Configure customer-managed keys for data encryption in Azure Cognitive Search.