Search over Azure Blob storage content

Searching across the variety of content types stored in Azure Blob storage can be a difficult problem to solve. This article reviews the basic workflow for extracting content and metadata from blobs and sending them to a search index in Azure Cognitive Search. The resulting index can be queried using full text search.


Already familiar with the workflow and composition? How to configure a blob indexer is your next step.

What it means to add full text search to blob data

Azure Cognitive Search is a search service that supports indexing and query workloads over user-defined indexes that contain your remote searchable content hosted in the cloud. Co-locating your searchable content with the query engine is necessary for performance, so that results come back at the speed users have come to expect from search queries.

Cognitive Search integrates with Azure Blob storage at the indexing layer, importing your blob content as search documents that are indexed into inverted indexes and other query structures that support free-form text queries and filter expressions. Because your blob content is indexed into a search index, you can use the full range of query features in Azure Cognitive Search to find information in your blob content.
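To make the term "inverted index" concrete, here is a toy sketch in Python. It is only an illustration of the general data structure (a map from each term to the documents that contain it), not how Azure Cognitive Search is implemented. The blob names and text are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in documents.items():
        for term in text.lower().split():
            index[term].add(key)
    return index


def search(index, query):
    """Return the keys of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical blob names and extracted text, for illustration only.
docs = {
    "report.pdf": "quarterly sales report",
    "notes.txt": "meeting notes on sales targets",
}
index = build_inverted_index(docs)
print(sorted(search(index, "sales")))
print(sorted(search(index, "sales report")))
```

A real search engine adds analysis steps (tokenization, stemming, scoring) on top of this basic lookup structure.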

Inputs are your blobs, in a single container, in Azure Blob storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add AI enrichment to blob indexing to create and extract text from images.

Output is always an Azure Cognitive Search index, used for fast text search, retrieval, and exploration in client applications. In between is the indexing pipeline architecture itself. The pipeline is based on the indexer feature, discussed later in this article.

Once the index is created and populated, it exists independently of your blob container, but you can rerun indexing operations to refresh your index based on changed documents. Timestamp information on individual blobs is used for change detection. You can opt for either scheduled execution or on-demand indexing as the refresh mechanism.
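The two refresh mechanisms can be sketched as follows. This is a minimal Python sketch that only builds the request pieces and sends nothing. It assumes the REST API's indexer `schedule` property and Run Indexer endpoint; the service URL placeholder, the `2020-06-30` API version, and the indexer name are assumptions to replace with your own values.

```python
# Placeholder service URL and an assumed API version; adjust for your service.
SERVICE_URL = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"


def schedule_fragment(interval="PT2H"):
    """Indexer definition fragment for scheduled execution.

    The interval is an ISO 8601 duration; "PT2H" means every two hours.
    """
    return {"schedule": {"interval": interval}}


def run_indexer_request(indexer_name):
    """HTTP method and URL for an on-demand run of an existing indexer."""
    url = f"{SERVICE_URL}/indexers/{indexer_name}/run?api-version={API_VERSION}"
    return "POST", url


# "blob-indexer" is a hypothetical indexer name.
method, url = run_indexer_request("blob-indexer")
print(method, url)
print(schedule_fragment())
```

The schedule fragment would be merged into the indexer definition when creating or updating the indexer, while the run request is sent whenever you want an immediate refresh.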

Required resources

You need both Azure Cognitive Search and Azure Blob storage. Within Blob storage, you need a container that provides source content.

You can start directly in your Storage account portal page. In the left navigation pane, under Blob service, click Add Azure Cognitive Search to create a new service or select an existing one.

Once you add Azure Cognitive Search to your storage account, you can follow the standard process to index blob data. We recommend the Import data wizard in Azure Cognitive Search for an easy initial introduction, or you can call the REST APIs using a tool like Postman. This tutorial walks you through the steps of calling the REST API in Postman: Index and search semi-structured data (JSON blobs) in Azure Cognitive Search.

Use a Blob indexer

An indexer is a data-source-aware subservice in Cognitive Search, equipped with internal logic for sampling data, reading metadata, retrieving data, and serializing data from native formats into JSON documents for subsequent import.

Blobs in Azure Storage are indexed using the Azure Cognitive Search Blob storage indexer. You can invoke this indexer by using the Import data wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.
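As a rough sketch of what "setting the type and providing connection information" can look like, the Python below builds REST-style definitions for a data source and an indexer without sending them. It assumes the REST API's `azureblob` data source type, the container `query` property for a virtual directory, and the `indexedFileNameExtensions` configuration setting. All names, the connection string placeholder, and the helper functions are hypothetical.

```python
import json


def blob_data_source(name, connection_string, container, virtual_directory=None):
    """Data source definition that points at one blob container."""
    container_def = {"name": container}
    if virtual_directory:
        # Restrict indexing to blobs under this virtual directory.
        container_def["query"] = virtual_directory
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }


def blob_indexer(name, data_source_name, target_index, extensions=None):
    """Indexer definition, optionally limited to certain file extensions."""
    indexer = {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index,
    }
    if extensions:
        indexer["parameters"] = {
            "configuration": {"indexedFileNameExtensions": ",".join(extensions)}
        }
    return indexer


# Hypothetical names and a placeholder connection string.
source = blob_data_source(
    "blob-source", "<storage-connection-string>", "docs", virtual_directory="reports"
)
indexer = blob_indexer("blob-indexer", "blob-source", "blob-index", [".pdf", ".docx"])
print(json.dumps(source, indent=2))
print(json.dumps(indexer, indent=2))
```

Each dictionary would be sent as the JSON body of the corresponding create request against your search service.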

An indexer does the "document cracking", opening a blob to inspect content. After connecting to the data source, it's the first step in the pipeline. For blob data, this is where PDF, Office documents, and other content types are detected. Document cracking with text extraction incurs no charge. If your blobs contain image content, images are ignored unless you add AI enrichment. Standard indexing applies to text content only.

The Blob indexer comes with configuration parameters and supports change tracking if the underlying data provides sufficient information. You can learn more about the core functionality in Azure Cognitive Search Blob storage indexer.

Supported content types

By running a Blob indexer over a container, you can extract text and metadata from a range of content types, including PDF and Office documents, with a single query.

Indexing blob metadata

A common scenario that makes it easy to sort through blobs of any content type is to index both custom metadata and system properties for each blob. In this way, information for all blobs is indexed regardless of document type and stored in an index in your search service. Using your new index, you can then proceed to sort, filter, and facet across all Blob storage content.
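One way to picture a metadata-oriented index is the Python sketch below, which builds an index definition and an indexer fragment as plain dictionaries. It assumes the `metadata_storage_*` field names that the blob indexer exposes, the `dataToExtract` setting, and the `base64Encode` mapping function. The `id` key field and the `department` custom metadata field are hypothetical, as are the helper names.

```python
def field(name, edm_type, **attributes):
    """Build one index field definition with optional attribute flags."""
    return {"name": name, "type": edm_type, **attributes}


def metadata_index_definition(index_name):
    """Index whose fields hold blob system properties and custom metadata."""
    return {
        "name": index_name,
        "fields": [
            field("id", "Edm.String", key=True),
            field("metadata_storage_name", "Edm.String", searchable=True, sortable=True),
            field("metadata_storage_content_type", "Edm.String", filterable=True, facetable=True),
            field("metadata_storage_size", "Edm.Int64", filterable=True, sortable=True),
            field("metadata_storage_last_modified", "Edm.DateTimeOffset", filterable=True, sortable=True),
            # Hypothetical custom metadata property set on the blobs.
            field("department", "Edm.String", filterable=True, facetable=True),
        ],
    }


def metadata_indexer_fragment():
    """Indexer settings: extract metadata only and derive a safe document key."""
    return {
        "parameters": {"configuration": {"dataToExtract": "allMetadata"}},
        "fieldMappings": [
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                # Blob paths contain characters that are not valid in keys.
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }


index_def = metadata_index_definition("blob-metadata-index")
print([f["name"] for f in index_def["fields"]])
print(metadata_indexer_fragment())
```

Marking fields as filterable, sortable, or facetable is what enables the sort, filter, and facet operations described above.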

Indexing JSON blobs

Indexers can be configured to extract structured content found in blobs that contain JSON. An indexer can read JSON blobs and parse the structured content into the appropriate fields of a search document. Indexers can also take blobs that contain an array of JSON objects and map each element to a separate search document. You can set a parsing mode to affect the type of JSON object created by the indexer.
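The effect of the parsing mode can be approximated locally. The Python sketch below assumes the `parsingMode` values `json`, `jsonArray`, and `jsonLines` from the REST API. The `simulate_documents` function is a hypothetical local imitation of how one blob turns into one or more documents; it is not the indexer's actual logic.

```python
import json

PARSING_MODES = {"json", "jsonArray", "jsonLines"}


def indexer_parsing_config(parsing_mode):
    """Indexer definition fragment that selects a JSON parsing mode."""
    if parsing_mode not in PARSING_MODES:
        raise ValueError(f"unsupported parsing mode: {parsing_mode}")
    return {"parameters": {"configuration": {"parsingMode": parsing_mode}}}


def simulate_documents(blob_text, parsing_mode):
    """Local imitation: split one blob's text into search documents."""
    if parsing_mode == "json":
        # The whole blob becomes a single document.
        return [json.loads(blob_text)]
    if parsing_mode == "jsonArray":
        # Each array element becomes its own document.
        return list(json.loads(blob_text))
    if parsing_mode == "jsonLines":
        # Each non-empty line is a separate JSON object and document.
        return [json.loads(line) for line in blob_text.splitlines() if line.strip()]
    raise ValueError(f"unsupported parsing mode: {parsing_mode}")


array_blob = '[{"title": "a"}, {"title": "b"}]'
lines_blob = '{"title": "a"}\n{"title": "b"}\n{"title": "c"}\n'
print(len(simulate_documents(array_blob, "jsonArray")))
print(len(simulate_documents(lines_blob, "jsonLines")))
print(indexer_parsing_config("jsonArray"))
```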

Search blob content in a search index

The output of an indexer is a search index, used for interactive exploration using free text and filtered queries in a client app. For initial exploration and verification of content, we recommend starting with Search explorer in the portal to examine document structure. You can use simple query syntax, full query syntax, and filter expression syntax in Search explorer.
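The three syntaxes can be compared side by side as request bodies. This Python sketch builds them without sending anything. It assumes the Search Documents REST endpoint with the `search`, `queryType`, and `filter` body properties, plus the `content` and `metadata_storage_size` fields that a blob indexer typically populates. The service URL placeholder, API version, and index name are assumptions.

```python
# Placeholder service URL and an assumed API version; adjust for your service.
SERVICE_URL = "https://<your-service>.search.windows.net"
API_VERSION = "2020-06-30"


def search_request(index_name, search_text, query_type="simple", filter_expr=None):
    """Return the URL and JSON body for a POST search request."""
    url = f"{SERVICE_URL}/indexes/{index_name}/docs/search?api-version={API_VERSION}"
    body = {"search": search_text, "queryType": query_type}
    if filter_expr:
        # OData filter expression, applied alongside the text query.
        body["filter"] = filter_expr
    return url, body


# Simple syntax: both terms must match.
_, simple_body = search_request("blob-index", "sales +report")

# Full (Lucene) syntax: fuzzy match scoped to the content field.
_, full_body = search_request("blob-index", "content:budget~1", query_type="full")

# Filter expression: match everything, then keep blobs larger than 1 KB.
_, filtered_body = search_request(
    "blob-index", "*", filter_expr="metadata_storage_size gt 1024"
)

for body in (simple_body, full_body, filtered_body):
    print(body)
```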

A more permanent solution is to gather query inputs and present the response as search results in a client application. The following C# tutorial explains how to build a search application: Create your first application in Azure Cognitive Search.

Next steps