Data import overview - Azure Cognitive Search

In Azure Cognitive Search, queries execute over content that has been loaded into and saved in a search index. This article examines the two basic approaches for populating an index: push your data into the index programmatically, or point an Azure Cognitive Search indexer at a supported data source to pull in the data.

With either approach, the objective is to load data from an external data source into an Azure Cognitive Search index. Azure Cognitive Search lets you create an empty index, but the index is not queryable until you push or pull data into it.

Note

If AI enrichment is a solution requirement, you must use the pull model (indexers) to load an index. External processing is supported only through skillsets attached to an indexer.

Pushing data to an index

The push model, used to programmatically send your data to Azure Cognitive Search, is the most flexible approach. First, it has no restrictions on data source type. Any dataset composed of JSON documents can be pushed to an Azure Cognitive Search index, assuming each document in the dataset has fields that map to fields defined in your index schema. Second, it has no restrictions on frequency of execution. You can push changes to an index as often as you like. For applications with very low latency requirements (for example, if search operations must stay in sync with dynamic inventory databases), the push model is your only option.

This approach is more flexible than the pull model because you can upload documents individually or in batches (up to 1000 documents or 16 MB per batch, whichever limit is reached first). The push model also allows you to upload documents to Azure Cognitive Search regardless of where your data is.
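As a rough illustration of the batch limits quoted above, the following Python sketch splits a stream of documents into batches that respect both the 1000-document and the 16 MB ceiling. The function name make_batches is hypothetical, the size of each document is approximated by its UTF-8 JSON length (the request envelope is not counted), and 16 MB is read here as 16 MiB, which is an assumption.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_DOCS_PER_BATCH = 1000                # document-count limit quoted in the text
MAX_BYTES_PER_BATCH = 16 * 1024 * 1024   # 16 MB, read here as 16 MiB (assumption)


def make_batches(
    docs: Iterable[Dict[str, Any]],
    max_docs: int = MAX_DOCS_PER_BATCH,
    max_bytes: int = MAX_BYTES_PER_BATCH,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of documents that stay within both limits (hypothetical helper)."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in docs:
        # Approximate the document's contribution by its UTF-8 JSON size.
        doc_bytes = len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("a single document exceeds the batch size limit")
        # Close the current batch when adding this document would break a limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


# Example: 2500 small documents are split by the document-count limit.
batch_sizes = [len(b) for b in make_batches({"id": str(i)} for i in range(2500))]
```

Because the byte count is only an estimate, a production uploader would probably leave some headroom below the size limit rather than fill each batch to the brim.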

How to push data to an Azure Cognitive Search index

You can load single or multiple documents into an index through the REST API or the .NET SDK.

There is currently no tool support for pushing data via the portal.

For an introduction to each methodology, see "Quickstart: Create an Azure Cognitive Search index using PowerShell" or "C# Quickstart: Create an Azure Cognitive Search index using .NET SDK".

Indexing actions: upload, merge, mergeOrUpload, delete

You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.

In the REST API, issue HTTP POST requests with JSON request bodies to your Azure Cognitive Search index's endpoint URL. Each JSON object in the "value" array contains the document's key and specifies whether an indexing action adds, updates, or deletes document content. For a code example, see Load documents.
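A minimal Python sketch of such a request is shown here. It only builds the request and does not send it. The helper name build_index_request, the service name, the index name, and the hotelId and tags fields are hypothetical. The /docs/index path and the api-key header are not stated in the text above; they are taken from the service's document-indexing REST operation and should be checked against the API reference for the version you use.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_index_request(service: str, index: str, api_key: str,
                        actions: List[Dict[str, Any]],
                        api_version: str = "2020-06-30") -> urllib.request.Request:
    """Build (but do not send) a POST request that carries indexing actions."""
    # Path suffix /docs/index is an assumption drawn from the REST reference.
    url = (f"https://{service}.search.azure.cn/indexes/{index}"
           f"/docs/index?api-version={api_version}")
    # Every action object goes into the "value" array of the JSON body.
    body = json.dumps({"value": actions}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical documents: each carries its key (hotelId) and an action.
actions = [
    {"@search.action": "upload", "hotelId": "1", "tags": ["budget"]},
    {"@search.action": "delete", "hotelId": "2"},
]
request = build_index_request("my-service", "hotels", "<admin-api-key>", actions)
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```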

In the .NET SDK, package up your data into an IndexBatch object. An IndexBatch encapsulates a collection of IndexAction objects, each of which contains a document and a property that tells Azure Cognitive Search what action to perform on that document. For a code example, see the C# Quickstart.

@search.action: upload
Description: An upload action is similar to an "upsert": the document is inserted if it is new and updated/replaced if it exists.
Necessary fields for each document: key, plus any other fields you wish to define
Notes: When updating/replacing an existing document, any field that is not specified in the request is set to null. This occurs even when the field was previously set to a non-null value.

@search.action: merge
Description: Updates an existing document with the specified fields. If the document does not exist in the index, the merge fails.
Necessary fields for each document: key, plus any other fields you wish to define
Notes: Any field you specify in a merge replaces the existing field in the document. In the .NET SDK, this includes fields of type DataType.Collection(DataType.String). In the REST API, this includes fields of type Collection(Edm.String). For example, if the document contains a field tags with value ["budget"] and you execute a merge with value ["economy", "pool"] for tags, the final value of the tags field will be ["economy", "pool"]. It will not be ["budget", "economy", "pool"].

@search.action: mergeOrUpload
Description: This action behaves like merge if a document with the given key already exists in the index. If the document does not exist, it behaves like upload with a new document.
Necessary fields for each document: key, plus any other fields you wish to define
Notes: -

@search.action: delete
Description: Removes the specified document from the index.
Necessary fields for each document: key only
Notes: Any fields you specify other than the key field are ignored. If you want to remove an individual field from a document, use merge instead and set the field explicitly to null.
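The semantics of the four actions can be mimicked with a toy in-memory index. The Python sketch below is an illustration only, not the service's implementation: the apply_action helper, the schema, and the hotel fields are hypothetical, and treating a delete of a missing key as a no-op is an assumption that the text above does not address.

```python
from typing import Any, Dict, Sequence

Index = Dict[str, Dict[str, Any]]  # toy index: key value -> stored document


def apply_action(index: Index, action: Dict[str, Any],
                 key_field: str, schema_fields: Sequence[str]) -> None:
    """Apply one indexing action to the toy index (hypothetical helper)."""
    kind = action["@search.action"]
    fields = {k: v for k, v in action.items() if k != "@search.action"}
    key = fields[key_field]

    if kind == "mergeOrUpload":
        # Behaves like merge when the key exists, otherwise like upload.
        kind = "merge" if key in index else "upload"

    if kind == "upload":
        # Full replacement: fields missing from the request become None (null).
        doc = {name: None for name in schema_fields}
        doc.update(fields)
        index[key] = doc
    elif kind == "merge":
        if key not in index:
            raise KeyError(f"merge failed: no document with key {key!r}")
        # Each specified field replaces the stored value, collections included.
        index[key].update(fields)
    elif kind == "delete":
        # Only the key matters; a missing key is treated as a no-op (assumption).
        index.pop(key, None)
    else:
        raise ValueError(f"unknown action: {kind!r}")


schema = ("hotelId", "name", "tags")
index: Index = {}
apply_action(index, {"@search.action": "upload", "hotelId": "1",
                     "name": "Sea View", "tags": ["budget"]}, "hotelId", schema)
apply_action(index, {"@search.action": "merge", "hotelId": "1",
                     "tags": ["economy", "pool"]}, "hotelId", schema)
merged = dict(index["1"])      # name kept, tags replaced rather than appended
apply_action(index, {"@search.action": "upload", "hotelId": "1",
                     "tags": ["spa"]}, "hotelId", schema)
reuploaded = dict(index["1"])  # name is None now: upload replaced the document
```

The contrast between merged and reuploaded is the practical point: merge touches only the fields it names, while upload resets every field it does not name.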

Formulate your query

There are two ways to search your index using the REST API. One way is to issue an HTTP POST request where your query parameters are defined in a JSON object in the request body. The other way is to issue an HTTP GET request where your query parameters are defined within the request URL. POST has more relaxed limits on the size of query parameters than GET. For this reason, we recommend using POST unless you have special circumstances where using GET would be more convenient.

For both POST and GET, you need to provide your service name, index name, and an API version in the request URL.

For GET, the query string at the end of the URL is where you provide the query parameters. The URL format is as follows:

https://[service name].search.azure.cn/indexes/[index name]/docs?[query string]&api-version=2020-06-30

The URL format for POST is the same, except that api-version is the only parameter in the query string; the other query parameters move into the JSON request body.
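The difference between the two styles can be sketched in Python as follows. The helper names, the service and index names, and the sample search text are hypothetical; search and searchMode are used as example query parameters. Only the GET URL and the POST body are built here, and nothing is sent.

```python
import json
from typing import Dict
from urllib.parse import quote, urlencode

API_VERSION = "2020-06-30"


def build_get_url(service: str, index: str, params: Dict[str, str]) -> str:
    """GET: every query parameter, plus api-version, goes into the URL."""
    query = urlencode({**params, "api-version": API_VERSION}, quote_via=quote)
    return f"https://{service}.search.azure.cn/indexes/{index}/docs?{query}"


def build_post_parts(params: Dict[str, str]) -> Dict[str, str]:
    """POST: api-version stays in the query string, the rest moves to the body."""
    return {
        "query_string": urlencode({"api-version": API_VERSION}),
        "body": json.dumps(params),
    }


params = {"search": "budget hotel", "searchMode": "all"}
get_url = build_get_url("my-service", "hotels", params)
post_parts = build_post_parts(params)
```

With GET, long filter or search expressions count against URL length limits, which is one reason the text recommends POST for anything beyond short queries.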

Pulling data into an index

The pull model crawls a supported data source and automatically uploads the data into your index. In Azure Cognitive Search, this capability is implemented through indexers, which are currently available for a specific set of supported data platforms.

Indexers connect an index to a data source (usually a table, view, or equivalent structure), and map source fields to equivalent fields in the index. During execution, the rowset is automatically transformed to JSON and loaded into the specified index. All indexers support scheduling so that you can specify how frequently the data is to be refreshed. Most indexers provide change tracking if the data source supports it. By tracking changes and deletes to existing documents in addition to recognizing new documents, indexers remove the need to actively manage the data in your index.
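To make the pieces of an indexer concrete, the Python sketch below assembles an indexer definition as a JSON document: a data source, a target index, a refresh schedule, and source-to-index field mappings. The helper name and all sample names are hypothetical. The property names (dataSourceName, targetIndexName, schedule.interval, fieldMappings) are not given in the text above; they are assumed from the indexer creation operation in the REST API and should be verified against the reference for your API version.

```python
import json
from typing import Any, Dict


def build_indexer_definition(name: str, data_source_name: str,
                             target_index_name: str, interval: str,
                             field_mappings: Dict[str, str]) -> Dict[str, Any]:
    """Assemble an indexer definition (property names assumed from the REST API)."""
    return {
        "name": name,
        "dataSourceName": data_source_name,    # existing data source to crawl
        "targetIndexName": target_index_name,  # index that receives the documents
        "schedule": {"interval": interval},    # ISO 8601 duration, e.g. every 2 hours
        "fieldMappings": [
            {"sourceFieldName": source, "targetFieldName": target}
            for source, target in field_mappings.items()
        ],
    }


definition = build_indexer_definition(
    name="hotels-indexer",                 # hypothetical names throughout
    data_source_name="hotels-sql",
    target_index_name="hotels",
    interval="PT2H",
    field_mappings={"HotelName": "name"},
)
definition_json = json.dumps(definition, indent=2)
```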

How to pull data into an Azure Cognitive Search index

Indexer functionality is exposed in the Azure portal, the REST API, and the .NET SDK.

An advantage of using the portal is that Azure Cognitive Search can usually generate a default index schema for you by reading the metadata of the source dataset. You can modify the generated index until the index is processed, after which the only schema edits allowed are those that do not require reindexing. If the changes you want to make affect the schema directly, you need to rebuild the index.

Verify data import with Search explorer

A quick way to perform a preliminary check on the document upload is to use Search explorer in the portal. The explorer lets you query an index without having to write any code. The search experience is based on default settings, such as the simple syntax and the default searchMode query parameter. Results are returned in JSON so that you can inspect the entire document.

Tip

Numerous Azure Cognitive Search code samples include embedded or readily available datasets, offering an easy way to get started. The portal also provides a sample indexer and data source consisting of a small real estate dataset (named "realestate-us-sample"). When you run the preconfigured indexer on the sample data source, an index is created and loaded with documents that can then be queried in Search explorer or by code that you write.

See also