How to work with search results in Azure Cognitive Search

This article explains how to formulate a query response in Azure Cognitive Search. The structure of a response is determined by parameters in the query: Search Document in the REST API, or SearchResults Class in the .NET SDK. Parameters on the query can be used to structure the result set in the following ways:

  • Limit or batch the number of documents in the results (50 by default)
  • Select fields to include in the results
  • Order results
  • Highlight a matching whole or partial term in the body of the search results

Result composition

While a search document might consist of a large number of fields, typically only a few are needed to represent each document in the result set. On a query request, append $select=<field list> to specify which fields show up in the response. A field must be attributed as Retrievable in the index to be included in a result.
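As a minimal sketch of the request shape, the Python below builds a JSON body for a POST search request that limits the response to a few fields. The helper name build_search_body is hypothetical (it is not part of any SDK), the field names are borrowed from the hotels-sample-index, and every listed field is assumed to be marked Retrievable in the index.

```python
import json


def build_search_body(search_text, fields, include_count=False):
    """Build the JSON body for a POST .../docs/search request.

    Hypothetical helper for illustration only, not part of any SDK.
    `fields` is assumed to list fields marked Retrievable in the index.
    """
    body = {
        "search": search_text,
        # The POST body takes a comma-separated list of field names.
        "select": ", ".join(fields),
    }
    if include_count:
        body["count"] = True
    return body


body = build_search_body(
    "sandy beaches",
    ["HotelId", "HotelName", "Description", "Rating", "Address/City"],
    include_count=True,
)
print(json.dumps(body, indent=2))
```

Keeping the field list in one place like this makes it easier to keep the result layout and the query in sync.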

Fields that work best include those that contrast and differentiate among documents, providing sufficient information to invite a click-through response on the part of the user. On an e-commerce site, it might be a product name, description, brand, color, size, price, and rating. For the hotels-sample-index built-in sample, it might be the fields in the following example:

POST /indexes/hotels-sample-index/docs/search?api-version=2020-06-30
{
      "search": "sandy beaches",
      "select": "HotelId, HotelName, Description, Rating, Address/City",
      "count": true
}


If you want to include image files in a result, such as a product photo or logo, store them outside of Azure Cognitive Search, but include a field in your index to reference the image URL in the search document. Sample indexes that support images in the results include the realestate-sample-us demo, featured in this quickstart, and the New York City Jobs demo app.

Tips for unexpected results

Occasionally, the substance and not the structure of results is unexpected. When query outcomes are unexpected, you can try these query modifications to see if results improve:

  • Change searchMode=any (default) to searchMode=all to require matches on all criteria instead of any of the criteria. This is especially true when boolean operators are included in the query.

  • Experiment with different lexical analyzers or custom analyzers to see if it changes the query outcome. The default analyzer will break up hyphenated words and reduce words to root forms, which usually improves the robustness of a query response. However, if you need to preserve hyphens, or if strings include special characters, you might need to configure custom analyzers to ensure the index contains tokens in the right format. For more information, see Partial term search and patterns with special characters (hyphens, wildcard, regex, patterns).

Paging results

By default, the search engine returns up to the first 50 matches. For full text search, the matches are ranked by search score; for exact match queries, they are returned in an arbitrary order.

To return a different number of matching documents, add $top and $skip parameters to the query request. The following list explains the logic.

  • Add $count=true to get a count of the total number of matching documents within an index.

  • Return the first set of 15 matching documents plus a count of total matches: GET /indexes/<INDEX-NAME>/docs?search=<QUERY STRING>&$top=15&$skip=0&$count=true

  • Return the second set, skipping the first 15 to get the next 15: $top=15&$skip=15. Do the same for the third set of 15: $top=15&$skip=30
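The arithmetic in the list above can be sketched in Python as follows. The helper page_query is hypothetical, pages are assumed to be numbered from zero, and the api-version parameter is omitted for brevity.

```python
from urllib.parse import urlencode


def page_query(search_text, page, page_size=15):
    """Return the query string for a zero-based page number.

    Hypothetical helper: $skip is the number of documents on all
    earlier pages, and $top is the size of the current page.
    """
    params = {
        "search": search_text,
        "$top": page_size,
        "$skip": page * page_size,
        "$count": "true",
    }
    # safe="$" keeps the leading "$" of the parameter names unescaped.
    return urlencode(params, safe="$")


for page in range(3):
    print(page_query("sandy beaches", page))
```

Each call produces an independent query, so nothing ties page 2 to the state of the index at the time page 1 was fetched.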

The results of paginated queries are not guaranteed to be stable if the underlying index is changing. Paging changes the value of $skip for each page, but each query is independent and operates on the current view of the data as it exists in the index at query time (in other words, there is no caching or snapshot of results, such as those found in a general purpose database).

The following example shows how you might get duplicates. Assume an index with four documents:

{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }
{ "id": "4", "rating": 1 }

Now assume you want results returned two at a time, ordered by rating. You would execute this query to get the first page of results: $top=2&$skip=0&$orderby=rating desc, producing the following results:

{ "id": "1", "rating": 5 }
{ "id": "2", "rating": 3 }

On the service, assume a fifth document is added to the index in between query calls: { "id": "5", "rating": 4 }. Shortly thereafter, you execute a query to fetch the second page: $top=2&$skip=2&$orderby=rating desc, and get these results:

{ "id": "2", "rating": 3 }
{ "id": "3", "rating": 2 }

Notice that document 2 is fetched twice. This is because the new document 5 has a greater value for rating, so it sorts before document 2 and lands on the first page, pushing document 2 onto the second page. While this behavior might be unexpected, it's typical of how a search engine behaves.
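The scenario can be reproduced with a small in-memory simulation. This sketch does not call the service: fetch_page is a hypothetical stand-in that applies $orderby=rating desc, $skip, and $top to a Python list, which is enough to show why the duplicate appears.

```python
def fetch_page(index, top, skip):
    """Simulate $top/$skip with $orderby=rating desc over an in-memory list."""
    ordered = sorted(index, key=lambda doc: doc["rating"], reverse=True)
    return ordered[skip:skip + top]


index = [
    {"id": "1", "rating": 5},
    {"id": "2", "rating": 3},
    {"id": "3", "rating": 2},
    {"id": "4", "rating": 1},
]

# First page, fetched before the index changes.
page1 = fetch_page(index, top=2, skip=0)

# A fifth document is added between the two query calls.
index.append({"id": "5", "rating": 4})

# Second page, fetched against the updated index.
page2 = fetch_page(index, top=2, skip=2)

print([doc["id"] for doc in page1])  # ['1', '2']
print([doc["id"] for doc in page2])  # ['2', '3']
```

Because each page is computed against the index as it exists at query time, document 2 shows up on both pages and document 5 is never seen by this client.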

Ordering results

For full text search queries, results are automatically ranked by a search score, calculated based on term frequency and proximity in a document, with higher scores going to documents having more or stronger matches on a search term.

Search scores convey a general sense of relevance, reflecting the strength of match relative to other documents in the same result set. But scores are not always consistent from one query to the next, so as you work with queries, you might notice small discrepancies in how search documents are ordered. There are several explanations for why this might occur.

Cause | Description
Data volatility | Index content varies as you add, modify, or delete documents. Term frequencies will change as index updates are processed over time, affecting the search scores of matching documents.
Multiple replicas | For services using multiple replicas, queries are issued against each replica in parallel. The index statistics used to calculate a search score are calculated on a per-replica basis, with results merged and ordered in the query response. Replicas are mostly mirrors of each other, but statistics can differ due to small differences in state. For example, one replica might have deleted documents contributing to its statistics, which were merged out of other replicas. Typically, differences in per-replica statistics are more noticeable in smaller indexes.
Identical scores | If multiple documents have the same score, any one of them might appear first.

How to get consistent ordering

If consistent ordering is an application requirement, you can explicitly define an $orderby expression on a field. Only fields that are indexed as sortable can be used to order results. Fields commonly used in an $orderby include rating, date, and location fields. The value of the orderby parameter can include field names and calls to the geo.distance() function for geospatial values.
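A rough sketch of composing an orderby value is shown below. The helpers build_orderby and geo_distance are hypothetical, and the field names Rating, HotelId, and Location are assumptions about the index: each would need to be marked sortable. Adding a second clause as a tiebreaker is also an assumption of this sketch, intended to cover documents that share the same rating.

```python
def build_orderby(*clauses):
    """Join (expression, direction) pairs into an orderby string."""
    parts = []
    for expression, direction in clauses:
        if direction not in ("asc", "desc"):
            raise ValueError(f"direction must be 'asc' or 'desc', got {direction!r}")
        parts.append(f"{expression} {direction}")
    return ", ".join(parts)


def geo_distance(field, longitude, latitude):
    """Format a geo.distance() call against a point (longitude first)."""
    return f"geo.distance({field}, geography'POINT({longitude} {latitude})')"


# Sort by rating, then by an assumed sortable key field to break ties.
body = {
    "search": "sandy beaches",
    "orderby": build_orderby(("Rating", "desc"), ("HotelId", "asc")),
}

# Sort by distance from a reference point, nearest first.
nearby = build_orderby((geo_distance("Location", -122.131577, 47.678581), "asc"))

print(body["orderby"])
print(nearby)
```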

Another approach that promotes consistency is using a custom scoring profile. Scoring profiles give you more control over the ranking of items in search results, with the ability to boost matches found in specific fields. The additional scoring logic can help override minor differences among replicas because the search scores for each document are farther apart. We recommend the ranking algorithm for this approach.

Hit highlighting

Hit highlighting refers to text formatting (such as bold or yellow highlights) applied to matching terms in a result, making it easy to spot the match. Hit highlighting instructions are provided on the query request.

To enable hit highlighting, add highlight=[comma-delimited list of string fields] to specify which fields will use highlighting. Highlighting is useful for longer content fields, such as a description field, where the match is not immediately obvious. Only field definitions that are attributed as searchable qualify for hit highlighting.

By default, Azure Cognitive Search returns up to five highlights per field. You can adjust this number by appending to the field a dash followed by an integer. For example, highlight=Description-10 returns up to 10 highlights on matching content in the Description field.

Formatting is applied to whole term queries. The type of formatting is determined by tags, highlightPreTag and highlightPostTag, and your code handles the response (for example, applying a bold font or a yellow background).

In the following example, the terms "sandy", "sand", "beaches", and "beach" found within the Description field are tagged for highlighting. Queries that trigger query expansion in the engine, such as fuzzy and wildcard search, have limited support for hit highlighting.

GET /indexes/hotels-sample-index/docs?search=sandy beaches&highlight=Description&api-version=2020-06-30

POST /indexes/hotels-sample-index/docs/search?api-version=2020-06-30
{
      "search": "sandy beaches",
      "highlight": "Description"
}

New behavior (starting July 15)

Services created after July 15, 2020 will provide a different highlighting experience. Services created before that date will not change in their highlighting behavior.

With the new behavior:

  • Only phrases that match the full phrase query will be returned. The query "super bowl" will return highlights like this:

    '<em>super bowl</em> is super awesome with a bowl of chips'

    Note that the term "bowl of chips" does not have any highlighting because it does not match the full phrase.
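To make the expected output concrete for client code, here is a small local illustration of full-phrase highlighting. It is not the service's algorithm: highlight_phrase is a hypothetical function that uses a regular expression to tag only complete occurrences of the phrase.

```python
import re


def highlight_phrase(text, phrase, pre_tag="<em>", post_tag="</em>"):
    """Wrap only full-phrase matches in tags (local illustration only)."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


text = "super bowl is super awesome with a bowl of chips"
print(highlight_phrase(text, "super bowl"))
```

The standalone words "super" and "bowl" later in the sentence are left untagged, which matches the behavior described for newly created services.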

When you are writing client code that implements hit highlighting, be aware of this change. Note that this will not impact you unless you create a completely new search service.

Next steps

To quickly generate a search page for your client, consider these options:

  • Application Generator, in the portal, creates an HTML page with a search bar, faceted navigation, and results area that includes images.
  • Create your first app in C# is a tutorial that builds a functional client. Sample code demonstrates paginated queries, hit highlighting, and sorting.

Several code samples include a web front-end interface, which you can find here: JavaScript sample code with a live demo site, and CognitiveSearchFrontEnd.