Configure the similarity ranking algorithm in Azure Cognitive Search

Azure Cognitive Search supports two similarity ranking algorithms:

  • A classic similarity algorithm, used by all search services created before July 15, 2020.
  • An implementation of the Okapi BM25 algorithm, used in all search services created after July 15, 2020.

BM25 ranking is the new default because it tends to produce search rankings that align better with user expectations. It comes with parameters for tuning results based on factors such as document size.

For new services created after July 15, 2020, BM25 is used automatically and is the sole similarity algorithm. If you try to set similarity to ClassicSimilarity on a new service, an HTTP 400 error is returned because the service does not support that algorithm.

For older services created before July 15, 2020, classic similarity remains the default algorithm. Older services can upgrade to BM25 on a per-index basis, as explained below. If you switch from classic to BM25, expect some differences in how search results are ordered.

Note

Semantic ranking, currently in preview for standard services in selected regions, is an additional step forward in producing more relevant results. Unlike the other algorithms, it is an add-on feature that iterates over an existing result set. For more information, see Semantic search overview.

Enable BM25 scoring on older services

If you are running a search service that was created before July 15, 2020, you can enable BM25 by setting a Similarity property on new indexes. The property is only exposed on new indexes, so if you want BM25 on an existing index, you must drop and rebuild the index with a new Similarity property set to "Microsoft.Azure.Search.BM25Similarity".

Once an index exists with a Similarity property, you can switch between BM25Similarity and ClassicSimilarity.

The following links describe the Similarity property in the Azure SDKs.

Client library    Similarity property
.NET              SearchIndex.Similarity
Java              SearchIndex.setSimilarity
JavaScript        SearchIndex.Similarity
Python            similarity property on SearchIndex

REST example

You can also use the REST API, as the following example illustrates:

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=2020-06-30
{
    "name": "indexName",
    "fields": [
        {
            "name": "id",
            "type": "Edm.String",
            "key": true
        },
        {
            "name": "name",
            "type": "Edm.String",
            "searchable": true,
            "analyzer": "en.lucene"
        },
        ...
    ],
    "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
    }
}
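
The same request can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the service name, index name, and key are placeholders, and passing the admin key in an `api-key` header is an assumption that does not appear in the example above. The request is built but not sent.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def build_create_index_request(service_name, index_name, api_key):
    """Build (but do not send) a PUT request that creates an index using BM25."""
    url = (
        f"https://{service_name}.search.azure.cn/indexes/{index_name}"
        f"?api-version={API_VERSION}"
    )
    body = {
        "name": index_name,
        # Only the two fields shown in the REST example; add your own as needed.
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {
                "name": "name",
                "type": "Edm.String",
                "searchable": True,
                "analyzer": "en.lucene",
            },
        ],
        "similarity": {"@odata.type": "#Microsoft.Azure.Search.BM25Similarity"},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the service authenticates with an "api-key" header.
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# Placeholder values, purely for illustration.
request = build_create_index_request("my-service", "my-index", "<admin-api-key>")
# To send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the service.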

Set BM25 parameters

BM25 similarity adds two user-customizable parameters to control the calculated relevance score. You can set BM25 parameters during index creation, or as an index update if the BM25 algorithm was specified during index creation.

Property    Type      Description
k1          number    Controls the scaling function between the term frequency of each matching term and the final relevance score of a document-query pair. Values are usually 0.0 to 3.0, with 1.2 as the default.

A value of 0.0 represents a "binary model", where the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text. A larger k1 value allows the score to continue to increase as more instances of the same term are found in the document.

Using a higher k1 value can be important in cases where we expect multiple terms to be part of a search query. In those cases, we might want to favor documents that match many of the different query terms over documents that match only a single term multiple times. For example, when querying the index for documents containing the terms "Apollo Spaceflight", we might want to lower the score of an article about Greek mythology that contains the term "Apollo" a few dozen times, without mentioning "Spaceflight", compared to another article that explicitly mentions both "Apollo" and "Spaceflight" only a handful of times.
b           number    Controls how the length of a document affects the relevance score. Values are between 0 and 1, with 0.75 as the default.

A value of 0.0 means the length of the document will not influence the score, while a value of 1.0 means the impact of term frequency on the relevance score will be normalized by the document's length.

Normalizing the term frequency by the document's length is useful in cases where we want to penalize longer documents. In some cases, longer documents (such as a complete novel) are more likely to contain many irrelevant terms, compared to much shorter documents.
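
To make the roles of k1 and b concrete, the sketch below computes the term-frequency component of the textbook Okapi BM25 formula. It is an illustration only: the function name and sample numbers are hypothetical, the IDF factor is omitted, and the exact formula used by the service may differ in detail.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of textbook Okapi BM25 (IDF factor omitted)."""
    if tf == 0:
        return 0.0
    # Length normalization: 1.0 when b == 0, doc_len / avg_doc_len when b == 1.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# k1 = 0.0 gives the "binary model": 1 occurrence and 50 occurrences weigh the same.
binary_once = bm25_tf_weight(1, doc_len=100, avg_doc_len=100, k1=0.0)
binary_many = bm25_tf_weight(50, doc_len=100, avg_doc_len=100, k1=0.0)

# A larger k1 lets repeated occurrences keep raising the weight.
many_default_k1 = bm25_tf_weight(50, doc_len=100, avg_doc_len=100, k1=1.2)
many_high_k1 = bm25_tf_weight(50, doc_len=100, avg_doc_len=100, k1=3.0)

# b = 0.0: document length has no effect on the weight.
short_no_norm = bm25_tf_weight(3, doc_len=50, avg_doc_len=100, b=0.0)
long_no_norm = bm25_tf_weight(3, doc_len=5000, avg_doc_len=100, b=0.0)

# b = 1.0: the same term count is worth less in a much longer document.
short_full_norm = bm25_tf_weight(3, doc_len=50, avg_doc_len=100, b=1.0)
long_full_norm = bm25_tf_weight(3, doc_len=5000, avg_doc_len=100, b=1.0)
```

In this formulation, the weight of a single term saturates as its count grows, so a document that repeats one query term dozens of times gains little over one that mentions it a few times, while matching a second query term adds a whole new contribution.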

Setting k1 and b parameters

To set or modify b or k1 values, add them to the BM25 similarity object. Setting or changing these values on an existing index takes the index offline for at least a few seconds, causing active indexing and query requests to fail. Consequently, you should set the "allowIndexDowntime=true" parameter of the update request:

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=2020-06-30&allowIndexDowntime=true
{
    "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "b" : 0.5,
        "k1" : 1.3
    }
}
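
A hedged Python sketch of preparing such an update follows. The function name and the sample index definition are hypothetical. It assumes the update body should carry the full index definition with the similarity object merged in, rather than only the fragment shown above; verify that against the REST API reference. The range check on b follows the table above, and rejecting a negative k1 is an added assumption.

```python
import json
import urllib.parse


def build_similarity_update(service_name, index_definition, k1=1.2, b=0.75):
    """Return (url, json_body) for a PUT that sets BM25 parameters on an index."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be between 0 and 1")
    if k1 < 0.0:
        raise ValueError("k1 must not be negative")

    # allowIndexDowntime=true acknowledges that the index goes offline briefly.
    query = urllib.parse.urlencode(
        {"api-version": "2020-06-30", "allowIndexDowntime": "true"}
    )
    index_name = index_definition["name"]
    url = f"https://{service_name}.search.azure.cn/indexes/{index_name}?{query}"

    # Shallow copy so the caller's definition is left unchanged.
    updated = dict(index_definition)
    updated["similarity"] = {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "b": b,
        "k1": k1,
    }
    return url, json.dumps(updated)


# Hypothetical existing definition, reduced to a single key field.
existing_index = {
    "name": "my-index",
    "fields": [{"name": "id", "type": "Edm.String", "key": True}],
}
update_url, update_body = build_similarity_update(
    "my-service", existing_index, k1=1.3, b=0.5
)
```

Because active indexing and query requests fail while the index is offline, an update like this is best scheduled for a quiet period.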

See also