Ranking algorithm in Azure Cognitive Search

Important

Starting July 15, 2020, newly created search services use the BM25 ranking function automatically. In most cases, BM25 has proven to provide search rankings that align better with user expectations than the current default ranking. Beyond superior ranking, BM25 also enables configuration options for tuning results based on factors such as document size.

With this change, you will most likely see slight changes in the ordering of your search results. For those who want to test the impact of this change, the BM25 algorithm is available in api-version 2019-05-06-Preview and in 2020-06-30.

This article describes how you can use the new BM25 ranking algorithm on existing search services for new indexes created and queried using the preview API.

Azure Cognitive Search is in the process of adopting the official Lucene implementation of the Okapi BM25 algorithm, BM25Similarity, which will replace the previously used ClassicSimilarity implementation. Like the older ClassicSimilarity algorithm, BM25Similarity is a TF-IDF-like retrieval function that uses the term frequency (TF) and the inverse document frequency (IDF) as variables to calculate a relevance score for each document-query pair. These scores are then used for ranking.

While conceptually similar to the older Classic Similarity algorithm, BM25 is rooted in probabilistic information retrieval and improves upon it. BM25 also offers advanced customization options, such as allowing the user to decide how the relevance score scales with the term frequency of matched terms.

How to test BM25 today

When you create a new index, you can set a similarity property to specify the algorithm. You can use api-version=2019-05-06-Preview, as shown below, or api-version=2020-06-30.

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=2019-05-06-Preview
{
    "name": "indexName",
    "fields": [
        {
            "name": "id",
            "type": "Edm.String",
            "key": true
        },
        {
            "name": "name",
            "type": "Edm.String",
            "searchable": true,
            "analyzer": "en.lucene"
        },
        ...
    ],
    "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
    }
}
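The same request can be assembled programmatically. The following is a minimal Python sketch that builds the request without sending it, using only the standard library. The function name, the placeholder service and index names, and the use of an `api-key` header for authentication are assumptions made for illustration; the field list is shortened to the two fields shown above.

```python
import json
import urllib.request


def build_create_index_request(service_name, index_name, api_key,
                               api_version="2020-06-30"):
    """Build (but do not send) a PUT request that creates an index using BM25."""
    url = (f"https://{service_name}.search.azure.cn/indexes/{index_name}"
           f"?api-version={api_version}")
    body = {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "name", "type": "Edm.String", "searchable": True,
             "analyzer": "en.lucene"},
        ],
        # Selecting the algorithm explicitly; this is the property under discussion.
        "similarity": {"@odata.type": "#Microsoft.Azure.Search.BM25Similarity"},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the service authenticates with an admin key in this header.
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# Hypothetical names; replace them with your own service, index, and key.
request = build_create_index_request("my-service", "my-index", "<admin-key>")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected safely.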

The similarity property is useful only on existing services, during this interim period when both algorithms are available.

Property      Description
similarity    Optional. Valid values are "#Microsoft.Azure.Search.ClassicSimilarity" and "#Microsoft.Azure.Search.BM25Similarity". Requires api-version=2019-05-06-Preview or later on a search service created prior to July 15, 2020.

For new services created after July 15, 2020, BM25 is used automatically and is the sole similarity algorithm. If you try to set similarity to ClassicSimilarity on a new service, a 400 error is returned because that algorithm is not supported on new services.

For existing services created before July 15, 2020, Classic similarity remains the default algorithm. If the similarity property is omitted or set to null, the index uses the Classic algorithm. If you want to use the new algorithm, you need to set similarity as described above.

BM25 similarity parameters

BM25 similarity adds two user-customizable parameters to control the calculated relevance score.

k1

The k1 parameter controls the scaling function between the term frequency of each matching term and the final relevance score of a document-query pair.

A value of zero represents a "binary model", where the contribution of a single matching term is the same for all matching documents, regardless of how many times that term appears in the text. A larger k1 value allows the score to continue to increase as more instances of the same term are found in the document. By default, Azure Cognitive Search uses a value of 1.2 for the k1 parameter.

The choice of k1 can be important in cases where we expect multiple terms to be part of a search query, because it determines how much repeated occurrences of one term can outweigh matches on the other terms. In those cases, we might want to favor documents that match many of the different query terms over documents that match only a single term many times. For example, when querying the index for documents containing the terms "Apollo Spaceflight", we might want an article about Greek mythology that contains the term "Apollo" a few dozen times, without mentioning "Spaceflight", to score lower than another article that explicitly mentions both "Apollo" and "Spaceflight" only a handful of times.
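To make the effect of k1 concrete, here is a small Python sketch based on the textbook Okapi BM25 term-frequency factor, tf * (k1 + 1) / (tf + k1). It is an illustration only and not necessarily the exact arithmetic of the service: length normalization is switched off, IDF is assumed to be 1 for every term, and the function names and term counts are made up for the "Apollo Spaceflight" scenario.

```python
def tf_factor(term_frequency, k1=1.2):
    """Textbook BM25 term-frequency factor with length normalization off (b = 0)."""
    if term_frequency == 0:
        return 0.0  # a term that does not occur contributes nothing
    return term_frequency * (k1 + 1) / (term_frequency + k1)


def score(term_counts, query_terms, k1=1.2):
    """Sum the factor over the query terms; IDF is assumed to be 1 for every term."""
    return sum(tf_factor(term_counts.get(term, 0), k1) for term in query_terms)


query = ["apollo", "spaceflight"]
mythology_article = {"apollo": 36}                     # one term, many times
spaceflight_article = {"apollo": 3, "spaceflight": 3}  # both terms, a few times

# Print k1 followed by the two article scores for a few k1 settings.
for k1 in (0.0, 1.2, 10.0):
    print(k1,
          round(score(mythology_article, query, k1), 3),
          round(score(spaceflight_article, query, k1), 3))
```

Under these assumptions, k1 = 0 reproduces the binary model (each matched term counts once), the default of 1.2 still ranks the article that matches both terms higher, and a very large k1 lets the many repetitions of "Apollo" dominate.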

b

The b parameter controls how the length of a document affects the relevance score.

A value of 0.0 means the length of the document does not influence the score, while a value of 1.0 means the impact of term frequency on the relevance score is fully normalized by the document's length. The default value used in Azure Cognitive Search for the b parameter is 0.75. Normalizing the term frequency by the document's length is useful in cases where we want to penalize longer documents. In some cases, longer documents (such as a complete novel) are more likely to contain many irrelevant terms than much shorter documents.
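The length normalization can be sketched in the same way. The snippet below uses the textbook Okapi BM25 formulation, in which k1 is multiplied by (1 - b + b * document length / average document length); whether the service computes it in exactly this form is an assumption, and the function name and document lengths are invented for illustration.

```python
def bm25_tf(term_frequency, doc_length, avg_doc_length, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor with document-length normalization."""
    if term_frequency == 0:
        return 0.0
    # b = 0 gives a factor of 1 (length ignored); b = 1 scales fully with length.
    length_norm = 1 - b + b * (doc_length / avg_doc_length)
    return term_frequency * (k1 + 1) / (term_frequency + k1 * length_norm)


# Hypothetical corpus: the average document has 1,000 terms.
AVG_LENGTH = 1_000
short_note = {"tf": 5, "length": 100}
full_novel = {"tf": 5, "length": 100_000}

# Print b followed by the factor for the short note and for the novel.
for b in (0.0, 0.75, 1.0):
    print(b,
          round(bm25_tf(short_note["tf"], short_note["length"], AVG_LENGTH, b=b), 3),
          round(bm25_tf(full_novel["tf"], full_novel["length"], AVG_LENGTH, b=b), 3))
```

With b = 0 both documents receive the same factor because they contain the term equally often; as b grows, the same five occurrences count for less in the novel than in the short note.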

Setting k1 and b parameters

To customize the b or k1 values, add them as properties to the similarity object when using BM25:

    "similarity": {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "b" : 0.5,
        "k1" : 1.3
    }

The similarity algorithm can only be set at index creation time. This means the similarity algorithm in use cannot be changed for existing indexes. The "b" and "k1" parameters, however, can be modified when updating an existing index definition that uses BM25. Changing those values on an existing index takes the index offline for at least a few seconds, causing your indexing and query requests to fail. Because of that, you need to set the "allowIndexDowntime=true" parameter in the query string of your update request:

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=[api-version]&allowIndexDowntime=true
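As a rough illustration of such an update, the Python sketch below takes an existing index definition, replaces its similarity settings, and builds the PUT request with allowIndexDowntime=true in the query string. The request is built but not sent. The function name, the sample definition, and the `api-key` header are assumptions for illustration, as is the premise that the update body carries the full index definition.

```python
import copy
import json
import urllib.request


def build_bm25_update_request(service_name, index_definition, api_key,
                              k1, b, api_version="2020-06-30"):
    """Build (but do not send) a PUT request that changes k1 and b on a BM25 index."""
    definition = copy.deepcopy(index_definition)  # leave the caller's dict untouched
    definition["similarity"] = {
        "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
        "k1": k1,
        "b": b,
    }
    url = (f"https://{service_name}.search.azure.cn/indexes/{definition['name']}"
           f"?api-version={api_version}&allowIndexDowntime=true")
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        # Assumption: the service authenticates with an admin key in this header.
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# Hypothetical existing definition of an index that already uses BM25.
existing = {
    "name": "my-index",
    "fields": [{"name": "id", "type": "Edm.String", "key": True}],
    "similarity": {"@odata.type": "#Microsoft.Azure.Search.BM25Similarity"},
}
update = build_bm25_update_request("my-service", existing, "<admin-key>", k1=1.3, b=0.5)
print(update.get_method(), update.full_url)
```

Because the index goes offline briefly while the change is applied, a caller would likely want to schedule this request for a quiet period.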

See also