Fuzzy search to correct misspellings and typos

Azure Cognitive Search supports fuzzy search, a type of query that compensates for typos and misspelled terms in the input string. It does this by scanning for terms that have a similar composition. Expanding the search to cover near-matches has the effect of autocorrecting a typo when the discrepancy is just a few misplaced characters.

Fuzzy search is an expansion exercise that produces a match on terms having a similar composition. When a fuzzy search is specified, the engine builds a graph (based on deterministic finite automaton theory) of similarly composed terms for every whole term in the query. For example, if your query includes the three terms "university of washington", a graph is created for every term in the query search=university~ of~ washington~ (there is no stop-word removal in fuzzy search, so "of" gets a graph).

The graph consists of up to 50 expansions, or permutations, of each term, capturing both correct and incorrect variants in the process. The engine then returns the topmost relevant matches in the response.

For a term like "university", the graph might have "unversty, universty, university, universe, inverse". Any documents that match on the terms in the graph are included in results. In contrast with other queries that analyze the text to handle different forms of the same word ("mice" and "mouse"), the comparisons in a fuzzy query are taken at face value, without any linguistic analysis of the text. "Universe" and "inverse", which are semantically different, will match because the syntactic discrepancies are small.

A match succeeds if the discrepancies are limited to two or fewer edits, where an edit is an inserted, deleted, substituted, or transposed character. The metric that measures the difference is the Damerau-Levenshtein distance, described as the "minimum number of operations (insertions, deletions, substitutions, or transpositions of two adjacent characters) required to change one word into the other".
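The edit-counting rule above can be sketched in a few lines of Python. This is an illustration only: it implements the restricted (optimal string alignment) variant of Damerau-Levenshtein distance with a plain dynamic-programming table, whereas the search engine builds automaton-based graphs. The function names edit_distance and is_fuzzy_match are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of two
    adjacent characters needed to turn a into b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def is_fuzzy_match(query_term: str, indexed_term: str, max_edits: int = 2) -> bool:
    """Assumed matching rule: lowercase both terms, then allow up to max_edits edits."""
    return edit_distance(query_term.lower(), indexed_term.lower()) <= max_edits


if __name__ == "__main__":
    for query, target in [("blue", "glue"), ("unviersty", "university")]:
        print(query, target, edit_distance(query, target))
```

Under this sketch, "blue" and "glue" differ by one substitution, while "unviersty" needs one transposition and one insertion to become "university", so both fall within the default limit of two edits.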

In Azure Cognitive Search:

  • Fuzzy query applies to whole terms, but you can support phrases through AND constructions. For example, "Unviersty~ of~ Wshington~" would match on "University of Washington".

  • The default edit distance is 2. A value of ~0 signifies no expansion (only the exact term is considered a match), and you can specify ~1 for one degree of difference, or one edit.

  • A fuzzy query can expand a term to up to 50 additional permutations. This limit is not configurable, but you can effectively reduce the number of expansions by decreasing the edit distance to 1.

  • Responses consist of documents containing a relevant match (up to 50).

Collectively, the graphs are submitted as match criteria against tokens in the index. Because every term is expanded before matching, fuzzy search is inherently slower than other query forms. The size and complexity of your index can determine whether the benefits are enough to offset the latency of the response.

Note

Because fuzzy search tends to be slow, it might be worthwhile to investigate alternatives such as n-gram indexing, with its progression of short character sequences (two- and three-character sequences for bigram and trigram tokens). Depending on your language and query surface, n-gram indexing might give you better performance. The trade-off is that n-gram indexing is very storage intensive and generates much bigger indexes.
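To make the storage trade-off concrete, here is a minimal Python sketch of character n-gram tokenization. The function name char_ngrams is hypothetical, and a real index would produce these tokens in its analysis chain rather than in application code.

```python
from typing import List


def char_ngrams(term: str, n: int) -> List[str]:
    """Return every contiguous n-character sequence in term, lowercased."""
    if n <= 0:
        raise ValueError("n must be positive")
    term = term.lower()
    # A term shorter than n yields no tokens.
    return [term[i:i + n] for i in range(len(term) - n + 1)]


if __name__ == "__main__":
    for n in (2, 3):
        print(n, char_ngrams("search", n))
```

A single six-letter term yields five bigrams and four trigrams in this sketch, which hints at why n-gram indexes grow much larger than indexes of whole terms.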

Another alternative, which you could consider if you want to handle just the most egregious cases, is a synonym map. For example, you could map "search" to "serach, serch, sarch", or "retrieve" to "retreive".
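As a hedged sketch, the mappings above could be expressed as a synonym map definition. The snippet assumes the Solr-style rule format with explicit mappings ("=>") and newline-separated rules; the map name typo-corrections and the helper build_synonym_map are hypothetical.

```python
import json
from typing import Dict, List


def build_synonym_map(name: str, corrections: Dict[str, List[str]]) -> dict:
    """Build a synonym map body that rewrites known misspellings to the correct term.

    corrections maps each correct term to the misspellings that should resolve to it.
    """
    rules = [
        f"{', '.join(misspellings)} => {correct}"
        for correct, misspellings in corrections.items()
    ]
    # Assumption: Solr-format rules, one rule per line.
    return {"name": name, "format": "solr", "synonyms": "\n".join(rules)}


if __name__ == "__main__":
    body = build_synonym_map(
        "typo-corrections",  # hypothetical map name
        {"search": ["serach", "serch", "sarch"], "retrieve": ["retreive"]},
    )
    print(json.dumps(body, indent=2))
```

Unlike fuzzy search, this approach only corrects the misspellings you list, so it suits a short set of frequent errors rather than arbitrary typos.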

Analyzers are not used during query processing to create an expansion graph, but that doesn't mean analyzers should be ignored in fuzzy search scenarios. After all, analyzers are used during indexing to create the tokens against which matching is done, whether the query is free form, filtered search, or a fuzzy search with a graph as input.

Generally, when assigning analyzers on a per-field basis, the decision to fine-tune the analysis chain is based on the primary use case (a filter or full text search) rather than on specialized query forms like fuzzy search. For this reason, there is no specific analyzer recommendation for fuzzy search.

However, if test queries are not producing the matches you expect, you could try varying the indexing analyzer, setting it to a language analyzer, to see if you get better results. Some languages, particularly those with vowel mutations, can benefit from the inflection and irregular word forms generated by the Microsoft natural language processors. In some cases, using the right language analyzer can make a difference in whether a term is tokenized in a way that is compatible with the value provided by the user.

Fuzzy queries are constructed using the full Lucene query syntax, invoking the Lucene query parser.

  1. Set the full Lucene parser on the query (queryType=full).

  2. Optionally, scope the request to specific fields by using this parameter (searchFields=<field1,field2>).

  3. Append the tilde (~) operator at the end of the whole term (search=<string>~).

    Include an optional parameter, a number between 0 and 2 (default), if you want to specify the edit distance (~1). For example, "blue~" or "blue~1" would return "blue", "blues", and "glue".

In Azure Cognitive Search, besides the term and distance (maximum of 2), there are no additional parameters to set on the query.
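The steps above can be sketched as a small Python helper that assembles the query parameters. The function names are hypothetical, and the sketch only builds the query string: the service endpoint, index name, API version, and authentication depend on your service and are left out.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def fuzzy_search_expression(
    terms: Iterable[str],
    distance: Optional[int] = None,
    require_all: bool = False,
) -> str:
    """Append the tilde operator (and optional edit distance) to each whole term."""
    if distance is not None and distance not in (0, 1, 2):
        raise ValueError("edit distance must be 0, 1, or 2")
    suffix = "~" if distance is None else f"~{distance}"
    # Joining with AND is one way to approximate a phrase with fuzzy terms.
    joiner = " AND " if require_all else " "
    return joiner.join(f"{term}{suffix}" for term in terms)


def fuzzy_query_string(
    terms: Iterable[str],
    search_fields: Iterable[str] = (),
    distance: Optional[int] = None,
    require_all: bool = False,
) -> str:
    """Return URL-encoded parameters for a fuzzy query using the full Lucene parser."""
    params = {
        "search": fuzzy_search_expression(terms, distance, require_all),
        "queryType": "full",  # step 1: full Lucene parser
    }
    fields = list(search_fields)
    if fields:
        params["searchFields"] = ",".join(fields)  # step 2: optional field scope
    return urlencode(params)


if __name__ == "__main__":
    print(fuzzy_search_expression(["university", "of", "washington"]))
    print(fuzzy_query_string(["blue"], ["Description"], distance=1))
```

Rejecting distances outside 0 to 2 in the helper mirrors the stated maximum, so an invalid value fails before a request is ever sent.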

Note

During query processing, fuzzy queries do not undergo lexical analysis. The query input is added directly to the query tree and expanded to create a graph of terms. The only transformation performed is lowercasing.

For simple testing, we recommend Search explorer or Postman for iterating over a query expression. Both tools are interactive, which means you can quickly step through multiple variants of a term and evaluate the responses that come back.

When results are ambiguous, hit highlighting can help you identify the match in the response.

Note

The use of hit highlighting to identify fuzzy matches has limitations and only works for basic fuzzy search. If your index has scoring profiles, or if you layer the query with additional syntax, hit highlighting might fail to identify the match.

Example 1: fuzzy search with the exact term

Assume the following string exists in a "Description" field in a search document: "Test queries with special characters, plus strings for MSFT, SQL and Java."

Start with a fuzzy search on "special" and add hit highlighting to the Description field:

search=special~&highlight=Description

In the response, because you added hit highlighting, formatting is applied to "special" as the matching term.

"@search.highlights": {
    "Description": [
        "Test queries with <em>special</em> characters, plus strings for MSFT, SQL and Java."
    ]

Try the request again, misspelling "special" by taking out two letters ("pe"):

search=scial~&highlight=Description

So far, there is no change to the response. With the default distance of 2, removing the two characters "pe" from "special" still allows for a successful match on that term.

"@search.highlights": {
    "Description": [
        "Test queries with <em>special</em> characters, plus strings for MSFT, SQL and Java."
    ]

Try one more request, further modifying the search term by taking out one more character for a total of three deletions (from "special" to "scal"):

search=scal~&highlight=Description

Notice that the same document is returned, but now, instead of matching on "special", the fuzzy match is on "SQL".

        "@search.score": 0.4232868,
        "@search.highlights": {
            "Description": [
                "Mix of special characters, plus strings for MSFT, <em>SQL</em>, 2019, Linux, Java."
            ]

The point of this expanded example is to illustrate the clarity that hit highlighting can bring to ambiguous results. In all cases, the same document is returned. Had you relied on document IDs to verify a match, you might have missed the shift from "special" to "SQL".
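When you step through many variants of a term, it can help to pull the highlighted terms out of each result programmatically. The following Python sketch assumes a result shaped like the excerpts above, with highlights wrapped in the default <em> tags; the helper name highlighted_terms is hypothetical.

```python
import re
from typing import Dict, List

# Assumption: highlights use the default <em>...</em> tags shown in the excerpts.
EM_PATTERN = re.compile(r"<em>(.*?)</em>")


def highlighted_terms(result: dict) -> Dict[str, List[str]]:
    """Map each highlighted field to the terms that were wrapped in <em> tags."""
    highlights = result.get("@search.highlights", {})
    terms: Dict[str, List[str]] = {}
    for field, fragments in highlights.items():
        if not isinstance(fragments, list):
            continue  # skip anything that is not a list of text fragments
        terms[field] = [t for fragment in fragments for t in EM_PATTERN.findall(fragment)]
    return terms


if __name__ == "__main__":
    sample = {
        "@search.score": 0.4232868,
        "@search.highlights": {
            "Description": [
                "Mix of special characters, plus strings for MSFT, <em>SQL</em>, 2019, Linux, Java."
            ]
        },
    }
    print(highlighted_terms(sample))
```

Comparing the extracted terms across requests makes a shift such as "special" to "SQL" visible even when the document ID stays the same.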

See also