Partial term search and patterns with special characters (wildcard, regex, patterns)

A partial term search refers to queries consisting of term fragments, where instead of a whole term, you might have just the start, middle, or end of a term (sometimes referred to as prefix, infix, or suffix queries). A partial term search might include a combination of fragments, often with special characters such as dashes or slashes that are part of the query string. Common use cases include parts of a phone number, URL, codes, or hyphenated compound words.

Partial term search and query strings that include special characters can be problematic if the index doesn't have tokens in the expected format. During the lexical analysis phase of indexing (assuming the default standard analyzer), special characters are discarded, compound words are split up, and whitespace is deleted, any of which can cause queries to fail when no match is found. For example, a phone number like +1 (425) 703-6214 (tokenized as "1", "425", "703", "6214") won't show up in a "3-62" query because that content doesn't actually exist in the index.

The solution is to invoke an analyzer during indexing that preserves a complete string, including spaces and special characters if necessary, so that you can include the spaces and characters in your query string. Likewise, having a complete string that is not tokenized into smaller parts enables pattern matching for "starts with" or "ends with" queries, where the pattern you provide can be evaluated against a term that is not transformed by lexical analysis. Creating an additional field for an intact string, plus using a content-preserving analyzer that emits whole-term tokens, is the solution for both pattern matching and for matching on query strings that include special characters.

Tip

If you are familiar with Postman and REST APIs, download the query examples collection to query the partial terms and special characters described in this article.

Azure Cognitive Search scans for whole tokenized terms in the index and won't find a match on a partial term unless you include wildcard placeholder operators (* and ?), or format the query as a regular expression. Partial terms are specified using these techniques:

  • Regular expression queries can be any regular expression that is valid under Apache Lucene.

  • Wildcard operators with prefix matching refers to a generally recognized pattern that includes the beginning of a term, followed by * or ? suffix operators, such as search=cap* matching on "Cap'n Jack's Waterfront Inn" or "Gacc Capital". Prefix matching is supported in both the simple and full Lucene query syntax.

  • Wildcard with infix and suffix matching places the * and ? operators inside or at the beginning of a term and requires regular expression syntax (where the expression is enclosed with forward slashes). For example, the query string (search=/.*numeric.*/) returns results on "alphanumeric" and "alphanumerical" as suffix and infix matches. Both the prefix and regular expression forms are shown as full requests in the sketch after this list.
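
As a point of reference, these query strings plug into the Search Documents REST API as shown in the following sketch (GET form). The service name, index name, and api-version are placeholders, an api-key header is also required on each call, and special characters such as the regular expression's slashes may need to be URL-encoded in practice.

    GET https://[service name].search.windows.net/indexes/[index name]/docs?search=cap*&queryType=simple&api-version=2020-06-30
    GET https://[service name].search.windows.net/indexes/[index name]/docs?search=/.*numeric.*/&queryType=full&api-version=2020-06-30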

For partial term or pattern search, and a few other query forms like fuzzy search, analyzers are not used at query time. For these query forms, which the parser detects by the presence of operators and delimiters, the query string is passed to the engine without lexical analysis, and the analyzer specified on the field is ignored.

Note

When a partial query string includes characters, such as slashes in a URL fragment, you might need to add escape characters. In JSON, a forward slash / is escaped with a backslash \. As such, search=/.*microsoft.com\/azure\/.*/ is the syntax for the URL fragment "microsoft.com/azure/".

Solving partial/pattern search problems

When you need to search on fragments, patterns, or special characters, you can override the default analyzer with a custom analyzer that operates under simpler tokenization rules, retaining the entire string in the index. Taking a step back, the approach looks like this:

  • Define a field to store an intact version of the string (assuming you want analyzed and non-analyzed text at query time)
  • Evaluate and choose among the various analyzers that emit tokens at the right level of granularity
  • Assign the analyzer to the field
  • Build and test the index

Tip

Evaluating analyzers is an iterative process that requires frequent index rebuilds. You can make this step easier by using Postman and the REST APIs for Create Index, Delete Index, Load Documents, and Search Documents. For Load Documents, the request body should contain a small representative data set that you want to test (for example, a field with phone numbers or product codes). With these APIs in the same Postman collection, you can cycle through these steps quickly.

Duplicate fields for different scenarios

Analyzers determine how terms are tokenized in an index. Since analyzers are assigned on a per-field basis, you can create fields in your index to optimize for different scenarios. For example, you might define "featureCode" and "featureCodeRegex" to support regular full text search on the first and advanced pattern matching on the second. The analyzers assigned to each field determine how the contents of each field are tokenized in the index.

{
  "name": "featureCode",
  "type": "Edm.String",
  "retrievable": true,
  "searchable": true,
  "analyzer": null
},
{
  "name": "featureCodeRegex",
  "type": "Edm.String",
  "retrievable": true,
  "searchable": true,
  "analyzer": "my_custom_analyzer"
},

Choose an analyzer

When choosing an analyzer that produces whole-term tokens, the following analyzers are common choices:

  • language analyzers - Preserves hyphens in compound words or strings, vowel mutations, and verb forms. If query patterns include dashes, using a language analyzer might be sufficient.
  • keyword - Content of the entire field is tokenized as a single term.
  • whitespace - Separates on white space only. Terms that include dashes or other characters are treated as a single token.
  • custom analyzer - (recommended) Creating a custom analyzer lets you specify both the tokenizer and the token filter. The previous analyzers must be used as-is; a custom analyzer lets you pick which tokenizers and token filters to use.

A recommended combination is the keyword tokenizer with a lower-case token filter. By itself, the predefined keyword analyzer does not lower-case any upper-case text, which can cause queries to fail. A custom analyzer gives you a mechanism for adding the lower-case token filter.

If you are using a web API test tool like Postman, you can add the Test Analyzer REST call to inspect tokenized output.

You must have a populated index to work with. Given an existing index and a field containing dashes or partial terms, you can try various analyzers over specific terms to see what tokens are emitted.
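
If you are calling the REST API directly rather than through a Postman collection, a minimal Test Analyzer (Analyze Text) call looks roughly like the following sketch. The service name, index name, api-version, and api-key values are placeholders; the request bodies shown in the steps below are posted to this endpoint.

    POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=2020-06-30
    api-key: [admin key]
    Content-Type: application/json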

  1. First, check the Standard analyzer to see how terms are tokenized by default.

    {
    "text": "SVP10-NOR-00",
    "analyzer": "standard"
    }
    
  2. Evaluate the response to see how the text is tokenized within the index. Notice how each term is lower-cased and broken up. Only queries that match on these tokens will return this document in the results. A query that includes "10-NOR" will fail.

    {
        "tokens": [
            {
                "token": "svp10",
                "startOffset": 0,
                "endOffset": 5,
                "position": 0
            },
            {
                "token": "nor",
                "startOffset": 6,
                "endOffset": 9,
                "position": 1
            },
            {
                "token": "00",
                "startOffset": 10,
                "endOffset": 12,
                "position": 2
            }
        ]
    }
    
  3. Now modify the request to use the whitespace or keyword analyzer:

    {
    "text": "SVP10-NOR-00",
    "analyzer": "keyword"
    }
    
  4. Now the response consists of a single token, upper-cased, with dashes preserved as a part of the string. If you need to search on a pattern or a partial term such as "10-NOR", the query engine now has the basis for finding a match (a query sketch follows these steps).

    {
        "tokens": [
            {
                "token": "SVP10-NOR-00",
                "startOffset": 0,
                "endOffset": 12,
                "position": 0
            }
        ]
    }
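
As an illustration, assuming the regex-enabled field uses the recommended custom analyzer described later (a keyword tokenizer plus a lower-case token filter), the intact, lower-cased token can then be matched with a regular expression query such as the following sketch. The featureCodeRegex field name is the hypothetical one from the duplicate-fields example, and the service name, index name, api-version, and api-key are placeholders.

    POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30
    api-key: [query key]
    Content-Type: application/json

    {
        "search": "/.*10-nor.*/",
        "queryType": "full",
        "searchFields": "featureCodeRegex"
    }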
    

Important

Be aware that query parsers often lower-case terms in a search expression when building the query tree. If you are using an analyzer that does not lower-case text inputs during indexing, and you are not getting expected results, this could be the reason. The solution is to add a lower-case token filter, as described in the "Use custom analyzers" section below.

Configure an analyzer

Whether you are evaluating analyzers or moving forward with a specific configuration, you will need to specify the analyzer on the field definition, and possibly configure the analyzer itself if you are not using a built-in analyzer. When swapping analyzers, you typically need to rebuild the index (drop, recreate, and reload).

Use built-in analyzers

Built-in or predefined analyzers can be specified by name on an analyzer property of a field definition, with no additional configuration required in the index. The following example demonstrates how you would set the whitespace analyzer on a field.

For other scenarios and to learn more about other built-in analyzers, see Predefined analyzers list.

    {
      "name": "phoneNumber",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "searchable": true,
      "analyzer": "whitespace"
    }
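
As a quick check, you could run a phone number through this analyzer with the Analyze Text request described in the previous section. This is a minimal sketch, and the sample value is illustrative only. With the whitespace analyzer, a value such as "+1 (425) 703-6214" should come back as the tokens "+1", "(425)", and "703-6214", so a query on "703-6214" can match as a whole term, while a fragment like "3-62" still requires a wildcard or regular expression.

    {
    "text": "+1 (425) 703-6214",
    "analyzer": "whitespace"
    }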

Use custom analyzers

If you are using a custom analyzer, define it in the index with a user-defined combination of tokenizer, token filter, and possible configuration settings. Next, reference it on a field definition, just as you would a built-in analyzer.

When the objective is whole-term tokenization, a custom analyzer that consists of a keyword tokenizer and a lower-case token filter is recommended.

  • The keyword tokenizer creates a single token for the entire contents of a field.
  • The lowercase token filter transforms upper-case letters into lower-case text. Query parsers typically lowercase any uppercase text inputs, and lower-casing homogenizes the inputs with the tokenized terms.

The following example illustrates a custom analyzer that provides the keyword tokenizer and a lowercase token filter.

{
"fields": [
  {
  "name": "accountNumber",
  "analyzer":"myCustomAnalyzer",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": false,
  "facetable": false
  }
],

"analyzers": [
  {
  "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
  "name":"myCustomAnalyzer",
  "charFilters":[],
  "tokenizer":"keyword_v2",
  "tokenFilters":["lowercase"]
  }
],
"tokenizers":[],
"charFilters": [],
"tokenFilters": []
}

Note

The keyword_v2 tokenizer and lowercase token filter are known to the system and use their default configurations, which is why you can reference them by name without having to define them first.
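
To confirm what the custom analyzer emits, you can point the Analyze Text request at it once the index has been created. This is a sketch that assumes the index definition above; the expected response is a single lower-cased token ("svp10-nor-00") with the dashes preserved.

    {
    "text": "SVP10-NOR-00",
    "analyzer": "myCustomAnalyzer"
    }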

Build and test

Once you have defined an index with analyzers and field definitions that support your scenario, load documents that have representative strings so that you can test partial string queries.
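
A minimal Load Documents request might look like the following sketch. The accountNumber field matches the hypothetical field used in the analyzer examples, the id field stands in for whatever key field your index defines, and the document values are illustrative only.

    POST https://[service name].search.windows.net/indexes/[index name]/docs/index?api-version=2020-06-30
    api-key: [admin key]
    Content-Type: application/json

    {
      "value": [
        { "@search.action": "upload", "id": "1", "accountNumber": "+1 (425) 703-6214" },
        { "@search.action": "upload", "id": "2", "accountNumber": "SVP10-NOR-00" }
      ]
    }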

The previous sections explained the logic. This section steps through each API you should call when testing your solution. As previously noted, if you use an interactive web test tool such as Postman, you can step through these tasks quickly.

  • Delete Index removes an existing index of the same name so that you can recreate it.

  • Create Index creates the index structure on your search service, including analyzer definitions and fields with an analyzer specification.

  • Load Documents imports documents having the same structure as your index, as well as searchable content. After this step, your index is ready to query or test.

  • Test Analyzer was introduced in Choose an analyzer. Test some of the strings in your index using a variety of analyzers to understand how terms are tokenized.

  • Search Documents explains how to construct a query request, using either simple syntax or full Lucene syntax for wildcards and regular expressions.

    For partial term queries, such as querying "3-6214" to find a match on "+1 (425) 703-6214", you can use the simple syntax: search=3-6214&queryType=simple.

    For infix and suffix queries, such as querying "num" or "numeric" to find a match on "alphanumeric", use the full Lucene syntax and a regular expression: search=/.*num.*/&queryType=full. Both queries are shown as full requests in the sketch after this list.
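
Issued as POST requests to the Search Documents endpoint, those two queries might look like the following sketch. The service name, index name, api-version, and api-key are placeholders, and the POST body form avoids having to URL-encode the slashes in the regular expression.

    POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30
    api-key: [query key]
    Content-Type: application/json

    {
      "search": "3-6214",
      "queryType": "simple"
    }

    POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30
    api-key: [query key]
    Content-Type: application/json

    {
      "search": "/.*num.*/",
      "queryType": "full"
    }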

Tune query performance

If you implement the recommended configuration that includes the keyword_v2 tokenizer and lower-case token filter, you might notice a decrease in query performance due to the additional token filter processing over existing tokens in your index.

The following example adds an EdgeNGramTokenFilter to make prefix matches faster. Additional tokens are generated for 2-25 character combinations that include characters (for example: MS, MSF, MSFT, MSFT/, MSFT/S, MSFT/SQ, MSFT/SQL, and so on, up to the full string).

As you can imagine, the additional tokenization results in a larger index. If you have sufficient capacity to accommodate the larger index, this approach, with its faster response time, might be a better solution.

{
"fields": [
  {
  "name": "accountNumber",
  "analyzer":"myCustomAnalyzer",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": false,
  "facetable": false
  }
],

"analyzers": [
  {
  "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
  "name":"myCustomAnalyzer",
  "charFilters":[],
  "tokenizer":"keyword_v2",
  "tokenFilters":["lowercase", "my_edgeNGram"]
  }
],
"tokenizers":[],
"charFilters": [],
"tokenFilters": [
  {
  "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
  "name":"my_edgeNGram",
  "minGram": 2,
  "maxGram": 25,
  "side": "front"
  }
]
}
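
To see what the modified analyzer emits, you could run a sample value through Analyze Text after rebuilding the index. This is a sketch that assumes the definition above; with minGram set to 2 and maxGram set to 25, a value such as "MSFT/SQL" should come back lower-cased and expanded into the front edge n-grams "ms", "msf", "msft", "msft/", "msft/s", "msft/sq", and "msft/sql", which is what allows prefix-style matches to resolve against whole tokens instead of wildcard scans.

    {
    "text": "MSFT/SQL",
    "analyzer": "myCustomAnalyzer"
    }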

Next steps

This article explained how analyzers both contribute to query problems and solve query problems. As a next step, take a closer look at analyzer impact on indexing and query processing. In particular, consider using the Analyze Text API to return tokenized output so that you can see exactly what an analyzer is creating for your index.