Lucene query syntax in Azure Cognitive Search

You can write queries against Azure Cognitive Search based on the rich Lucene Query Parser syntax for specialized query forms, such as wildcard, fuzzy search, proximity search, and regular expressions. Much of the Lucene Query Parser syntax is implemented intact in Azure Cognitive Search, with the exception of range searches, which are constructed in Azure Cognitive Search through $filter expressions.

Note

The full Lucene syntax is used for query expressions passed in the search parameter of the Search Documents API. Do not confuse it with the OData syntax used for the $filter parameter of that API. The two syntaxes have their own rules for constructing queries, escaping strings, and so on.

Invoke full parsing

Set the queryType search parameter to specify which parser to use. Valid values are simple and full, where simple is the default and full selects the Lucene parser.

Example showing full syntax

The following example finds documents in the index using the Lucene query syntax, as indicated by the queryType=full parameter. This query returns hotels where the category field contains the term "budget" and any searchable field contains the phrase "recently renovated". Documents containing the phrase "recently renovated" are ranked higher as a result of the term boost value (3).

The searchMode=all parameter is relevant in this example. Whenever operators appear in the query, you should generally set searchMode=all to ensure that all of the criteria are matched.

GET /indexes/hotels/docs?search=category:budget AND "recently renovated"^3&searchMode=all&api-version=2020-06-30&queryType=full

Alternatively, use POST:

POST /indexes/hotels/docs/search?api-version=2020-06-30
{
  "search": "category:budget AND \"recently renovated\"^3",
  "queryType": "full",
  "searchMode": "all"
}
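
As a rough sketch of how a client might assemble these two requests, the following Python builds the GET URL (with the query percent-encoded) and the POST body, without sending anything. The host name example-service.search.windows.net and the helper names build_get_url and build_post_request are placeholders introduced for illustration, not part of the service API.

```python
import json
from urllib.parse import quote, urlencode

# Hypothetical endpoint; substitute the URL of your own search service.
ENDPOINT = "https://example-service.search.windows.net"
API_VERSION = "2020-06-30"


def build_get_url(index, search):
    """Return a GET URL for a full Lucene query, percent-encoding every value."""
    params = {
        "search": search,
        "queryType": "full",
        "searchMode": "all",
        "api-version": API_VERSION,
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{ENDPOINT}/indexes/{index}/docs?{urlencode(params, quote_via=quote)}"


def build_post_request(index, search):
    """Return the POST URL and the JSON body for the same query."""
    url = f"{ENDPOINT}/indexes/{index}/docs/search?api-version={API_VERSION}"
    body = json.dumps({"search": search, "queryType": "full", "searchMode": "all"})
    return url, body


query = 'category:budget AND "recently renovated"^3'
get_url = build_get_url("hotels", query)
post_url, post_body = build_post_request("hotels", query)
print(get_url)
print(post_url)
print(post_body)
```

With POST, the query travels in the JSON body, so only JSON string escaping applies; with GET, the whole expression has to be percent-encoded.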

For additional examples, see Lucene query syntax examples for building queries in Azure Cognitive Search. For details about specifying the full contingent of query parameters, see Search Documents (Azure Cognitive Search REST API).

Note

Azure Cognitive Search also supports Simple Query Syntax, a simple and robust query language that can be used for straightforward keyword search.

Syntax fundamentals

The following syntax fundamentals apply to all queries that use the Lucene syntax.

Operator evaluation in context

Placement determines whether a symbol is interpreted as an operator or just another character in a string.

For example, in Lucene full syntax, the tilde (~) is used for both fuzzy search and proximity search. When placed after a quoted phrase, ~ invokes proximity search. When placed at the end of a term, ~ invokes fuzzy search.

Within a term, such as "business~analyst", the character is not evaluated as an operator. In this case, assuming the query is a term or phrase query, full text search with lexical analysis strips out the ~ and breaks the term "business~analyst" in two: business OR analyst.

The example above uses the tilde (~), but the same principle applies to every operator.
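
The placement rule for the tilde can be illustrated with a small classifier. This is only a sketch of the rule as described above, using Python regular expressions; it is not how the Lucene parser is implemented, and the function name tilde_role is hypothetical.

```python
import re


def tilde_role(expression):
    """Classify how a tilde in a single query expression would be read."""
    # A quoted phrase followed by ~ and an optional number: proximity search.
    if re.fullmatch(r'"[^"]+"~\d*', expression):
        return "proximity"
    # A single term ending in ~ with an optional edit distance of 0 to 2: fuzzy search.
    if re.fullmatch(r'[^\s"~]+~[0-2]?', expression):
        return "fuzzy"
    # A tilde anywhere else is just another character in the string.
    if "~" in expression:
        return "literal"
    return "none"


for expr in ['"hotel airport"~5', "blue~1", "business~analyst", "hotel"]:
    print(expr, "->", tilde_role(expr))
```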

Escaping special characters

To use any of the search operators as part of the search text, escape the character by prefixing it with a single backslash (\). For example, for a wildcard search on https://, where :// is part of the query string, you would specify search=https\:\/\/*. Similarly, an escaped phone number pattern might look like this: \+1 \(800\) 642\-7676.

Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
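
A minimal sketch of an escaping helper follows. It prefixes each character in the list above with a backslash; the function name escape_lucene is hypothetical, and the helper is meant only for text that should be searched literally, since it would also escape operators you intend to use.

```python
# The special characters listed above, including the backslash itself.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(text):
    """Prefix every special character in text with a single backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_lucene("https://") + "*")       # append the wildcard after escaping
print(escape_lucene("+1 (800) 642-7676"))
```

The wildcard is appended after escaping so that it keeps its meaning as an operator.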

Note

Although escaping keeps tokens together, lexical analysis during indexing may strip them out. For example, the standard Lucene analyzer breaks words on hyphens, whitespace, and other characters. If you require special characters in the query string, you might need an analyzer that preserves them in the index. Some choices include Microsoft natural language analyzers, which preserve hyphenated words, or a custom analyzer for more complex patterns. For more information, see Partial terms, patterns, and special characters.

Encoding unsafe and reserved characters in URLs

Ensure that all unsafe and reserved characters in a URL are encoded. For example, '#' is an unsafe character because it is a fragment/anchor identifier in a URL. The character must be encoded to %23 if used in a URL. '&' and '=' are examples of reserved characters because they delimit parameters and specify values in Azure Cognitive Search. See RFC1738: Uniform Resource Locators (URL) for more details.

Unsafe characters are " ` < > # % { } | \ ^ ~ [ ]. Reserved characters are ; / ? : @ = + &.
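
As an illustration, Python's standard urllib.parse.quote can perform this encoding for a query value. The wrapper name encode_query_value is hypothetical. Note that quote follows the later RFC 3986 rules, so it leaves ~ unencoded even though ~ appears in the unsafe list above.

```python
from urllib.parse import quote


def encode_query_value(value):
    """Percent-encode a value for use in a URL query string."""
    # safe="" means no reserved character (such as '/') is left unencoded.
    return quote(value, safe="")


print(encode_query_value("#"))               # the fragment identifier
print(encode_query_value("a&b=c"))           # reserved delimiters
print(encode_query_value("https\\:\\/\\/*"))  # an escaped Lucene expression
```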

Query size limits

There is a limit to the size of queries that you can send to Azure Cognitive Search. Specifically, you can have at most 1024 clauses (expressions separated by AND, OR, and so on). There is also a limit of approximately 32 KB on the size of any individual term in a query. If your application generates search queries programmatically, we recommend designing it so that it does not generate queries of unbounded size.
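
One way an application might guard against unbounded queries is to validate the inputs before joining them. The sketch below assumes that each term counts as one clause and treats 32 KB as 32 × 1024 bytes of UTF-8; both are simplifying assumptions, and the service may count differently. The name build_or_query is hypothetical.

```python
MAX_CLAUSES = 1024
MAX_TERM_BYTES = 32 * 1024  # the documented limit is approximate


def build_or_query(terms):
    """Join terms with OR, refusing inputs that would exceed the size limits."""
    terms = list(terms)
    if len(terms) > MAX_CLAUSES:
        raise ValueError(f"{len(terms)} clauses exceeds the limit of {MAX_CLAUSES}")
    for term in terms:
        if len(term.encode("utf-8")) > MAX_TERM_BYTES:
            raise ValueError("a term exceeds the approximate 32 KB limit")
    return " OR ".join(terms)


print(build_or_query(["wifi", "luxury", "pool"]))
```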

Precedence operators (grouping)

You can use parentheses to create subqueries, including operators within the parenthetical statement. For example, motel+(wifi||luxury) will search for documents containing the "motel" term and either "wifi" or "luxury" (or both).

Field grouping is similar but scopes the grouping to a single field. For example, hotelAmenities:(gym+(wifi||pool)) searches the field "hotelAmenities" for "gym" and "wifi", or "gym" and "pool".

Always specify text Boolean operators (AND, OR, NOT) in all caps.

OR operator OR or ||

The OR operator is the keyword OR or a double vertical bar (pipe) character. For example, wifi || luxury will search for documents containing either "wifi" or "luxury" or both. Because OR is the default conjunction operator, you could also leave it out, such that wifi luxury is the equivalent of wifi || luxury.

AND operator AND, && or +

The AND operator is the keyword AND, a double ampersand, or a plus sign. For example, wifi && luxury will search for documents containing both "wifi" and "luxury". The plus character (+) is used for required terms. For example, +wifi +luxury stipulates that both terms must appear somewhere in a field of a single document.

NOT operator NOT, ! or -

The NOT operator is the keyword NOT, an exclamation point, or a minus sign. For example, wifi -luxury will search for documents that have the wifi term and/or do not have luxury, depending on the searchMode setting.

The searchMode parameter on a query request controls whether a term with the NOT operator is ANDed or ORed with other terms in the query (assuming there is no + or | operator on the other terms). Valid values include any or all.

searchMode=any increases the recall of queries by including more results, and by default - will be interpreted as "OR NOT". For example, wifi -luxury will match documents that either contain the term wifi or those that do not contain the term luxury.

searchMode=all increases the precision of queries by including fewer results, and by default - will be interpreted as "AND NOT". For example, wifi -luxury will match documents that contain the term wifi and do not contain the term luxury. This is arguably a more intuitive behavior for the - operator. Therefore, you should consider using searchMode=all instead of searchMode=any if you want to optimize searches for precision instead of recall, and your users frequently use the - operator in searches.

When deciding on a searchMode setting, consider the user interaction patterns for queries in various applications. Users who are searching for information are more likely to include an operator in a query than users of e-commerce sites, which have more built-in navigation structures.
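
The difference between the two modes for the query wifi -luxury can be simulated locally. The sketch below models each document as a set of terms and applies the "OR NOT" and "AND NOT" readings described above; the documents and the function name are made up for illustration.

```python
def matches_wifi_not_luxury(doc_terms, search_mode):
    """Evaluate the query 'wifi -luxury' against a set of document terms."""
    has_wifi = "wifi" in doc_terms
    lacks_luxury = "luxury" not in doc_terms
    if search_mode == "any":   # '-' is read as OR NOT
        return has_wifi or lacks_luxury
    if search_mode == "all":   # '-' is read as AND NOT
        return has_wifi and lacks_luxury
    raise ValueError("search_mode must be 'any' or 'all'")


# Hypothetical documents, keyed by ID.
docs = {
    "a": {"wifi", "pool"},
    "b": {"wifi", "luxury"},
    "c": {"pool"},
    "d": {"luxury"},
}

any_hits = sorted(k for k, terms in docs.items() if matches_wifi_not_luxury(terms, "any"))
all_hits = sorted(k for k, terms in docs.items() if matches_wifi_not_luxury(terms, "all"))
print("any:", any_hits)
print("all:", all_hits)
```

With searchMode=any, only the document that contains luxury but not wifi is excluded; with searchMode=all, only the document that contains wifi and lacks luxury is returned.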

You can define a fielded search operation with the fieldName:searchExpression syntax, where the search expression can be a single word or a phrase, or a more complex expression in parentheses, optionally with Boolean operators. Some examples include the following:

  • genre:jazz NOT history

  • artists:("Miles Davis" "John Coltrane")

Be sure to put multiple strings within quotation marks if you want both strings to be evaluated as a single entity, in this case searching for two distinct artists in the artists field.

The field specified in fieldName:searchExpression must be a searchable field. See Create Index for details on how index attributes are used in field definitions.

Note

When using fielded search expressions, you do not need to use the searchFields parameter because each fielded search expression has a field name explicitly specified. However, you can still use the searchFields parameter if you want to run a query where some parts are scoped to a specific field, and the rest could apply to several fields. For example, the query search=genre:jazz NOT history&searchFields=description would match jazz only to the genre field, while it would match NOT history with the description field. The field name provided in fieldName:searchExpression always takes precedence over the searchFields parameter, which is why in this example, we do not need to include genre in the searchFields parameter.
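
A small helper can make the quoting rule for fielded expressions concrete. This is a sketch only: the name fielded is hypothetical, multi-word values are wrapped in quotation marks so each is treated as one phrase, and the helper assumes the values contain no special characters that need escaping.

```python
def fielded(field, *values):
    """Build a fieldName:searchExpression string from one or more values."""
    # Quote any value containing a space so it is evaluated as a single phrase.
    quoted = [f'"{v}"' if " " in v else v for v in values]
    if len(quoted) == 1:
        return f"{field}:{quoted[0]}"
    # Several values are grouped in parentheses and scoped to the same field.
    return f"{field}:({' '.join(quoted)})"


print(fielded("genre", "jazz") + " NOT history")
print(fielded("artists", "Miles Davis", "John Coltrane"))
```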

A fuzzy search finds matches in terms that have a similar construction, expanding a term up to the maximum of 50 terms that meet the distance criteria of two or less. For more information, see Fuzzy search.

To do a fuzzy search, use the tilde "~" symbol at the end of a single word with an optional parameter, a number between 0 and 2 (default), that specifies the edit distance. For example, "blue~" or "blue~1" would return "blue", "blues", and "glue".

Fuzzy search can only be applied to terms, not phrases, but you can append the tilde to each term individually in a multi-part name or phrase. For example, "Unviersty~ of~ Wshington~" would match on "University of Washington".
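
To make the edit distance concrete, the sketch below computes a Damerau-Levenshtein distance (optimal string alignment variant), in which an insertion, deletion, substitution, or transposition of adjacent characters each costs 1. It illustrates the measure only; the search engine does not compute distances this way internally, and the function name osa_distance is hypothetical.

```python
def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


for query_term, indexed_term in [
    ("blue", "blues"),
    ("blue", "glue"),
    ("unviersty", "university"),
    ("wshington", "washington"),
]:
    print(query_term, indexed_term, osa_distance(query_term, indexed_term))
```

Each pair is within a distance of 2, which is why the misspelled terms in the example above can still match.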

Proximity searches are used to find terms that are near each other in a document. Insert a tilde "~" symbol at the end of a phrase followed by the number of words that create the proximity boundary. For example, "hotel airport"~5 will find the terms "hotel" and "airport" within 5 words of each other in a document.
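
A proximity expression can be assembled with a short helper such as the following sketch. The name proximity is hypothetical, and the checks (at least two words, a non-negative distance) are assumptions about sensible input rather than documented service rules.

```python
def proximity(phrase, distance):
    """Build a proximity expression: a quoted phrase followed by ~ and a word count."""
    if distance < 0:
        raise ValueError("distance must not be negative")
    if len(phrase.split()) < 2:
        raise ValueError("a proximity search needs a phrase of at least two words")
    return f'"{phrase}"~{distance}'


print(proximity("hotel airport", 5))
```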

Term boosting

Term boosting refers to ranking a document higher if it contains the boosted term, relative to documents that do not contain the term. This differs from scoring profiles in that scoring profiles boost certain fields, rather than specific terms.

The following example helps illustrate the differences. Suppose that there's a scoring profile that boosts matches in a certain field, say genre in the musicstoreindex example. Term boosting could be used to further boost certain search terms higher than others. For example, rock^2 electronic will boost documents that contain the search terms in the genre field higher than other searchable fields in the index. Further, documents that contain the search term rock will be ranked higher than documents that contain the other search term, electronic, as a result of the term boost value (2).

To boost a term, use the caret symbol "^" with a boost factor (a number) at the end of the term you are searching. You can also boost phrases. The higher the boost factor, the more relevant the term will be relative to other search terms. By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (for example, 0.20).
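
The boosting rules above can be sketched as a helper that appends the caret and factor, quoting phrases and rejecting non-positive factors. The name boost is hypothetical, and the helper assumes the expression needs no further escaping.

```python
def boost(expression, factor):
    """Append ^factor to a term, or to a quoted phrase if the expression has spaces."""
    if factor <= 0:
        raise ValueError("the boost factor must be positive")
    target = f'"{expression}"' if " " in expression else expression
    # The 'g' format drops a trailing '.0', so 2 and 2.0 both render as '2'.
    return f"{target}^{factor:g}"


print(boost("rock", 2) + " electronic")
print(boost("recently renovated", 3))
print(boost("electronic", 0.2))
```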

A regular expression search finds a match based on patterns that are valid under Apache Lucene, as documented in the RegExp class. In Azure Cognitive Search, a regular expression is enclosed between forward slashes /.

For example, to find documents containing "motel" or "hotel", specify /[mh]otel/. Regular expression searches are matched against single words.

Some tools and languages impose additional escape character requirements. For JSON, strings that include a forward slash are escaped with a backward slash: "microsoft.com/azure/" becomes search=/.*microsoft.com\/azure\/.*/ where search=/.* <string-placeholder>.*/ sets up the regular expression, and microsoft.com\/azure\/ is the string with an escaped forward slash.
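
The layering of escapes can be seen by building the expression in code. In this sketch the hypothetical helper regex_query escapes forward slashes inside an unescaped pattern and adds the / delimiters; a JSON serializer then doubles each backslash when it writes the request body.

```python
import json


def regex_query(pattern):
    """Wrap an unescaped pattern in / delimiters, escaping forward slashes inside it."""
    return "/" + pattern.replace("/", "\\/") + "/"


search = regex_query(".*microsoft.com/azure/.*")
body = json.dumps({"search": search, "queryType": "full"})

print(search)  # the expression the query parser receives
print(body)    # the serialized body, with each backslash doubled by JSON
```

Parsing the body restores the single backslashes, so the query parser sees the same expression in both cases.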

You can use generally recognized syntax for multiple (*) or single (?) character wildcard searches. For example, a query expression of search=alpha* returns "alphanumeric" or "alphabetical". Note that the Lucene query parser supports the use of these symbols with a single term, and not a phrase.

Full Lucene syntax supports prefix, infix, and suffix matching. However, if all you need is prefix matching, you can use the simple syntax (prefix matching is supported in both).

Suffix matching, where * or ? precedes the string (as in search=/.*numeric./) or infix matching requires full Lucene syntax, as well as the regular expression forward slash / delimiters. You cannot use a * or ? symbol as the first character of a term, or within a term, without the /.
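
The following sketch imitates prefix and suffix matching against a handful of single terms, using Python's fnmatch and re modules as stand-ins. Their pattern dialects differ from Lucene's in general, so this only illustrates the idea for these simple patterns; the term list and the suffix pattern .*numeric are made up for the example.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical indexed terms.
terms = ["alphanumeric", "alphabetical", "numeric", "beta"]

# Prefix matching, as in search=alpha*
prefix_hits = [t for t in terms if fnmatchcase(t, "alpha*")]

# Suffix matching written as a regular expression that must match the whole term.
suffix_hits = [t for t in terms if re.fullmatch(r".*numeric", t)]

print(prefix_hits)
print(suffix_hits)
```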

Note

As a rule, pattern matching is slow, so you might want to explore alternative methods, such as edge n-gram tokenization that creates tokens for sequences of characters in a term. The index will be larger, but queries might execute faster, depending on the pattern construction and the length of strings you are indexing.

Impact of an analyzer on wildcard queries

During query parsing, queries that are formulated as prefix, suffix, wildcard, or regular expressions are passed as-is to the query tree, bypassing lexical analysis. Matches will only be found if the index contains the strings in the format your query specifies. In most cases, you will need an analyzer during indexing that preserves string integrity so that partial term and pattern matching succeeds. For more information, see Partial term search in Azure Cognitive Search queries.

Consider a situation where you may want the search query 'terminat*' to return results that contain terms such as 'terminate', 'termination' and 'terminates'.

If you were to use the en.lucene (English Lucene) analyzer, it would apply aggressive stemming of each term. For example, 'terminate', 'termination', and 'terminates' will all be tokenized down to the token 'termi' in your index. However, terms in queries using wildcards or fuzzy search are not analyzed at all, so there would be no results that would match the 'terminat*' query.

By contrast, the Microsoft analyzers (in this case, the en.microsoft analyzer) are a bit more advanced and use lemmatization instead of stemming. This means that all generated tokens should be valid English words. For example, 'terminate', 'terminates' and 'termination' will mostly stay whole in the index, which makes these analyzers a preferable choice for scenarios that depend a lot on wildcards and fuzzy search.
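
The effect can be reproduced with a toy index. In this sketch the token sets are hard-coded to mirror the behavior described above (they are not produced by running either analyzer), and the unanalyzed prefix terminat is compared directly against the indexed tokens.

```python
def prefix_matches(prefix, index_tokens):
    """Return the indexed tokens that start with the unanalyzed query prefix."""
    return sorted(t for t in index_tokens if t.startswith(prefix))


# Tokens as the text describes them for each analyzer (assumed, not computed).
stemmed_index = {"termi"}
whole_word_index = {"terminate", "terminates", "termination"}

print(prefix_matches("terminat", stemmed_index))
print(prefix_matches("terminat", whole_word_index))
```

Because the stemmed token is shorter than the query prefix, nothing in the first index can match, while the second index returns all three words.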

Scoring wildcard and regex queries

Azure Cognitive Search uses frequency-based scoring (TF-IDF) for text queries. However, for wildcard and regex queries where the scope of terms can potentially be broad, the frequency factor is ignored to prevent the ranking from biasing towards matches from rarer terms. All matches are treated equally for wildcard and regex searches.

See also