Simple query syntax in Azure Cognitive Search

Azure Cognitive Search implements two Lucene-based query languages: the Simple Query Parser and the Lucene Query Parser.

The simple parser is more flexible and will attempt to interpret a request even if it's not perfectly composed. Because of this flexibility, it is the default for queries in Azure Cognitive Search.

The simple syntax is used for query expressions passed in the search parameter of a Search Documents request. It should not be confused with the OData syntax used for the $filter parameter of the same Search Documents API. The search and $filter parameters have different syntaxes, each with its own rules for constructing queries, escaping strings, and so on.

Although the simple parser is based on the Apache Lucene Simple Query Parser class, the implementation in Azure Cognitive Search excludes fuzzy search. If you need fuzzy search or other advanced query forms, consider the full Lucene query syntax instead.

Invoke simple parsing

Simple syntax is the default. Invocation is only necessary if you are resetting the syntax from full to simple. To explicitly set the syntax, use the queryType search parameter. Valid values include queryType=simple and queryType=full, where simple is the default and full invokes the full Lucene query parser for more advanced queries.
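As a minimal sketch of setting the parser explicitly, the snippet below builds a GET URL for a Search Documents request with queryType=simple. The service name, index name, and api-version value are placeholders chosen for illustration and do not come from this article; a real request would also need an authentication header.

```python
from urllib.parse import quote, urlencode

def build_search_url(service: str, index: str, search: str,
                     query_type: str = "simple",
                     api_version: str = "2020-06-30") -> str:
    """Build a Search Documents GET URL (all argument values are illustrative)."""
    # Only the two values this article lists are accepted here.
    if query_type not in ("simple", "full"):
        raise ValueError("queryType must be 'simple' or 'full'")
    params = urlencode(
        {"api-version": api_version, "search": search, "queryType": query_type},
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{params}"

url = build_search_url("my-service", "hotels", "wifi luxury")
print(url)
```

Encoding spaces as %20 instead of "+" avoids confusing an encoded space with the AND operator when reading the URL.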

Syntax fundamentals

Any text with one or more terms is considered a valid starting point for query execution. Azure Cognitive Search will match documents containing any or all of the terms, including any variations found during analysis of the text.

As straightforward as this sounds, one aspect of query execution in Azure Cognitive Search might produce unexpected results: the number of results can increase rather than decrease as more terms and operators are added to the input string. Whether this expansion actually occurs depends on the inclusion of a NOT operator, combined with a searchMode parameter setting that determines how NOT is interpreted in terms of AND or OR behaviors. For more information, see NOT operator.

Precedence operators (grouping)

You can use parentheses to create subqueries, including operators within the parenthetical statement. For example, motel+(wifi|luxury) will search for documents containing the "motel" term and either "wifi" or "luxury" (or both).

Field grouping is similar but scopes the grouping to a single field. For example, hotelAmenities:(gym+(wifi|pool)) searches the field "hotelAmenities" for "gym" and "wifi", or "gym" and "pool".
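To make the grouping semantics concrete, here is a toy model (not the service's implementation) that evaluates motel+(wifi|luxury) against sets of tokens, using Python's own boolean operators and parentheses in place of + and |. The sample documents are invented for illustration.

```python
def matches(tokens: set) -> bool:
    """Toy evaluation of motel+(wifi|luxury) against a document's tokens."""
    return "motel" in tokens and ("wifi" in tokens or "luxury" in tokens)

# Hypothetical documents, each reduced to a set of tokens.
docs = {
    "a": {"motel", "wifi"},
    "b": {"motel", "luxury", "pool"},
    "c": {"motel", "pool"},
    "d": {"wifi", "luxury"},
}

hits = sorted(name for name, tokens in docs.items() if matches(tokens))
print(hits)  # ['a', 'b']
```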

Escaping search operators

In the simple syntax, search operators include these characters: + | " ( ) ' \

If any of these characters are part of a token in the index, escape it by prefixing it with a single backslash (\) in the query. For example, suppose you used a custom analyzer for whole term tokenization, and your index contains the string "Luxury+Hotel". To get an exact match on this token, insert an escape character: search=luxury\+hotel.

To make things simple for the more typical cases, there are two exceptions to this rule where escaping is not needed:

  • The NOT operator - only needs to be escaped if it's the first character after whitespace. If the - appears in the middle (for example, in 3352CDD0-EF30-4A2E-A512-3B30AF40F3FD), you can skip escaping.

  • The suffix operator * only needs to be escaped if it's the last character before whitespace. If the * appears in the middle (for example, in 4*4=16), no escaping is needed.
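The escaping rules above can be sketched as a small helper for text that should be matched literally. The function name is hypothetical, and treating the start and end of the string like whitespace boundaries is an assumption that goes slightly beyond what the rules state.

```python
# Characters this section lists as simple-syntax operators.
OPERATOR_CHARS = set('+|"()\'\\')

def escape_simple_term(text: str) -> str:
    """Backslash-escape operator characters so the text is matched literally."""
    out = []
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch in OPERATOR_CHARS:
            out.append("\\" + ch)
        elif ch == "-" and (i == 0 or text[i - 1].isspace()):
            # NOT operator: escape only at the start of a term (assumed to
            # include the start of the string).
            out.append("\\-")
        elif ch == "*" and (i == last or text[i + 1].isspace()):
            # Suffix operator: escape only at the end of a term (assumed to
            # include the end of the string).
            out.append("\\*")
        else:
            out.append(ch)
    return "".join(out)

print(escape_simple_term("luxury+hotel"))  # luxury\+hotel
print(escape_simple_term("4*4=16"))        # 4*4=16
print(escape_simple_term("-luxury"))       # \-luxury
```

Apply a helper like this only to user input that is meant to be literal; escaping everything would also disable operators the user typed on purpose.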

Note

By default, the standard analyzer deletes hyphens, whitespace, ampersands, and other characters during lexical analysis and breaks words on them. If you require special characters to remain in the query string, you might need an analyzer that preserves them in the index. Some choices include Microsoft natural language analyzers, which preserve hyphenated words, or a custom analyzer for more complex patterns. For more information, see Partial terms, patterns, and special characters.

Encoding unsafe and reserved characters in URLs

Ensure that all unsafe and reserved characters are encoded in a URL. For example, '#' is an unsafe character because it is a fragment/anchor identifier in a URL. The character must be encoded to %23 if used in a URL. '&' and '=' are examples of reserved characters, as they delimit parameters and specify values in Azure Cognitive Search. See RFC1738: Uniform Resource Locators (URL) for more details.

Unsafe characters are " ` < > # % { } | \ ^ ~ [ ]. Reserved characters are ; / ? : @ = + &.
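One way to apply this encoding, sketched here with Python's standard library, is to percent-encode each parameter value before placing it in the query string. The helper name is hypothetical. Note that Python's quote follows the later RFC 3986, so it leaves "~" unencoded even though RFC 1738 lists it as unsafe.

```python
from urllib.parse import quote

def encode_search_value(value: str) -> str:
    """Percent-encode a query-string value; safe='' exempts no reserved character."""
    return quote(value, safe="")

print(encode_search_value("#"))               # %23
print(encode_search_value("luxury\\+hotel"))  # luxury%5C%2Bhotel
print(encode_search_value("wifi & pool"))     # wifi%20%26%20pool
```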

Querying for special characters

In some circumstances, you may want to search for a special character, like the '❤' emoji or the '€' sign. In those cases, make sure that the analyzer you use does not filter those characters out. The standard analyzer ignores many special characters, so they would not become tokens in your index.

So the first step is to make sure you use an analyzer that treats those elements as tokens. For instance, the "whitespace" analyzer treats any character sequence separated by whitespace as a token, so the "❤" string would be considered a token. Similarly, an analyzer like the Microsoft English analyzer ("en.microsoft") would treat the "€" string as a token. You can test an analyzer to see what tokens it generates for a given query.

When using Unicode characters, make sure symbols are properly escaped in the query URL (for instance, "❤" is percent-encoded as %E2%9D%A4). Postman does this translation automatically.
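If you build URLs in code rather than in Postman, the same translation can be done with a standard percent-encoding routine. The sketch below uses Python's urllib, which encodes the UTF-8 bytes of each character; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

heart = "\u2764"  # the heart symbol, U+2764
euro = "\u20ac"   # the euro sign, U+20AC

encoded_heart = quote(heart, safe="")
encoded_euro = quote(euro, safe="")

print(encoded_heart)  # %E2%9D%A4
print(encoded_euro)   # %E2%82%AC

# Decoding restores the original character.
print(unquote(encoded_heart) == heart)  # True
```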

Query size limits

There is a limit to the size of queries that you can send to Azure Cognitive Search. Specifically, you can have at most 1024 clauses (expressions separated by AND, OR, and so on). There is also a limit of approximately 32 KB on the size of any individual term in a query. If your application generates search queries programmatically, we recommend designing it so that it does not generate queries of unbounded size.
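For applications that generate queries programmatically, a guard along the following lines can reject oversized input before a request is sent. This is a sketch under two assumptions: each term is counted as one clause, which simplifies how clauses are actually counted, and "approximately 32 KB" is taken as exactly 32 * 1024 UTF-8 bytes. The function name is hypothetical.

```python
MAX_CLAUSES = 1024          # clause limit stated in this section
MAX_TERM_BYTES = 32 * 1024  # "approximately 32 KB"; exact threshold is an assumption

def build_or_query(terms: list) -> str:
    """Join terms with the OR operator after checking them against the limits.

    Assumes one clause per term, which is a simplification.
    """
    if len(terms) > MAX_CLAUSES:
        raise ValueError(f"{len(terms)} terms exceeds the {MAX_CLAUSES}-clause limit")
    for term in terms:
        if len(term.encode("utf-8")) > MAX_TERM_BYTES:
            raise ValueError("a term exceeds the assumed 32 KB size limit")
    return " | ".join(terms)

print(build_or_query(["wifi", "luxury", "pool"]))  # wifi | luxury | pool
```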

You can embed Boolean operators (AND, OR, NOT) in a query string to build a rich set of criteria against which matching documents are found.

AND operator +

The AND operator is a plus sign. For example, wifi + luxury will search for documents containing both wifi and luxury.

OR operator |

The OR operator is a vertical bar or pipe character. For example, wifi | luxury will search for documents containing either wifi or luxury or both.

NOT operator -

The NOT operator is a minus sign. For example, wifi -luxury will search for documents that have the wifi term and/or do not have luxury, depending on the searchMode setting.

The searchMode parameter on a query request controls whether a term with the NOT operator is ANDed or ORed with other terms in the query (assuming there is no + or | operator on the other terms). Valid values include any or all.

searchMode=any increases the recall of queries by including more results, and by default - will be interpreted as "OR NOT". For example, wifi -luxury will match documents that either contain the term wifi or do not contain the term luxury.

searchMode=all increases the precision of queries by including fewer results, and by default - will be interpreted as "AND NOT". For example, wifi -luxury will match documents that contain the term wifi and do not contain the term luxury. This is arguably a more intuitive behavior for the - operator. Therefore, you should consider using searchMode=all instead of searchMode=any if you want to optimize searches for precision instead of recall, and your users frequently use the - operator in searches.

When deciding on a searchMode setting, consider the user interaction patterns for queries in various applications. Users who are searching for information are more likely to include an operator in a query than users of e-commerce sites, which have more built-in navigation structures.
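The difference between the two modes can be illustrated with a toy model over invented documents. This is not how the service evaluates queries; it only mirrors the "OR NOT" and "AND NOT" readings of wifi -luxury described above.

```python
# Hypothetical documents, each reduced to a set of tokens.
docs = {
    "d1": {"wifi", "luxury"},
    "d2": {"wifi"},
    "d3": {"luxury"},
    "d4": {"pool"},
}

def wifi_not_luxury(tokens: set, search_mode: str) -> bool:
    """Toy evaluation of `wifi -luxury` under each searchMode."""
    has_wifi = "wifi" in tokens
    lacks_luxury = "luxury" not in tokens
    if search_mode == "any":
        return has_wifi or lacks_luxury   # "OR NOT"
    if search_mode == "all":
        return has_wifi and lacks_luxury  # "AND NOT"
    raise ValueError("searchMode must be 'any' or 'all'")

results = {
    mode: sorted(d for d, t in docs.items() if wifi_not_luxury(t, mode))
    for mode in ("any", "all")
}
print(results["any"])  # ['d1', 'd2', 'd4']
print(results["all"])  # ['d2']
```

In this model, a query for wifi alone would match only d1 and d2, so adding -luxury under searchMode=any expands the result set to three documents, while searchMode=all narrows it to one.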

Wildcard prefix matching (*, ?)

For "starts with" queries, add a suffix operator as the placeholder for the remainder of a term. Use an asterisk * for multiple characters or ? for single characters. For example, lingui* will match on "linguistic" or "linguini", ignoring case.

Similar to filters, a prefix query looks for an exact match. As such, there is no relevance scoring (all results receive a search score of 1.0). Be aware that prefix queries can be slow, especially if the index is large and the prefix consists of a small number of characters. An alternative methodology, such as edge n-gram tokenization, might perform faster.

For other wildcard query variants, such as suffix or infix matching against the end or middle of a term, use the full Lucene syntax for wildcard search.
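As a rough illustration of what a "starts with" query does, the following toy function matches a pattern ending in * against a list of tokens, ignoring case. It is a sketch of the matching rule only, with a hypothetical function name and sample tokens, and says nothing about how the index performs the lookup.

```python
def prefix_match(pattern: str, tokens: list) -> list:
    """Toy 'starts with' match for a pattern ending in *, ignoring case."""
    if not pattern.endswith("*"):
        raise ValueError("expected a trailing * suffix operator")
    prefix = pattern[:-1].lower()
    return [t for t in tokens if t.lower().startswith(prefix)]

matched = prefix_match("lingui*", ["Linguistic", "linguini", "lingo", "penguin"])
print(matched)  # ['Linguistic', 'linguini']
```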

Phrase search "

A term search is a query for one or more terms, where any of the terms are considered a match. A phrase search is an exact phrase enclosed in quotation marks " ". For example, while Roach Motel (without quotes) would search for documents containing Roach and/or Motel anywhere in any order, "Roach Motel" (with quotes) will only match documents that contain that whole phrase together and in that order (lexical analysis still applies).

See also
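The contrast between term search and phrase search can be sketched with a toy matcher. The tokenizer below (lowercase, split on whitespace) is a crude stand-in for lexical analysis, and the function names and sample text are invented for illustration.

```python
def tokenize(text: str) -> list:
    """Rough stand-in for lexical analysis: lowercase and split on whitespace."""
    return text.lower().split()

def term_match(query: str, doc: str) -> bool:
    """True if any query term appears anywhere in the document."""
    doc_tokens = set(tokenize(doc))
    return any(t in doc_tokens for t in tokenize(query))

def phrase_match(phrase: str, doc: str) -> bool:
    """True if all phrase terms appear adjacent and in order."""
    p, d = tokenize(phrase), tokenize(doc)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

doc = "The motel had a roach problem"
print(term_match("Roach Motel", doc))    # True
print(phrase_match("Roach Motel", doc))  # False
print(phrase_match("Roach Motel", "Welcome to the Roach Motel"))  # True
```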