Understanding OData collection filters in Azure Cognitive Search

To filter on collection fields in Azure Cognitive Search, you can use the any and all operators together with lambda expressions. A lambda expression is a Boolean expression that refers to a range variable. The any and all operators are analogous to a for loop in most programming languages: the range variable plays the role of the loop variable, and the lambda expression is the body of the loop. During iteration, the range variable takes on the "current" value of the collection.

At least, that's how it works conceptually. In reality, Azure Cognitive Search implements filters very differently from how for loops work. Ideally, this difference would be invisible to you, but in certain situations it isn't. The end result is that there are rules you have to follow when writing lambda expressions.
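The conceptual loop model described above can be sketched in a few lines of Python. This is only an illustration of the mental model, not of how the service executes filters; the helper names `odata_any` and `odata_all` and the sample values are assumptions made for this example.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def odata_any(collection: Iterable[T], body: Callable[[T], bool]) -> bool:
    """Conceptual 'any': true if the lambda body holds for at least one element."""
    for current in collection:      # 'current' plays the role of the range variable
        if body(current):
            return True
    return False


def odata_all(collection: Iterable[T], body: Callable[[T], bool]) -> bool:
    """Conceptual 'all': true if the lambda body holds for every element."""
    for current in collection:
        if not body(current):
            return False
    return True


# Hypothetical string collection used only for illustration.
seasons = ["spring", "summer", "fall"]

# Conceptually: seasons/any(s: s eq 'fall')
has_fall = odata_any(seasons, lambda s: s == "fall")

# Conceptually: seasons/all(s: s ne 'winter')
no_winter = odata_all(seasons, lambda s: s != "winter")
```

In this model, any over an empty collection is false and all over an empty collection is true, which matches the usual behavior of existential and universal quantifiers.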

This article explains why the rules for collection filters exist by exploring how Azure Cognitive Search executes these filters. If you're writing advanced filters with complex lambda expressions, this article can help you understand what's possible in filters and why.

For information on what the rules for collection filters are, including examples, see Troubleshooting OData collection filters in Azure Cognitive Search.

Why collection filters are limited

There are three underlying reasons why not all filter features are supported for all types of collections:

  1. Only certain operators are supported for certain data types. For example, it doesn't make sense to compare the Boolean values true and false using lt, gt, and so on.
  2. Azure Cognitive Search doesn't support correlated search on fields of type Collection(Edm.ComplexType).
  3. Azure Cognitive Search uses inverted indexes to execute filters over all types of data, including collections.

The first reason is just a consequence of how the OData language and EDM type system are defined. The last two are explained in more detail in the rest of this article.

When you apply multiple filter criteria over a collection of complex objects, the criteria are correlated because they apply to each object in the collection. For example, the following filter returns hotels that have at least one deluxe room with a rate less than 100:

    Rooms/any(room: room/Type eq 'Deluxe Room' and room/BaseRate lt 100)

If filtering were uncorrelated, the above filter might return hotels where one room is deluxe and a different room has a base rate less than 100. That wouldn't make sense, since both clauses of the lambda expression apply to the same range variable, namely room. This is why such filters are correlated.
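To make the distinction concrete, here is a minimal Python sketch that evaluates the same two criteria in a correlated and an uncorrelated way. The sample hotel and its values are hypothetical, and the functions only model the semantics, not the service's implementation.

```python
# Hypothetical hotel: the deluxe room is expensive, the cheap room is not deluxe.
hotel = {
    "Rooms": [
        {"Type": "Deluxe Room", "BaseRate": 250.0},
        {"Type": "Standard Room", "BaseRate": 80.0},
    ]
}


def correlated_match(doc: dict) -> bool:
    """Both criteria must hold for the SAME room (the same range variable)."""
    return any(
        room["Type"] == "Deluxe Room" and room["BaseRate"] < 100
        for room in doc["Rooms"]
    )


def uncorrelated_match(doc: dict) -> bool:
    """Each criterion may be satisfied by a DIFFERENT room."""
    has_deluxe = any(room["Type"] == "Deluxe Room" for room in doc["Rooms"])
    has_cheap = any(room["BaseRate"] < 100 for room in doc["Rooms"])
    return has_deluxe and has_cheap


correlated_result = correlated_match(hotel)      # no single room satisfies both
uncorrelated_result = uncorrelated_match(hotel)  # two different rooms satisfy one each
```

The correlated version corresponds to the filter shown above: the hotel is rejected because no single room is both deluxe and under 100.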

However, for full-text search, there's no way to refer to a specific range variable. If you use fielded search to issue a full Lucene query like this one:

    Rooms/Type:deluxe AND Rooms/Description:"city view"

you may get hotels back where one room is deluxe, and a different room mentions "city view" in the description. For example, the document below with an Id of 1 would match the query:

{
  "value": [
    {
      "Id": "1",
      "Rooms": [
        { "Type": "deluxe", "Description": "Large garden view suite" },
        { "Type": "standard", "Description": "Standard city view room" }
      ]
    },
    {
      "Id": "2",
      "Rooms": [
        { "Type": "deluxe", "Description": "Courtyard motel room" }
      ]
    }
  ]
}

The reason is that Rooms/Type refers to all the analyzed terms of the Rooms/Type field in the entire document, and similarly for Rooms/Description, as shown in the tables below.

How Rooms/Type is stored for full-text search:

Term in Rooms/Type    Document IDs
deluxe                1, 2
standard              1

How Rooms/Description is stored for full-text search:

Term in Rooms/Description    Document IDs
courtyard                    2
city                         1
garden                       1
large                        1
motel                        2
room                         1, 2
standard                     1
suite                        1
view                         1

So unlike the filter above, which basically says "match documents where a room has Type equal to 'Deluxe Room' and that same room has BaseRate less than 100", the search query says "match documents where Rooms/Type has the term 'deluxe' and Rooms/Description has the phrase 'city view'". In the latter case, there's no concept of individual rooms whose fields can be correlated.
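The following Python sketch reproduces this behavior on the two sample documents. It assumes a deliberately simple analyzer (lowercase, split on whitespace), which is a stand-in for whatever analyzer the index actually uses; the function names are hypothetical.

```python
from collections import defaultdict

documents = [
    {"Id": "1", "Rooms": [
        {"Type": "deluxe", "Description": "Large garden view suite"},
        {"Type": "standard", "Description": "Standard city view room"},
    ]},
    {"Id": "2", "Rooms": [
        {"Type": "deluxe", "Description": "Courtyard motel room"},
    ]},
]


def tokenize(text: str) -> list:
    """Assumed analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_term_index(docs: list, field: str) -> dict:
    """Map each term of Rooms/<field> to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for room in doc["Rooms"]:
            for term in tokenize(room[field]):
                index[term].add(doc["Id"])  # the room itself is not recorded
    return index


def contains_phrase(tokens: list, phrase: list) -> bool:
    """True if 'phrase' occurs as consecutive tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def matches_query(doc: dict) -> bool:
    """Models: Rooms/Type:deluxe AND Rooms/Description:"city view"."""
    type_hit = any("deluxe" in tokenize(r["Type"]) for r in doc["Rooms"])
    description_hit = any(
        contains_phrase(tokenize(r["Description"]), ["city", "view"])
        for r in doc["Rooms"]
    )
    # The two checks are independent, so different rooms can satisfy them.
    return type_hit and description_hit


type_index = build_term_index(documents, "Type")
description_index = build_term_index(documents, "Description")
matching_ids = [doc["Id"] for doc in documents if matches_query(doc)]
```

Because the term indexes record only document IDs, the information about which room a term came from is gone by the time the query runs. That is why document 1 matches even though its deluxe room has a garden view.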

Note

If you would like to see support for correlated search added to Azure Cognitive Search, please vote for this User Voice item.

Inverted indexes and collections

You may have noticed that there are far fewer restrictions on lambda expressions over complex collections than there are for simple collections like Collection(Edm.Int32), Collection(Edm.GeographyPoint), and so on. This is because Azure Cognitive Search stores complex collections as actual collections of sub-documents, while simple collections aren't stored as collections at all.

For example, consider a filterable string collection field like seasons in an index for an online retailer. Some documents uploaded to this index might look like this:

{
  "value": [
    {
      "id": "1",
      "name": "Hiking boots",
      "seasons": ["spring", "summer", "fall"]
    },
    {
      "id": "2",
      "name": "Rain jacket",
      "seasons": ["spring", "fall", "winter"]
    },
    {
      "id": "3",
      "name": "Parka",
      "seasons": ["winter"]
    }
  ]
}

The values of the seasons field are stored in a structure called an inverted index, which looks something like this:

Term      Document IDs
spring    1, 2
summer    1
fall      1, 2
winter    2, 3

This data structure is designed to answer one question with great speed: In which documents does a given term appear? Answering this question works more like a plain equality check than a loop over a collection. In fact, this is why, for string collections, Azure Cognitive Search only allows eq as a comparison operator inside a lambda expression for any.
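As a rough illustration, an inverted index over the seasons field of the sample documents can be modeled in Python as a dictionary from term to a set of document IDs. This is a simplified sketch, not the service's actual storage format; `build_inverted_index` and `any_eq` are hypothetical names.

```python
from collections import defaultdict

documents = [
    {"id": "1", "name": "Hiking boots", "seasons": ["spring", "summer", "fall"]},
    {"id": "2", "name": "Rain jacket", "seasons": ["spring", "fall", "winter"]},
    {"id": "3", "name": "Parka", "seasons": ["winter"]},
]


def build_inverted_index(docs: list, field: str) -> dict:
    """Map each term in a string collection field to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc[field]:
            index[term].add(doc["id"])
    return index


def any_eq(index: dict, term: str) -> set:
    """Models seasons/any(s: s eq '<term>') as a single dictionary lookup."""
    return set(index.get(term, set()))


seasons_index = build_inverted_index(documents, "seasons")
winter_docs = any_eq(seasons_index, "winter")
```

Note that `any_eq` never iterates over a document's collection. The lookup answers "which documents contain this term?" directly, which is why equality is the natural comparison for this structure.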

Building up from equality, next we'll look at how it's possible to combine multiple equality checks on the same range variable with or. It works thanks to algebra and the distributive property of quantifiers. This expression:

    seasons/any(s: s eq 'winter' or s eq 'fall')

is equivalent to:

    seasons/any(s: s eq 'winter') or seasons/any(s: s eq 'fall')

and each of the two any sub-expressions can be efficiently executed using the inverted index. Also, thanks to the negation law of quantifiers, this expression:

    seasons/all(s: s ne 'winter' and s ne 'fall')

is equivalent to:

    not seasons/any(s: s eq 'winter' or s eq 'fall')

which is why it's possible to use all with ne and and.
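Both equivalences can be checked with a small Python sketch. It assumes the posting sets from the seasons inverted index shown earlier and models `or` as a set union and the negated form as a set difference against all document IDs. The helper names are hypothetical, and the brute-force loops exist only to confirm that the set operations give the same answer.

```python
# Posting sets copied from the seasons inverted index in this article.
seasons_index = {
    "spring": {"1", "2"},
    "summer": {"1"},
    "fall": {"1", "2"},
    "winter": {"2", "3"},
}
# The original collections, used only for the brute-force comparison.
documents = {
    "1": ["spring", "summer", "fall"],
    "2": ["spring", "fall", "winter"],
    "3": ["winter"],
}
all_ids = set(documents)


def any_eq_or(terms: list) -> set:
    """seasons/any(s: s eq t1 or s eq t2 ...) as a union of term lookups."""
    result = set()
    for term in terms:
        result |= seasons_index.get(term, set())
    return result


def all_ne_and(terms: list) -> set:
    """seasons/all(s: s ne t1 and s ne t2 ...) as 'not any(...)'."""
    return all_ids - any_eq_or(terms)


def loop_any(terms: list) -> set:
    """Brute-force loop semantics of 'any', for comparison."""
    return {i for i, seasons in documents.items() if any(s in terms for s in seasons)}


def loop_all(terms: list) -> set:
    """Brute-force loop semantics of 'all', for comparison."""
    return {i for i, seasons in documents.items() if all(s not in terms for s in seasons)}


winter_or_fall = any_eq_or(["winter", "fall"])          # every sample document
neither_winter_nor_fall = all_ne_and(["winter", "fall"])  # no sample document
neither_spring_nor_summer = all_ne_and(["spring", "summer"])
```

With this sample data every document is available in winter or fall, so the first filter matches all three and its negation matches none. The spring/summer variant gives a less degenerate result: only document 3 has neither season.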

Note

Although the details are beyond the scope of this document, these same principles extend to distance and intersection tests for collections of geo-spatial points as well. This is why, in any:

  • geo.intersects cannot be negated
  • geo.distance must be compared using lt or le
  • expressions must be combined with or, not and

The converse rules apply for all.

A wider variety of expressions is allowed when filtering on collections of data types that support the lt, gt, le, and ge operators, such as Collection(Edm.Int32). Specifically, you can use and as well as or in any, as long as the underlying comparison expressions are combined into range comparisons using and, which are then further combined using or. This structure of Boolean expressions is called Disjunctive Normal Form (DNF), otherwise known as "ORs of ANDs". Conversely, lambda expressions for all for these data types must be in Conjunctive Normal Form (CNF), otherwise known as "ANDs of ORs". Azure Cognitive Search allows such range comparisons because it can execute them efficiently using inverted indexes, just like it can do fast term lookup for strings.
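The sketch below shows one way a filter in DNF could be answered from an index sorted by value. It is an assumption-laden illustration: the article does not describe how the service stores numeric values, so the sorted list with binary search, the field name `sizes`, and the sample data are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical Collection(Edm.Int32) field 'sizes', keyed by document ID.
documents = {"1": [3, 7], "2": [12, 25], "3": [40], "4": [3, 12]}

# A value-ordered index: (value, document ID) pairs sorted by value.
postings = sorted((value, doc_id) for doc_id, values in documents.items() for value in values)
values = [value for value, _ in postings]


def docs_in_range(low, high, low_inclusive=False, high_inclusive=False) -> set:
    """IDs of documents with at least one value in the range, found by binary search."""
    start = bisect_left(values, low) if low_inclusive else bisect_right(values, low)
    end = bisect_right(values, high) if high_inclusive else bisect_left(values, high)
    return {doc_id for _, doc_id in postings[start:end]}


# sizes/any(x: (x gt 5 and x lt 10) or (x ge 20 and x le 30))
# Each AND group is one range lookup; the OR is a union of the results.
dnf_result = docs_in_range(5, 10) | docs_in_range(
    20, 30, low_inclusive=True, high_inclusive=True
)

# Brute-force loop semantics, for comparison.
loop_result = {
    doc_id
    for doc_id, sizes in documents.items()
    if any((5 < x < 10) or (20 <= x <= 30) for x in sizes)
}

# Treating 'x gt 5' and 'x lt 10' as two independent lookups is NOT equivalent:
# document 4 has 12 (> 5) and 3 (< 10), but no single value between 5 and 10.
naive_intersection = {
    doc_id for doc_id, sizes in documents.items() if any(x > 5 for x in sizes)
} & {doc_id for doc_id, sizes in documents.items() if any(x < 10 for x in sizes)}
```

In this model each AND group is answered as a single range lookup, which keeps both bounds tied to the same element. Splitting the bounds into separate lookups loses that tie, as document 4 shows.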

In summary, here are the rules of thumb for what's allowed in a lambda expression:

  • Inside any, positive checks are always allowed, like equality, range comparisons, geo.intersects, or geo.distance compared with lt or le (think of "closeness" as being like equality when it comes to checking distance).
  • Inside any, or is always allowed. You can use and only for data types that can express range checks, and only if you use ORs of ANDs (DNF).
  • Inside all, the rules are reversed: only negative checks are allowed, you can always use and, and you can use or only for range checks expressed as ANDs of ORs (CNF).

In practice, these are the types of filters you're most likely to use anyway. It's still helpful to understand the boundaries of what's possible, though.

For specific examples of which kinds of filters are allowed and which aren't, see How to write valid collection filters.

Next steps