How to model complex data types in Azure Cognitive Search

External datasets used to populate an Azure Cognitive Search index can come in many shapes. Sometimes they include hierarchical or nested substructures. Examples include multiple addresses for a single customer, multiple colors and sizes for a single SKU, multiple authors of a single book, and so on. In modeling terms, these structures may be called complex, compound, composite, or aggregate data types. Azure Cognitive Search uses the term complex type for this concept. In Azure Cognitive Search, complex types are modeled using complex fields. A complex field is a field that contains children (sub-fields), which can be of any data type, including other complex types. This works much like structured data types in a programming language.

Complex fields represent either a single object in the document or an array of objects, depending on the data type. Fields of type Edm.ComplexType represent single objects, while fields of type Collection(Edm.ComplexType) represent arrays of objects.

Azure Cognitive Search natively supports complex types and collections. These types allow you to model almost any JSON structure in an Azure Cognitive Search index. In previous versions of Azure Cognitive Search APIs, only flattened row sets could be imported. In the newest version, your index can correspond more closely to the source data. In other words, if your source data has complex types, your index can have complex types too.

To get started, we recommend the Hotels data set, which you can load in the Import data wizard in the Azure portal. The wizard detects complex types in the source and suggests an index schema based on the detected structures.

Note

Support for complex types became generally available starting in api-version=2019-05-06.

If your search solution is built on the earlier workaround of flattened datasets in a collection, you should change your index to include complex types as supported in the newest API version. For more information about upgrading API versions, see Upgrade to the newest REST API version or Upgrade to the newest .NET SDK version.

Example of a complex structure

The following JSON document is composed of simple fields and complex fields. Complex fields, such as Address and Rooms, have sub-fields. Address has a single set of values for those sub-fields, since it's a single object in the document. In contrast, Rooms has multiple sets of values for its sub-fields, one for each object in the collection.

{
  "HotelId": "1",
  "HotelName": "Secret Point Motel",
  "Description": "Ideally located on the main commercial artery of the city in the heart of New York.",
  "Address": {
    "StreetAddress": "677 5th Ave",
    "City": "New York",
    "StateProvince": "NY"
  },
  "Rooms": [
    {
      "Description": "Budget Room, 1 Queen Bed (Cityside)",
      "Type": "Budget Room",
      "BaseRate": 96.99
    },
    {
      "Description": "Deluxe Room, 2 Double Beds (City View)",
      "Type": "Deluxe Room",
      "BaseRate": 150.99
    }
  ]
}
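The comparison to structured data types in a programming language can be made concrete. The following is a minimal Python sketch, not part of any Azure SDK: the class names, attribute names, and the `hotel_from_json` helper are hypothetical and exist only to show how a single complex object (Address) and a complex collection (Rooms) map onto nested types.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Address:
    street_address: str
    city: str
    state_province: str


@dataclass
class Room:
    description: str
    type: str
    base_rate: float


@dataclass
class Hotel:
    hotel_id: str
    hotel_name: str
    address: Address   # single object, like Edm.ComplexType
    rooms: List[Room]  # array of objects, like Collection(Edm.ComplexType)


def hotel_from_json(text: str) -> Hotel:
    """Map a hotel JSON document onto the nested types above (hypothetical helper)."""
    raw = json.loads(text)
    addr = raw["Address"]
    return Hotel(
        hotel_id=raw["HotelId"],
        hotel_name=raw["HotelName"],
        address=Address(addr["StreetAddress"], addr["City"], addr["StateProvince"]),
        rooms=[Room(r["Description"], r["Type"], r["BaseRate"]) for r in raw["Rooms"]],
    )


SAMPLE = """
{
  "HotelId": "1",
  "HotelName": "Secret Point Motel",
  "Address": {"StreetAddress": "677 5th Ave", "City": "New York", "StateProvince": "NY"},
  "Rooms": [
    {"Description": "Budget Room, 1 Queen Bed (Cityside)", "Type": "Budget Room", "BaseRate": 96.99},
    {"Description": "Deluxe Room, 2 Double Beds (City View)", "Type": "Deluxe Room", "BaseRate": 150.99}
  ]
}
"""

hotel = hotel_from_json(SAMPLE)
```

Here `hotel.address` holds one set of sub-field values, while `hotel.rooms` holds one set per room.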

Creating complex fields

As with any index definition, you can use the portal, REST API, or .NET SDK to create a schema that includes complex types.

The following example shows a JSON index schema with simple fields, collections, and complex types. Notice that within a complex type, each sub-field has a type and may have attributes, just as top-level fields do. The schema corresponds to the example data above. Address is a complex field that isn't a collection (a hotel has one address). Rooms is a complex collection field (a hotel has many rooms).

{
  "name": "hotels",
  "fields": [
    { "name": "HotelId", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": false },
    { "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
    { "name": "Address", "type": "Edm.ComplexType",
      "fields": [
        { "name": "StreetAddress", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": true },
        { "name": "City", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
        { "name": "StateProvince", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true }
      ]
    },
    { "name": "Rooms", "type": "Collection(Edm.ComplexType)",
      "fields": [
        { "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
        { "name": "Type", "type": "Edm.String", "searchable": true },
        { "name": "BaseRate", "type": "Edm.Double", "filterable": true, "facetable": true }
      ]
    }
  ]
}
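To see how the nested "fields" arrays translate into the slash-separated paths used later in queries, the schema can be walked recursively. This is an illustrative Python sketch only: the `field_paths` helper is hypothetical, and the schema literal is an abbreviated copy of the definition above with the attributes left out.

```python
from typing import Dict, Iterator, List, Tuple


def field_paths(fields: List[Dict], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, type) for every field and sub-field in an index definition."""
    for field in fields:
        path = prefix + field["name"]
        yield path, field["type"]
        # Only complex fields carry a nested "fields" array.
        yield from field_paths(field.get("fields", []), path + "/")


# Abbreviated copy of the hotels schema (attributes omitted).
SCHEMA = {
    "name": "hotels",
    "fields": [
        {"name": "HotelId", "type": "Edm.String"},
        {"name": "HotelName", "type": "Edm.String"},
        {"name": "Address", "type": "Edm.ComplexType", "fields": [
            {"name": "StreetAddress", "type": "Edm.String"},
            {"name": "City", "type": "Edm.String"},
            {"name": "StateProvince", "type": "Edm.String"},
        ]},
        {"name": "Rooms", "type": "Collection(Edm.ComplexType)", "fields": [
            {"name": "Description", "type": "Edm.String"},
            {"name": "Type", "type": "Edm.String"},
            {"name": "BaseRate", "type": "Edm.Double"},
        ]},
    ],
}

paths = dict(field_paths(SCHEMA["fields"]))
for path, edm_type in paths.items():
    print(f"{path}: {edm_type}")
```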

Updating complex fields

All of the reindexing rules that apply to fields in general still apply to complex fields. To restate the main rule: adding a field doesn't require an index rebuild, but most modifications do.

Structural updates to the definition

You can add new sub-fields to a complex field at any time without the need for an index rebuild. For example, adding "ZipCode" to Address or "Amenities" to Rooms is allowed, just like adding a top-level field to an index. Existing documents have a null value for new fields until you explicitly populate those fields by updating your data.

As with top-level fields, each new sub-field has a type and may have attributes.
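A structural update of this kind amounts to appending an entry to the parent's "fields" array in the index definition. The sketch below shows that step on an in-memory copy of the definition; `add_sub_field` is a hypothetical helper, the Edm.String type and filterable attribute chosen for ZipCode are assumptions, and sending the updated definition back to the service is left out.

```python
import copy
from typing import Dict

COMPLEX_TYPES = ("Edm.ComplexType", "Collection(Edm.ComplexType)")


def add_sub_field(index: Dict, parent_name: str, new_field: Dict) -> Dict:
    """Return a copy of the index definition with new_field added under parent_name."""
    updated = copy.deepcopy(index)  # leave the caller's definition untouched
    for field in updated["fields"]:
        if field["name"] != parent_name:
            continue
        if field["type"] not in COMPLEX_TYPES:
            raise ValueError(f"{parent_name} is not a complex field")
        if any(sub["name"] == new_field["name"] for sub in field["fields"]):
            raise ValueError(f"{new_field['name']} already exists")
        field["fields"].append(new_field)
        return updated
    raise KeyError(parent_name)


index = {
    "name": "hotels",
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True},
        {"name": "Address", "type": "Edm.ComplexType", "fields": [
            {"name": "City", "type": "Edm.String"},
        ]},
    ],
}

# Assumed type and attribute for the new sub-field.
zip_code = {"name": "ZipCode", "type": "Edm.String", "filterable": True}
new_index = add_sub_field(index, "Address", zip_code)
```

Existing documents would report null for Address/ZipCode until they are updated with a value.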

Data updates

Updating existing documents in an index with the upload action works the same way for complex and simple fields: all fields are replaced. However, merge (or mergeOrUpload when applied to an existing document) doesn't work the same across all fields. Specifically, merge doesn't support merging elements within a collection. This limitation applies to both collections of primitive types and complex collections. To update a collection, you'll need to retrieve the full collection value, make changes, and then include the new collection in the Index API request.
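The retrieve, change, and resend pattern can be sketched as follows. This Python example only builds the request body for a merge action; the `replace_rooms_action` helper is hypothetical, the room values come from the sample document above, and the HTTP call that would post the body to the index is omitted.

```python
import json
from typing import Dict, List


def replace_rooms_action(hotel_id: str, current_rooms: List[Dict],
                         room_type: str, new_rate: float) -> Dict:
    """Build a merge payload that carries the FULL Rooms collection.

    merge can't patch one element of a collection, so every room is sent,
    including the rooms that did not change.
    """
    rooms = [dict(room) for room in current_rooms]  # copy; don't mutate the input
    for room in rooms:
        if room["Type"] == room_type:
            room["BaseRate"] = new_rate
    return {
        "value": [
            {"@search.action": "merge", "HotelId": hotel_id, "Rooms": rooms}
        ]
    }


# Rooms as previously retrieved from the index (values from the sample document).
existing_rooms = [
    {"Description": "Budget Room, 1 Queen Bed (Cityside)", "Type": "Budget Room", "BaseRate": 96.99},
    {"Description": "Deluxe Room, 2 Double Beds (City View)", "Type": "Deluxe Room", "BaseRate": 150.99},
]

payload = replace_rooms_action("1", existing_rooms, "Deluxe Room", 159.99)
print(json.dumps(payload, indent=2))
```

Sending only the changed room would replace the whole collection with that single element, which is why the unchanged budget room is included.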

Searching complex fields

Free-form search expressions work as expected with complex types. If any searchable field or sub-field anywhere in a document matches, then the document itself is a match.

Queries get more nuanced when you have multiple terms and operators, and some terms have field names specified, as is possible with the Lucene syntax. For example, this query attempts to match two terms, "Portland" and "OR", against two sub-fields of the Address field:

search=Address/City:Portland AND Address/StateProvince:OR

Queries like this are uncorrelated for full-text search, unlike filters. In filters, queries over sub-fields of a complex collection are correlated using range variables in any or all. The Lucene query above returns documents containing both "Portland, Maine" and "Portland, Oregon", along with other cities in Oregon. This happens because each clause applies to all values of its field in the entire document, so there's no concept of a "current sub-document". For more information, see Understanding OData collection filters in Azure Cognitive Search.
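The difference between uncorrelated and correlated matching can be modeled in a few lines of Python. This is a simplified illustration, not the search engine's implementation: it uses exact string comparison in place of text analysis, and the Locations collection with City and State sub-fields is a hypothetical document shape chosen so that one document holds several sub-documents.

```python
from typing import Dict


def matches_uncorrelated(doc: Dict) -> bool:
    """Full-text style: each clause is tested against ALL values of its field."""
    cities = {loc["City"] for loc in doc["Locations"]}
    states = {loc["State"] for loc in doc["Locations"]}
    return "Portland" in cities and "OR" in states


def matches_correlated(doc: Dict) -> bool:
    """Filter style: both conditions must hold for the SAME sub-document."""
    return any(
        loc["City"] == "Portland" and loc["State"] == "OR"
        for loc in doc["Locations"]
    )


# Hypothetical document: Portland is in Maine, and the Oregon city is Salem.
doc = {
    "Locations": [
        {"City": "Portland", "State": "ME"},
        {"City": "Salem", "State": "OR"},
    ]
}

print(matches_uncorrelated(doc), matches_correlated(doc))
```

The uncorrelated check accepts the document because "Portland" and "OR" each appear somewhere in it, even though no single location is Portland, Oregon.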

Selecting complex fields

The $select parameter is used to choose which fields are returned in search results. To use this parameter to select specific sub-fields of a complex field, include the parent field and sub-field separated by a slash (/).

$select=HotelName, Address/City, Rooms/BaseRate

Fields must be marked as retrievable in the index if you want them in search results. Only fields marked as retrievable can be used in a $select statement.
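When a request is assembled in code, the sub-field paths are joined with commas and URL-encoded like any other query parameter. The following Python sketch uses only the standard library; `build_query` is a hypothetical helper, and the api-version value is the one mentioned in the note near the top of this article.

```python
from typing import List
from urllib.parse import parse_qs, urlencode


def build_query(search: str, select: List[str]) -> str:
    """Return a URL-encoded query string with a $select list of field paths."""
    params = {
        "api-version": "2019-05-06",
        "search": search,
        "$select": ",".join(select),  # parent/sub-field paths, comma-separated
    }
    return urlencode(params)


query = build_query("*", ["HotelName", "Address/City", "Rooms/BaseRate"])
print(query)

# Decoding the string again recovers the original paths.
decoded = parse_qs(query)
```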

Filter, facet, and sort complex fields

The same OData path syntax used for filtering and fielded searches can also be used for faceting, sorting, and selecting fields in a search request. For complex types, rules apply that govern which sub-fields can be marked as sortable or facetable. For more information on these rules, see the Create Index API reference.

Faceting sub-fields

Any sub-field can be marked as facetable unless it is of type Edm.GeographyPoint or Collection(Edm.GeographyPoint).

The document counts returned in the facet results are calculated for the parent document (a hotel), not the sub-documents in a complex collection (rooms). For example, suppose a hotel has 20 rooms of type "suite". Given the facet parameter facet=Rooms/Type, the facet count will be one for the hotel, not 20 for the rooms.
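The counting rule can be reproduced with a small model: each hotel contributes at most one to the count of a given room type, no matter how many of its rooms share that type. This Python sketch is only an illustration of the rule; `facet_room_types` and the sample hotels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def facet_room_types(hotels: List[Dict]) -> Counter:
    """Count parent documents (hotels) per room type, not individual rooms."""
    counts: Counter = Counter()
    for hotel in hotels:
        # A set collapses repeated types within one hotel to a single entry.
        distinct_types = {room["Type"] for room in hotel["Rooms"]}
        counts.update(distinct_types)
    return counts


hotels = [
    {"HotelId": "1", "Rooms": [{"Type": "suite"}] * 20},
    {"HotelId": "2", "Rooms": [{"Type": "suite"}, {"Type": "budget"}]},
]

counts = facet_room_types(hotels)
print(counts)
```

The first hotel has 20 suites but adds only one to the "suite" count.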

Sorting complex fields

Sort operations apply to documents (Hotels) and not sub-documents (Rooms). When you have a complex type collection, such as Rooms, it's important to realize that you can't sort on Rooms at all. In fact, you can't sort on any collection.

Sort operations work when fields have a single value per document, whether the field is a simple field or a sub-field in a complex type. For example, Address/City is allowed to be sortable because there's only one address per hotel, so $orderby=Address/City will sort hotels by city.
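The single-value rule can be checked structurally: a path is a sort candidate only if no field along it is a collection. The Python sketch below is a hypothetical helper, not a service API. It looks only at types in an abbreviated schema, ignores the sortable attribute itself, and assumes that a complex field as a whole is not a sort target.

```python
from typing import Dict, List


def can_sort_on(fields: List[Dict], path: str) -> bool:
    """Return True if every segment of path is single-valued and the leaf is simple."""
    current = fields
    field = None
    for name in path.split("/"):
        field = next((f for f in current if f["name"] == name), None)
        if field is None:
            raise KeyError(path)
        if field["type"].startswith("Collection("):
            return False  # more than one value per document
        current = field.get("fields", [])
    # Assumption of this sketch: a complex object itself is not sortable.
    return field["type"] != "Edm.ComplexType"


# Abbreviated copy of the hotels schema (attributes omitted).
fields = [
    {"name": "HotelName", "type": "Edm.String"},
    {"name": "Address", "type": "Edm.ComplexType", "fields": [
        {"name": "City", "type": "Edm.String"},
    ]},
    {"name": "Rooms", "type": "Collection(Edm.ComplexType)", "fields": [
        {"name": "BaseRate", "type": "Edm.Double"},
    ]},
]

for candidate in ("HotelName", "Address/City", "Rooms/BaseRate"):
    print(candidate, can_sort_on(fields, candidate))
```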

Filtering on complex fields

You can refer to sub-fields of a complex field in a filter expression. Just use the same OData path syntax that's used for faceting, sorting, and selecting fields. For example, the following filter will return all hotels in Canada:

$filter=Address/Country eq 'Canada'

To filter on a complex collection field, you can use a lambda expression with the any and all operators. In that case, the range variable of the lambda expression is an object with sub-fields. You can refer to those sub-fields with the standard OData path syntax. For example, the following filter will return all hotels that have at least one deluxe room and in which all rooms are non-smoking:

$filter=Rooms/any(room: room/Type eq 'Deluxe Room') and Rooms/all(room: not room/SmokingAllowed)
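For readers more familiar with general-purpose languages, the any and all lambda expressions in this filter behave much like Python's built-in `any` and `all` over the Rooms collection, with `room` playing the role of the range variable. The sketch below models that logic on in-memory documents; `matches_filter` and the sample hotels are hypothetical, and the service evaluates the real filter against the index, not like this.

```python
from typing import Dict, List


def matches_filter(hotel: Dict) -> bool:
    """Model of: Rooms/any(room: room/Type eq 'Deluxe Room')
    and Rooms/all(room: not room/SmokingAllowed)"""
    rooms = hotel["Rooms"]
    has_deluxe = any(room["Type"] == "Deluxe Room" for room in rooms)
    all_non_smoking = all(not room["SmokingAllowed"] for room in rooms)
    return has_deluxe and all_non_smoking


hotels: List[Dict] = [
    {"HotelId": "1", "Rooms": [
        {"Type": "Deluxe Room", "SmokingAllowed": False},
        {"Type": "Budget Room", "SmokingAllowed": False},
    ]},
    {"HotelId": "2", "Rooms": [
        {"Type": "Deluxe Room", "SmokingAllowed": False},
        {"Type": "Budget Room", "SmokingAllowed": True},  # fails the all() clause
    ]},
    {"HotelId": "3", "Rooms": [
        {"Type": "Budget Room", "SmokingAllowed": False},  # fails the any() clause
    ]},
]

matching_ids = [hotel["HotelId"] for hotel in hotels if matches_filter(hotel)]
print(matching_ids)
```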

As with top-level simple fields, simple sub-fields of complex fields can only be included in filters if they have the filterable attribute set to true in the index definition. For more information, see the Create Index API reference.

Next steps

Try the Hotels data set in the Import data wizard. You'll need the Cosmos DB connection information provided in the readme to access the data.

With that information in hand, your first step in the wizard is to create a new Azure Cosmos DB data source. Further on in the wizard, when you get to the target index page, you'll see an index with complex types. Create and load this index, and then execute queries to understand the new structure.