如何为 Azure 认知搜索中要导入和编制索引的 SQL 关系数据建模How to model relational SQL data for import and indexing in Azure Cognitive Search

Azure 认知搜索接受使用平展行集作为索引编制管道的输入。Azure Cognitive Search accepts a flat rowset as input to the indexing pipeline. 如果源数据源自 SQL Server 关系数据库中的联接表,你可以参考本文来构造结果集,以及在 Azure 认知搜索索引中为父子关系建模。If your source data originates from joined tables in a SQL Server relational database, this article explains how to construct the result set, and how to model a parent-child relationship in an Azure Cognitive Search index.

为便于演示,我们将基于演示数据引用一个虚构的酒店数据库。As an illustration, we'll refer to a hypothetical hotels database, based on demo data. 假设该数据库包括一个 Hotels$ 表(其中包含 50 家酒店),以及一个 Rooms$ 表(其中包含各种类型的提供不同价格和设施的总共 750 间客房)。Assume the database consists of a Hotels$ table with 50 hotels, and a Rooms$ table with rooms of varying types, rates, and amenities, for a total of 750 rooms. 表之间存在一对多关系。There is a one-to-many relationship between the tables. 在我们的方法中,某个视图将提供返回 50 行(每家酒店各对应一行)以及嵌入到每行中的关联客房详细信息的查询。In our approach, a view will provide the query that returns 50 rows, one row per hotel, with associated room detail embedded into each row.

酒店数据库中的表和视图Tables and view in the Hotels database

反规范化数据的问题The problem of denormalized data

使用一对多关系的一个难题在于,基于联接表生成的标准查询将返回反规范化的数据,这在 Azure 认知搜索方案中无法顺利运行。One of the challenges in working with one-to-many relationships is that standard queries built on joined tables will return denormalized data, which doesn't work well in an Azure Cognitive Search scenario. 考虑以下联接酒店和客房的示例。Consider the following example that joins hotels and rooms.

ON Rooms$.HotelID = Hotels$.HotelID

此查询的结果返回所有酒店字段,后接所有客房字段,以及针对每个客房值重复的初步酒店信息。Results from this query return all of the Hotel fields, followed by all Room fields, with preliminary hotel information repeating for each room value.

反规范化数据,添加客房字段后多余的酒店数据Denormalized data, redundant hotel data when room fields are added

尽管此查询表面上可以成功(提供平整行集中的所有数据),但它无法为预期的搜索体验提供正确的文档结构。While this query succeeds on the surface (providing all of the data in a flat row set), it fails in delivering the right document structure for the expected search experience. 在编制索引期间,Azure 认知搜索将为引入的每个行创建一个搜索文档。During indexing, Azure Cognitive Search will create one search document for each row ingested. 如果搜索文档类似于上述结果,则你会看到重复项 - 单单是 Twin Dome 酒店就有七个不同的文档。If your search documents looked like the above results, you would have perceived duplicates - seven separate documents for the Twin Dome hotel alone. 根据“佛罗里达州的酒店”进行查询只会返回 Twin Dome 酒店的七条结果,而其他相关酒店则深藏在搜索结果中。A query on "hotels in Florida" would return seven results for just the Twin Dome hotel, pushing other relevant hotels deep into the search results.

若要获得每家酒店只有一个文档的预期体验,应以适当的粒度提供一个行集,但同时需要提供完整的信息。To get the expected experience of one document per hotel, you should provide a rowset at the right granularity, but with complete information. 幸运的是,采用本文中所述的方法可以轻松实现此目的。Fortunately, you can do this easily by adopting the techniques in this article.

定义返回嵌入式 JSON 的查询Define a query that returns embedded JSON

若要提供预期的搜索体验,数据集应为 Azure 认知搜索中的每个搜索文档包括一行。To deliver the expected search experience, your data set should consist of one row for each search document in Azure Cognitive Search. 在本示例中,我们希望每家酒店只有一行,但同时希望用户能够搜索他们所关心的其他客房相关字段,例如,每晚的价格、床位的尺寸和数量,或者沙滩景观,所有这些信息都是客房详细信息的一部分。In our example, we want one row for each hotel, but we also want our users to be able to search on other room-related fields they care about, such as the nightly rate, size and number of beds, or a view of the beach, all of which are part of a room detail.

解决方法是捕获嵌套 JSON 格式的客房详细信息,然后将 JSON 结构插入到视图中的字段,如第二步中所述。The solution is to capture the room detail as nested JSON, and then insert the JSON structure into a field in a view, as shown in the second step.

  1. 假设你有两个联接表:Hotels$ 和 Rooms$,其中包含 50 家酒店和 750 间客房的详细信息,这两个表已在 HotelID 字段中联接。Assume you have two joined tables, Hotels$ and Rooms$, that contain details for 50 hotels and 750 rooms, and are joined on the HotelID field. 这些表各自包含 50 家酒店和 750 间相关的客房。Individually, these tables contain 50 hotels and 750 related rooms.

    CREATE TABLE [dbo].[Hotels$](
      [HotelID] [nchar](10) NOT NULL,
      [HotelName] [nvarchar](255) NULL,
      [Description] [nvarchar](max) NULL,
      [Description_fr] [nvarchar](max) NULL,
      [Category] [nvarchar](255) NULL,
      [Tags] [nvarchar](255) NULL,
      [ParkingIncluded] [float] NULL,
      [SmokingAllowed] [float] NULL,
      [LastRenovationDate] [smalldatetime] NULL,
      [Rating] [float] NULL,
      [StreetAddress] [nvarchar](255) NULL,
      [City] [nvarchar](255) NULL,
      [State] [nvarchar](255) NULL,
      [ZipCode] [nvarchar](255) NULL,
      [GeoCoordinates] [nvarchar](255) NULL
    CREATE TABLE [dbo].[Rooms$](
      [HotelID] [nchar](10) NULL,
      [Description] [nvarchar](255) NULL,
      [Description_fr] [nvarchar](255) NULL,
      [Type] [nvarchar](255) NULL,
      [BaseRate] [float] NULL,
      [BedOptions] [nvarchar](255) NULL,
      [SleepsCount] [float] NULL,
      [SmokingAllowed] [float] NULL,
      [Tags] [nvarchar](255) NULL
    ) ON [PRIMARY]
  2. 创建一个视图,其中包含父表 (SELECT * from dbo.Hotels$) 中的所有字段,加上包含嵌套查询的输出的新 Rooms 字段。Create a view composed of all fields in the parent table (SELECT * from dbo.Hotels$), with the addition of a new Rooms field that contains the output of a nested query. SELECT * from dbo.Rooms$ 中的 FOR JSON AUTO 子句将输出结构化为 JSON。A FOR JSON AUTO clause on SELECT * from dbo.Rooms$ structures the output as JSON.

    CREATE VIEW [dbo].[HotelRooms]
             FROM dbo.Rooms$
             WHERE dbo.Rooms$.HotelID = dbo.Hotels$.HotelID FOR JSON AUTO) AS Rooms
    FROM dbo.Hotels$

    以下屏幕截图显示了生成的视图,底部是 Rooms nvarchar 字段。The following screenshot shows the resulting view, with the Rooms nvarchar field at the bottom. “客房”字段仅存在于 HotelRooms 视图中。The Rooms field exists only in the HotelRooms view.

    HotelRooms 视图HotelRooms view

  3. 运行 SELECT * FROM dbo.HotelRooms 检索行集。Run SELECT * FROM dbo.HotelRooms to retrieve the row set. 此查询返回 50 行(每家酒店对应一行),以及 JSON 集合形式的关联客房信息。This query returns 50 rows, one per hotel, with associated room information as a JSON collection.

    HotelRooms 视图中的行集Rowset from HotelRooms view

此行集现已准备好导入 Azure 认知搜索。This rowset is now ready for import into Azure Cognitive Search.


此方法假设嵌入的 JSON 低于 SQL Server 的最大列大小限制This approach assumes that embedded JSON is under the maximum column size limits of SQL Server.

为一对多关系的“多”端使用复杂集合Use a complex collection for the "many" side of a one-to-many relationship

在 Azure 认知搜索端,创建一个使用嵌套式 JSON 为一对多关系建模的索引架构。On the Azure Cognitive Search side, create an index schema that models the one-to-many relationship using nested JSON. 在上一部分创建的结果集通常对应于下面提供的索引架构(为简洁起见,我们截掉了一些字段)。The result set you created in the previous section generally corresponds to the index schema provided below (we cut some fields for brevity).

以下示例类似于如何为复杂数据类型建模中的示例。The following example is similar to the example in How to model complex data types. Rooms 结构(本文的重点所在)位于索引的名为 hotels 的字段集合中。The Rooms structure, which has been the focus of this article, is in the fields collection of an index named hotels. 此示例还显示了 Address 的复杂类型,该类型与 Rooms 的不同之处在于,它包含一组固定的项,而不是集合中允许的多个任意数量的项。This example also shows a complex type for Address, which differs from Rooms in that it is composed of a fixed set of items, as opposed to the multiple, arbitrary number of items allowed in a collection.

  "name": "hotels",
  "fields": [
    { "name": "HotelId", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": false },
    { "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
    { "name": "Description_fr", "type": "Edm.String", "searchable": true, "analyzer": "fr.lucene" },
    { "name": "Category", "type": "Edm.String", "searchable": true, "filterable": false },
    { "name": "ParkingIncluded", "type": "Edm.Boolean", "filterable": true, "facetable": true },
    { "name": "Address", "type": "Edm.ComplexType",
      "fields": [
        { "name": "StreetAddress", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": true },
        { "name": "City", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
        { "name": "StateProvince", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true }
    { "name": "Rooms", "type": "Collection(Edm.ComplexType)",
      "fields": [
        { "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
        { "name": "Description_fr", "type": "Edm.String", "searchable": true, "analyzer": "fr.lucene" },
        { "name": "Type", "type": "Edm.String", "searchable": true },
        { "name": "BaseRate", "type": "Edm.Double", "filterable": true, "facetable": true },
        { "name": "BedOptions", "type": "Edm.String", "searchable": true, "filterable": true, "facetable": true },
        { "name": "SleepsCount", "type": "Edm.Int32", "filterable": true, "facetable": true },
        { "name": "SmokingAllowed", "type": "Edm.Boolean", "filterable": true, "facetable": true },
        { "name": "Tags", "type": "Edm.Collection", "searchable": true }

在给定上述结果集和上述索引架构后,你便获得了成功完成索引编制操作所需的全部组件。Given the previous result set and the above index schema, you have all the required components for a successful indexing operation. 平展数据集符合索引编制要求,同时可以保留详细信息。The flattened data set meets indexing requirements yet preserves detail information. 在 Azure 认知搜索索引中,搜索结果很容易变成基于酒店的实体,同时可以保留各个客房及其属性的上下文。In the Azure Cognitive Search index, search results will fall easily into hotel-based entities, while preserving the context of individual rooms and their attributes.

后续步骤Next steps

可以使用自己的数据集通过导入数据向导来创建和加载索引。Using your own data set, you can use the Import data wizard to create and load the index. 该向导将检测嵌入的 JSON 集合(例如 Rooms 中包含的集合),并推断包含复杂类型集合的索引架构。The wizard detects the embedded JSON collection, such as the one contained in Rooms, and infers an index schema that includes a complex type collection.

导入数据向导推断的索引Index inferred by Import data wizard

请尝试学习以下快速入门,以了解导入数据向导的基本步骤。Try the following quickstart to learn the basic steps of the Import data wizard.