Create a basic search index in Azure Cognitive Search

In Azure Cognitive Search, a search index stores the searchable content used for full text and filtered queries. An index is defined by a schema and saved to the service, with data import following as a second step.

Indexes contain documents. Conceptually, a document is a single unit of searchable data in your index. A retailer might have a document for each product, a news organization might have a document for each article, and so forth. In more familiar database terms, a search index equates to a table, and documents are roughly equivalent to rows in a table.

The physical structure of an index is determined by the schema: each field marked as "searchable" gets an inverted index created for it.

You can create an index in the Azure portal or in code, using the REST API or the .NET SDK.

It's easier to learn with a portal tool. The portal enforces requirements and schema rules for specific data types, such as disallowing full text search capabilities on numeric fields. Once you have a workable index, you can transition to code by retrieving the JSON definition from the service using Get Index (REST API) and adding it to your solution.

Arriving at a final index design is an iterative process. It's common to start with the portal to create the initial index and then switch to code to place the index under source control.

  1. Determine whether you can use Import data. The wizard performs all-in-one indexer-based indexing if the source data is from a supported data source type in Azure.

  2. If you can't use Import data, start with Add Index to define the schema.

    (Screenshot: Add index command)

  3. Provide a name and a key used to uniquely identify each search document in the index. The key is mandatory and must be of type Edm.String. During import, plan on mapping a unique field in the source data to this field.

    The portal gives you an id field for the key. To override the default id, create a new field (for example, a new field definition called HotelId) and then select it in Key.

    (Screenshot: Fill in required properties)

  4. Add more fields. The portal shows you which field attributes are available for different data types. If you're new to index design, this is helpful.

    If incoming data is hierarchical in nature, assign the complex type data type to represent the nested structures. The built-in sample data set, Hotels, illustrates complex types using an Address (containing multiple sub-fields) that has a one-to-one relationship with each hotel, and a Rooms complex collection, where multiple rooms are associated with each hotel.

  5. Assign any analyzers to string fields before the index is created. Do the same for suggesters if you want to enable autocomplete on specific fields.

  6. Click Create to build the physical structures in your search service.

  7. After an index is created, use additional commands to review definitions or add more elements.

    (Screenshot: Add index page showing attributes by data type)

  8. Download the index schema using Get Index (REST API) and a web testing tool like Postman. You now have a JSON representation of the index that you can adapt for code.

  9. Load your index with data. Azure Cognitive Search accepts JSON documents. To load your data programmatically, you can use Postman with JSON documents in the request payload. If your data is not easily expressed as JSON, this step will be the most labor intensive.

    Once an index is loaded with data, most edits to existing fields require that you drop and rebuild the index.

  10. Query your index, examine the results, and further iterate on the index schema until you begin to see the results you expect. You can use Search explorer or Postman to query your index.
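
Steps 8 through 10 can also be scripted instead of run from Postman. The following is a minimal sketch rather than an official client: it only builds the HTTP requests for Get Index, a document upload batch, and a simple query, and sends nothing. The service name, index name, key value, the helper function names, and the API version shown are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_VERSION = "2020-06-30"  # assumption: substitute the version your service supports


def service_url(service, path, **params):
    """Build a REST URL; `service` is a placeholder search service name."""
    query = urlencode({"api-version": API_VERSION, **params})
    return f"https://{service}.search.windows.net/{path}?{query}"


def get_index_request(service, index, api_key):
    """Step 8: request the JSON definition of an existing index."""
    return Request(service_url(service, f"indexes/{index}"),
                   headers={"api-key": api_key}, method="GET")


def upload_request(service, index, api_key, docs):
    """Step 9: wrap JSON documents in an upload batch."""
    batch = {"value": [{"@search.action": "upload", **doc} for doc in docs]}
    return Request(service_url(service, f"indexes/{index}/docs/index"),
                   data=json.dumps(batch).encode("utf-8"),
                   headers={"api-key": api_key, "Content-Type": "application/json"},
                   method="POST")


def query_request(service, index, api_key, text):
    """Step 10: a simple full text query against the index."""
    return Request(service_url(service, f"indexes/{index}/docs", search=text),
                   headers={"api-key": api_key}, method="GET")
```

To actually send one of these requests, you could pass it to `urllib.request.urlopen` and parse the response body with `json.load`.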

During development, plan on frequent rebuilds. Because physical structures are created in the service, most modifications to an existing field definition require dropping and recreating the index. Consider working with a subset of your data to make rebuilds go faster.

Tip

Code, rather than a portal approach, is recommended for working on index design and data import simultaneously. As an alternative, tools like Postman and the REST API are helpful for proof-of-concept testing when development projects are still in early phases. You can make incremental changes to an index definition in a request body, and then send the request to your service to recreate an index using an updated schema.
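
As a rough illustration of that rebuild loop, the sketch below changes an attribute on an existing field in a copy of the definition, then builds the DELETE and PUT requests that would drop and recreate the index. Nothing is sent. The service name, index name, key placeholder, API version, and both helper functions are hypothetical.

```python
import copy
import json
from urllib.request import Request

API_VERSION = "2020-06-30"  # assumption: substitute the version your service supports


def with_field_change(index_definition, field_name, **attributes):
    """Return a copy of the definition with attributes changed on one field."""
    updated = copy.deepcopy(index_definition)
    for field in updated["fields"]:
        if field["name"] == field_name:
            field.update(attributes)
    return updated


def rebuild_requests(service, index_definition, api_key):
    """Build the DELETE and PUT requests that drop and recreate an index."""
    url = (f"https://{service}.search.windows.net/indexes/"
           f"{index_definition['name']}?api-version={API_VERSION}")
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps(index_definition).encode("utf-8")
    return [Request(url, headers=headers, method="DELETE"),
            Request(url, data=body, headers=headers, method="PUT")]


draft = {"name": "hotels-draft",
         "fields": [{"name": "HotelId", "type": "Edm.String", "key": True},
                    {"name": "HotelName", "type": "Edm.String", "sortable": False}]}
revised = with_field_change(draft, "HotelName", sortable=True)
drop, create = rebuild_requests("my-service", revised, "<admin-api-key>")
```

Keeping the definition in a file under source control and regenerating both requests from it makes each iteration repeatable.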

Index schema

An index is required to have a name and one designated key field (of type Edm.String) in the fields collection. The fields collection is typically the largest part of an index, where each field is named, typed, and attributed with allowable behaviors that determine how it is used.

Other elements include suggesters, scoring profiles, analyzers used to process strings into tokens according to linguistic rules or other characteristics supported by the analyzer, and cross-origin resource sharing (CORS) settings.

{
  "name": (optional on PUT; required on POST) "name_of_index",
  "fields": [
    {
      "name": "name_of_field",
      "type": "Edm.String | Collection(Edm.String) | Edm.Int32 | Edm.Int64 | Edm.Double | Edm.Boolean | Edm.DateTimeOffset | Edm.GeographyPoint",
      "searchable": true (default where applicable) | false (only Edm.String and Collection(Edm.String) fields can be searchable),
      "filterable": true (default) | false,
      "sortable": true (default where applicable) | false (Collection(Edm.String) fields cannot be sortable),
      "facetable": true (default where applicable) | false (Edm.GeographyPoint fields cannot be facetable),
      "key": true | false (default, only Edm.String fields can be keys),
      "retrievable": true (default) | false,
      "analyzer": "name_of_analyzer_for_search_and_indexing", (only if 'searchAnalyzer' and 'indexAnalyzer' are not set)
      "searchAnalyzer": "name_of_search_analyzer", (only if 'indexAnalyzer' is set and 'analyzer' is not set)
      "indexAnalyzer": "name_of_indexing_analyzer", (only if 'searchAnalyzer' is set and 'analyzer' is not set)
      "synonymMaps": [ "name_of_synonym_map" ] (optional, only one synonym map per field is currently supported)
    }
  ],
  "suggesters": [
    {
      "name": "name of suggester",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": ["field1", "field2", ...]
    }
  ],
  "scoringProfiles": [
    {
      "name": "name of scoring profile",
      "text": (optional, only applies to searchable fields) {
        "weights": {
          "searchable_field_name": relative_weight_value (positive #'s),
          ...
        }
      },
      "functions": (optional) [
        {
          "type": "magnitude | freshness | distance | tag",
          "boost": # (positive number used as multiplier for raw score != 1),
          "fieldName": "...",
          "interpolation": "constant | linear (default) | quadratic | logarithmic",
          "magnitude": {
            "boostingRangeStart": #,
            "boostingRangeEnd": #,
            "constantBoostBeyondRange": true | false (default)
          },
          "freshness": {
            "boostingDuration": "..." (value representing timespan leading to now over which boosting occurs)
          },
          "distance": {
            "referencePointParameter": "...", (parameter to be passed in queries to use as reference location)
            "boostingDistance": # (the distance in kilometers from the reference location where the boosting range ends)
          },
          "tag": {
            "tagsParameter": "..." (parameter to be passed in queries to specify a list of tags to compare against target fields)
          }
        }
      ],
      "functionAggregation": (optional, applies only when functions are specified) 
        "sum (default) | average | minimum | maximum | firstMatching"
    }
  ],
  "analyzers":(optional)[ ... ],
  "charFilters":(optional)[ ... ],
  "tokenizers":(optional)[ ... ],
  "tokenFilters":(optional)[ ... ],
  "defaultScoringProfile": (optional) "...",
  "corsOptions": (optional) {
    "allowedOrigins": ["*"] | ["origin_1", "origin_2", ...],
    "maxAgeInSeconds": (optional) max_age_in_seconds (non-negative integer)
  },
  "encryptionKey":(optional){
    "keyVaultUri": "azure_key_vault_uri",
    "keyVaultKeyName": "name_of_azure_key_vault_key",
    "keyVaultKeyVersion": "version_of_azure_key_vault_key",
    "accessCredentials":(optional){
      "applicationId": "azure_active_directory_application_id",
      "applicationSecret": "azure_active_directory_application_authentication_key"
    }
  }
}
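
To make the template above more concrete, here is a small sketch of an index definition expressed as a Python dictionary, together with a checker for a few of the rules stated in the template (one Edm.String key, searchable only on string types, and so on). The index name, field names, and the `check_index` helper are hypothetical, and the checker covers only the rules listed here, not everything the service validates.

```python
import json

STRING_TYPES = ("Edm.String", "Collection(Edm.String)")

# Hypothetical index definition that follows the template above.
hotels_index = {
    "name": "hotels-sketch",
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "HotelName", "type": "Edm.String", "searchable": True, "sortable": True},
        {"name": "Tags", "type": "Collection(Edm.String)", "searchable": True, "facetable": True},
        {"name": "Rating", "type": "Edm.Double", "filterable": True, "sortable": True},
        {"name": "Location", "type": "Edm.GeographyPoint", "filterable": True},
    ],
}


def check_index(definition):
    """Return a list of rule violations found in an index definition."""
    problems = []
    if not definition.get("name"):
        problems.append("index name is required")
    fields = definition.get("fields", [])
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append("exactly one key field is required")
    elif keys[0]["type"] != "Edm.String":
        problems.append("the key field must be Edm.String")
    for f in fields:
        if f.get("searchable") and f["type"] not in STRING_TYPES:
            problems.append(f"{f['name']}: only string fields can be searchable")
        if f.get("sortable") and f["type"] == "Collection(Edm.String)":
            problems.append(f"{f['name']}: string collections cannot be sortable")
        if f.get("facetable") and f["type"] == "Edm.GeographyPoint":
            problems.append(f"{f['name']}: geography points cannot be facetable")
    return problems


# The JSON text that would go in a request body.
request_body = json.dumps(hotels_index, indent=2)
```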

Fields collection and field attributes

Fields have a name, a type that classifies the stored data, and attributes that specify how the field is used.

Data types

Type | Description
Edm.String | Text that can optionally be tokenized for full-text search (word-breaking, stemming, and so forth).
Collection(Edm.String) | A list of strings that can optionally be tokenized for full-text search. There is no theoretical upper limit on the number of items in a collection, but the 16 MB upper limit on payload size applies to collections.
Edm.Boolean | Contains true/false values.
Edm.Int32 | 32-bit integer values.
Edm.Int64 | 64-bit integer values.
Edm.Double | Double-precision numeric data.
Edm.DateTimeOffset | Date time values represented in the OData V4 format (for example, yyyy-MM-ddTHH:mm:ss.fffZ or yyyy-MM-ddTHH:mm:ss.fff[+/-]HH:mm).
Edm.GeographyPoint | A point representing a geographic location on the globe.

For more information, see supported data types.
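
When preparing documents in code, date values need to be serialized in the Edm.DateTimeOffset format described in the table. The following sketch converts a timezone-aware Python datetime to the yyyy-MM-ddTHH:mm:ss.fffZ form; the function name is hypothetical, and normalizing to UTC is a choice made for this example rather than a service requirement.

```python
from datetime import datetime, timedelta, timezone


def to_edm_datetimeoffset(value):
    """Format a timezone-aware datetime as yyyy-MM-ddTHH:mm:ss.fffZ (UTC)."""
    if value.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = value.astimezone(timezone.utc)
    # strftime has no millisecond directive, so append the milliseconds manually.
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


# A value recorded at UTC+02:00 is normalized to UTC before formatting.
local_time = datetime(2020, 6, 30, 14, 30, 15, tzinfo=timezone(timedelta(hours=2)))
formatted = to_edm_datetimeoffset(local_time)
```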

Attributes

Field attributes determine how a field is used, such as whether it is used in full text search, faceted navigation, sort operations, and so forth.

String fields are often marked as "searchable" and "retrievable". Fields used to narrow search results include "sortable", "filterable", and "facetable".

Attribute | Description
"searchable" | Full-text searchable, subject to lexical analysis such as word-breaking during indexing. If you set a searchable field to a value like "sunny day", internally it will be split into the individual tokens "sunny" and "day". For details, see How full text search works.
"filterable" | Referenced in $filter queries. Filterable fields of type Edm.String or Collection(Edm.String) do not undergo word-breaking, so comparisons are for exact matches only. For example, if you set such a field f to "sunny day", $filter=f eq 'sunny' will find no matches, but $filter=f eq 'sunny day' will.
"sortable" | By default the system sorts results by score, but you can configure sorting based on fields in the documents. Fields of type Collection(Edm.String) cannot be "sortable".
"facetable" | Typically used in a presentation of search results that includes a hit count by category (for example, hotels in a specific city). This option cannot be used with fields of type Edm.GeographyPoint. Fields of type Edm.String that are "filterable", "sortable", or "facetable" can be at most 32 kilobytes in length. For details, see Create Index (REST API).
"key" | Unique identifier for documents within the index. Exactly one field must be chosen as the key field and it must be of type Edm.String.
"retrievable" | Determines whether the field can be returned in a search result. This is useful when you want to use a field (such as profit margin) as a filter, sorting, or scoring mechanism, but do not want the field to be visible to the end user. This attribute must be true for key fields.
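
The difference between "searchable" and "filterable" string fields can be illustrated with a toy model. This sketch is not how the service is implemented: it stands in for lexical analysis with a simple lowercase-and-split step, and all function names are hypothetical. It only mirrors the "sunny day" example from the table.

```python
def tokens(text):
    """Toy stand-in for lexical analysis: lowercase and split on whitespace."""
    return text.lower().split()


def search_matches(field_value, term):
    """A searchable field matches when the term equals one of its tokens."""
    return term.lower() in tokens(field_value)


def filter_eq(field_value, literal):
    """A filterable string field is compared verbatim, with no word-breaking."""
    return field_value == literal


value = "sunny day"
found_by_search = search_matches(value, "sunny")      # token match
found_by_filter = filter_eq(value, "sunny")           # no exact match
found_by_full_filter = filter_eq(value, "sunny day")  # exact match
```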

Although you can add new fields at any time, existing field definitions are locked in for the lifetime of the index. For this reason, developers typically use the portal for creating simple indexes, testing ideas, or looking up a setting. Frequent iteration over an index design is more efficient if you follow a code-based approach so that you can rebuild the index easily.

Note

The APIs you use to build an index have varying default behaviors. For the REST APIs, most attributes are enabled by default (for example, "searchable" and "retrievable" are true for string fields) and you often only need to set them if you want to turn them off. For the .NET SDK, the opposite is true. On any property you do not explicitly set, the default is to disable the corresponding search behavior unless you specifically enable it.

analyzers

The analyzers element sets the name of the language analyzer to use for the field. For more information about the range of analyzers available to you, see Adding analyzers to an Azure Cognitive Search index. Analyzers can only be used with searchable fields. Once an analyzer is assigned to a field, it cannot be changed unless you rebuild the index.

suggesters

A suggester is a section of the schema that defines which fields in an index are used to support auto-complete or type-ahead queries in searches. Typically, partial search strings are sent to the Suggestions (REST API) while the user is typing a search query, and the API returns a set of suggested documents or phrases.

Fields added to a suggester are used to build type-ahead search terms. All of the search terms are created during indexing and stored separately. For more information about creating a suggester structure, see Add suggesters.
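
As a sketch of how a suggester ties into the rest of the schema, the following example defines a suggester over two string fields, checks that every source field exists in the fields collection, and builds the URL for a suggestions query on a partial string. The field names, the suggester name "sg", the service and index names, and the helper functions are hypothetical; the API version is an assumption.

```python
from urllib.parse import urlencode

index_fragment = {
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True},
        {"name": "HotelName", "type": "Edm.String", "searchable": True},
        {"name": "City", "type": "Edm.String", "searchable": True},
    ],
    "suggesters": [
        {"name": "sg", "searchMode": "analyzingInfixMatching",
         "sourceFields": ["HotelName", "City"]}
    ],
}


def missing_source_fields(definition):
    """Return suggester source fields that are not in the fields collection."""
    known = {field["name"] for field in definition["fields"]}
    return [name
            for suggester in definition.get("suggesters", [])
            for name in suggester["sourceFields"]
            if name not in known]


def suggest_url(service, index, partial_text, suggester_name, api_version="2020-06-30"):
    """Build the URL for a suggestions query on a partial search string."""
    query = urlencode({"api-version": api_version, "search": partial_text,
                       "suggesterName": suggester_name})
    return f"https://{service}.search.windows.net/indexes/{index}/docs/suggest?{query}"
```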

corsOptions

Client-side JavaScript cannot call any APIs by default because the browser prevents all cross-origin requests. To allow cross-origin queries to your index, enable CORS (Cross-Origin Resource Sharing) by setting the corsOptions attribute. For security reasons, only query APIs support CORS.

The following options can be set for CORS:

  • allowedOrigins (required): This is a list of origins that will be granted access to your index. Any JavaScript code served from those origins will be allowed to query your index (assuming it provides the correct api-key). Each origin is typically of the form protocol://<fully-qualified-domain-name>:<port>, although <port> is often omitted. See Cross-origin resource sharing for more details.

    If you want to allow access to all origins, include * as a single item in the allowedOrigins array. This is not recommended practice for production search services, but it is often useful for development and debugging.

  • maxAgeInSeconds (optional): Browsers use this value to determine the duration (in seconds) to cache CORS preflight responses. This must be a non-negative integer. The larger this value is, the better performance will be, but the longer it will take for CORS policy changes to take effect. If it is not set, a default duration of 5 minutes is used.
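
The two options above map directly onto a small JSON object. The following sketch builds a corsOptions value and enforces the constraints described in the list; the `cors_options` helper and the example origin are hypothetical.

```python
def cors_options(allowed_origins, max_age_in_seconds=None):
    """Build a corsOptions object, enforcing the constraints described above."""
    origins = list(allowed_origins)
    if not origins:
        raise ValueError("allowedOrigins is required")
    if "*" in origins and len(origins) != 1:
        raise ValueError("'*' must be the only item in allowedOrigins")
    options = {"allowedOrigins": origins}
    if max_age_in_seconds is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if (isinstance(max_age_in_seconds, bool)
                or not isinstance(max_age_in_seconds, int)
                or max_age_in_seconds < 0):
            raise ValueError("maxAgeInSeconds must be a non-negative integer")
        options["maxAgeInSeconds"] = max_age_in_seconds
    return options


# Development setting: any origin, preflight responses cached for one minute.
dev_cors = cors_options(["*"], max_age_in_seconds=60)
# Production-style setting: a single named origin, default cache duration.
prod_cors = cors_options(["https://www.example.com"])
```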

scoringProfiles

A scoring profile is a section of the schema that defines custom scoring behaviors that let you influence which items appear higher in the search results. Scoring profiles are made up of field weights and functions. To use them, you specify a profile by name on the query string.

A default scoring profile operates behind the scenes to compute a search score for every item in a result set. You can use the internal, unnamed scoring profile. Alternatively, set defaultScoringProfile to use a custom profile as the default, invoked whenever a custom profile is not specified on the query string.
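
A minimal sketch of both options follows: a schema fragment with one weighted scoring profile that is also set as the default, a check that the default refers to a profile that exists, and a query string that names a profile explicitly. The profile name, field names, weights, and helper functions are hypothetical.

```python
from urllib.parse import urlencode

index_fragment = {
    "scoringProfiles": [
        {"name": "boostName",
         "text": {"weights": {"HotelName": 3, "Description": 1}}}
    ],
    "defaultScoringProfile": "boostName",
}


def default_profile_is_defined(definition):
    """Check that defaultScoringProfile, if set, names an existing profile."""
    names = {profile["name"] for profile in definition.get("scoringProfiles", [])}
    default = definition.get("defaultScoringProfile")
    return default is None or default in names


def query_string(text, profile=None):
    """Build query parameters, optionally naming a scoring profile."""
    params = {"search": text}
    if profile:
        params["scoringProfile"] = profile
    return urlencode(params)
```

When no profile is passed to `query_string`, the query relies on defaultScoringProfile if one is set, or on the internal unnamed profile otherwise.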

Attributes and index size (storage implications)

The size of an index is determined by the size of the documents you upload, plus index configuration, such as whether you include suggesters, and how you set attributes on individual fields.

The following screenshot illustrates index storage patterns resulting from various combinations of attributes. The index is based on the real estate sample index, which you can create easily using the Import data wizard. Although the index schemas are not shown, you can infer the attributes based on the index name. For example, the realestate-searchable index has the "searchable" attribute selected and nothing else, the realestate-retrievable index has the "retrievable" attribute selected and nothing else, and so forth.

(Screenshot: Index size based on attribute selection)

Although these index variants are artificial, we can refer to them for broad comparisons of how attributes affect storage. Does setting "retrievable" increase index size? No. Does adding fields to a suggester increase index size? Yes.

Indexes that support filter and sort are proportionally larger than indexes supporting just full text search. This is because filter and sort operations scan for exact matches, requiring the presence of verbatim text strings. In contrast, searchable fields supporting full-text queries use inverted indexes, which are populated with tokenized terms that consume less space than whole documents.
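
To give a feel for what an inverted index is, the following toy sketch maps each term to the set of documents that contain it. It is a conceptual illustration only and says nothing about the actual storage format of the service; the sample documents and the function name are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document keys that contain it."""
    inverted = defaultdict(set)
    for doc_key, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_key)
    return dict(inverted)


docs = {
    "1": "sunny beach hotel",
    "2": "sunny city hotel",
    "3": "mountain lodge",
}
inverted = build_inverted_index(docs)
# Each distinct term is stored once, however many documents contain it.
hotel_docs = inverted["hotel"]
```

A filter or sort, by contrast, needs the complete original string for each document, which is why those attributes add to the storage footprint.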

Note

Storage architecture is considered an implementation detail of Azure Cognitive Search and could change without notice. There is no guarantee that current behavior will persist in the future.

Next steps

With an understanding of index composition, you can continue in the portal to create your first index. We recommend starting with the Import data wizard, choosing either the realestate-us-sample or hotels-sample hosted data source.

For both data sets, the wizard can infer an index schema, import the data, and output a searchable index that you can query using Search explorer. Find these data sources in the Connect to your data page of the Import data wizard.

Create a sample index