Indexers in Azure Cognitive Search

An indexer in Azure Cognitive Search is a crawler that extracts searchable data and metadata from an external Azure data source and populates an index based on field-to-field mappings between the index and your data source. This approach is sometimes called a "pull model" because the service pulls data in without you having to write any code that adds data to an index.

Each indexer is specific to a data source type or platform: there are individual indexers for SQL Server on Azure, Cosmos DB, Azure Table Storage, and Blob Storage. Blob storage indexers have additional properties specific to blob content types.

You can use an indexer as the sole means of data ingestion, or combine techniques and use an indexer to load just some of the fields in your index.

You can run indexers on demand or on a recurring data refresh schedule that runs as often as every five minutes. More frequent updates require a push model that simultaneously updates data in both Azure Cognitive Search and your external data source.

Approaches for creating and managing indexers

You can create and manage indexers through the portal, the REST API, or the .NET SDK.

Initially, a new indexer is announced as a preview feature. Preview features are introduced in the APIs (REST and .NET) first and are integrated into the portal only after graduating to general availability. If you're evaluating a new indexer, you should plan on writing code.

Permissions

All operations related to indexers, including GET requests for status or definitions, require an admin api-key.

Supported data sources

Indexers crawl data stores on Azure.

Basic configuration steps

Indexers can offer features that are unique to the data source, so some aspects of indexer or data source configuration vary by indexer type. However, all indexers share the same basic composition and requirements. The steps common to all indexers are covered below.

Step 1: Create a data source

An indexer obtains its data source connection from a data source object. The data source definition provides a connection string and possibly credentials. Call the Create Datasource REST API or use the DataSource class to create the resource.

Data sources are configured and managed independently of the indexers that use them, which means a single data source can be used by multiple indexers to load more than one index at a time.
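As a rough illustration of the step above, the following Python sketch builds (but does not send) a request that creates a data source. The /datasources path is assumed to follow the same URL pattern as the indexer requests shown later in this article, and the service name, data source name, table name, and connection string are hypothetical placeholders.

```python
import json
import urllib.request

API_VERSION = "2019-05-06"  # same version as the other requests in this article


def build_datasource_request(service_name, admin_key, definition):
    """Build, without sending, a POST request that creates a data source."""
    url = (
        f"https://{service_name}.search.azure.cn/datasources"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )


# Hypothetical definition: every value below is a placeholder.
datasource_definition = {
    "name": "hotels-datasource",
    "type": "azuresql",
    "credentials": {"connectionString": "<your connection string>"},
    "container": {"name": "Hotels"},
}

request = build_datasource_request(
    "my-service", "<admin key>", datasource_definition
)
# Sending is left commented out so the sketch has no side effects:
# urllib.request.urlopen(request)
```

Because the data source is a separate resource, the same definition can be referenced by name from more than one indexer.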

Step 2: Create an index

An indexer automates some tasks related to data ingestion, but creating an index is generally not one of them. As a prerequisite, you must have a predefined index whose fields match those in your external data source. Fields need to match by name and data type. For more information about structuring an index, see Create an Index (Azure Cognitive Search REST API) or the Index class. For help with field associations, see Field mappings in Azure Cognitive Search indexers.
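To make the "match by name and data type" requirement concrete, here is a small Python sketch that compares a source schema against an index definition before an indexer is created. The schema, index name, and field names are hypothetical, and strict type equality is a simplifying assumption; the service's actual type compatibility rules depend on the data source.

```python
# Hypothetical source schema: column name -> the Edm type it maps to.
source_columns = {
    "HotelId": "Edm.String",
    "HotelName": "Edm.String",
    "Rating": "Edm.Int32",
}

# Hypothetical index definition in the shape used by the Create Index API.
index_definition = {
    "name": "hotels-index",
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True},
        {"name": "HotelName", "type": "Edm.String", "searchable": True},
        {"name": "Rating", "type": "Edm.Double", "filterable": True},
    ],
}


def find_mismatches(columns, index):
    """List source columns that have no index field of the same name and type."""
    fields = {field["name"]: field["type"] for field in index["fields"]}
    problems = []
    for name, edm_type in columns.items():
        if name not in fields:
            problems.append(f"{name}: no index field with this name")
        elif fields[name] != edm_type:
            problems.append(
                f"{name}: source is {edm_type}, index is {fields[name]}"
            )
    return problems


mismatches = find_mismatches(source_columns, index_definition)
for problem in mismatches:
    print(problem)
```

A column reported here would need either a change to the index schema or an explicit field mapping in the indexer definition.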

Tip

Although indexers cannot generate an index for you, the Import data wizard in the portal can help. In most cases, the wizard can infer an index schema from existing metadata in the source and present a preliminary schema that you can edit in-line while the wizard is active. Once the index is created on the service, further edits in the portal are mostly limited to adding new fields. Consider the wizard for creating, but not revising, an index. For hands-on learning, step through the portal walkthrough.

Step 3: Create and schedule the indexer

The indexer definition is a construct that brings together all of the elements related to data ingestion. Required elements include a data source and an index. Optional elements include a schedule and field mappings. Field mappings are optional only if source fields and index fields clearly correspond. For more information about structuring an indexer, see Create Indexer (Azure Cognitive Search REST API).
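The required and optional elements described above can be sketched as a small Python helper that assembles an indexer definition. The property names follow the Create Indexer REST API as I understand it; the indexer, data source, index, and field names are hypothetical placeholders.

```python
import json


def build_indexer_definition(name, datasource, index,
                             interval=None, field_mappings=None):
    """Assemble an indexer definition; schedule and mappings are optional."""
    definition = {
        "name": name,
        "dataSourceName": datasource,   # required: an existing data source
        "targetIndexName": index,       # required: an existing index
    }
    if interval is not None:
        # ISO 8601 duration; "PT5M" is the five-minute minimum noted earlier.
        definition["schedule"] = {"interval": interval}
    if field_mappings:
        definition["fieldMappings"] = [
            {"sourceFieldName": source, "targetFieldName": target}
            for source, target in field_mappings.items()
        ]
    return definition


# Hypothetical names throughout.
minimal = build_indexer_definition(
    "hotels-indexer", "hotels-datasource", "hotels-index"
)
scheduled = build_indexer_definition(
    "hotels-indexer", "hotels-datasource", "hotels-index",
    interval="PT5M",
    field_mappings={"HotelId": "id"},
)
print(json.dumps(scheduled, indent=2))
```

The minimal form relies on source and index fields lining up by name and type; the second form adds a schedule and renames one field on the way in.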

Run indexers on demand

While it's common to schedule indexing, an indexer can also be invoked on demand using the Run command:

POST https://[service name].search.azure.cn/indexers/[indexer name]/run?api-version=2019-05-06
api-key: [Search service admin key]

Note

When the Run API returns successfully, the indexer invocation has been scheduled, but the actual processing happens asynchronously.

You can monitor the indexer status in the portal or through the Get Indexer Status API.
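Because Run only schedules the work, a caller that needs the outcome has to poll for it. The Python sketch below shows one way to do that against any function that returns the status document. The "inProgress" value is an assumption about how an unfinished execution is reported, and fetch_status is a hypothetical callable that you would back with a GET on the status endpoint.

```python
import time


def wait_for_completion(fetch_status, poll_seconds=5.0, max_polls=60,
                        sleep=time.sleep):
    """Poll until the last execution is no longer reported as in progress."""
    for _ in range(max_polls):
        status = fetch_status()
        last = status.get("lastResult")
        # Assumption: an unfinished run is reported as "inProgress".
        if last is not None and last.get("status") != "inProgress":
            return last
        sleep(poll_seconds)
    raise TimeoutError("indexer did not finish within the polling budget")


# Demonstration with canned responses instead of real HTTP calls.
canned = iter([
    {"status": "running", "lastResult": {"status": "inProgress"}},
    {"status": "running", "lastResult": {"status": "success",
                                         "itemsProcessed": 11}},
])
result = wait_for_completion(lambda: next(canned), sleep=lambda seconds: None)
print(result["status"], result["itemsProcessed"])
```

One simplification to be aware of: right after Run is called, the last result may still describe the previous execution, so a more careful version would also compare start times.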

Get indexer status

You can retrieve the status and execution history of an indexer through the Get Indexer Status command:

GET https://[service name].search.azure.cn/indexers/[indexer name]/status?api-version=2019-05-06
api-key: [Search service admin key]

The response contains the overall indexer status, the last (or in-progress) indexer invocation, and the history of recent indexer invocations.

{
    "status":"running",
    "lastResult": {
        "status":"success",
        "errorMessage":null,
        "startTime":"2018-11-26T03:37:18.853Z",
        "endTime":"2018-11-26T03:37:19.012Z",
        "errors":[],
        "itemsProcessed":11,
        "itemsFailed":0,
        "initialTrackingState":null,
        "finalTrackingState":null
    },
    "executionHistory":[ {
        "status":"success",
        "errorMessage":null,
        "startTime":"2018-11-26T03:37:18.853Z",
        "endTime":"2018-11-26T03:37:19.012Z",
        "errors":[],
        "itemsProcessed":11,
        "itemsFailed":0,
        "initialTrackingState":null,
        "finalTrackingState":null
    }]
}

Execution history contains up to the 50 most recent completed executions, sorted in reverse chronological order, so the latest execution comes first in the response.
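As an illustration of reading this response, the following Python sketch summarizes each entry in the execution history and derives its duration from the timestamps. It uses a trimmed copy of the sample response above; the helper name summarize_history is hypothetical.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Trimmed copy of the sample status response shown above.
sample_response = json.loads("""
{
    "status": "running",
    "executionHistory": [ {
        "status": "success",
        "errorMessage": null,
        "startTime": "2018-11-26T03:37:18.853Z",
        "endTime": "2018-11-26T03:37:19.012Z",
        "errors": [],
        "itemsProcessed": 11,
        "itemsFailed": 0
    } ]
}
""")


def summarize_history(status):
    """Return one summary per completed execution, newest first."""
    summaries = []
    for run in status.get("executionHistory", []):
        start = datetime.strptime(run["startTime"], TIMESTAMP_FORMAT)
        end = datetime.strptime(run["endTime"], TIMESTAMP_FORMAT)
        summaries.append({
            "status": run["status"],
            "processed": run["itemsProcessed"],
            "failed": run["itemsFailed"],
            "seconds": (end - start).total_seconds(),
        })
    return summaries


history = summarize_history(sample_response)
latest = history[0]  # the response lists the latest execution first
print(latest)
```

Since the history is already in reverse chronological order, the first element is the most recent completed run and no sorting is needed.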

Next steps

Now that you have the basic idea, the next step is to review the requirements and tasks specific to each data source type.