Indexers in Azure Cognitive Search

An indexer in Azure Cognitive Search is a crawler that extracts searchable data and metadata from an external Azure data source and populates an index based on field-to-field mappings between the index and your data source. This approach is sometimes referred to as a "pull model" because the service pulls data in without you having to write any code that adds data to an index.

Indexers are based on data source types or platforms, with individual indexers for SQL Server on Azure, Cosmos DB, Azure Table Storage, and Blob Storage. Blob storage indexers have additional properties specific to blob content types.

You can use an indexer as the sole means of data ingestion, or combine it with other techniques, using the indexer to load just some of the fields in your index.

You can run indexers on demand or on a recurring data refresh schedule that runs as often as every five minutes. More frequent updates require a push model that simultaneously updates data in both Azure Cognitive Search and your external data source.
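As a sketch, a recurring schedule is expressed in the indexer definition as an ISO 8601 interval. The names below ("hotels-indexer" and so on) are hypothetical, not from this article:

```python
# Illustrative sketch of the "schedule" section of an indexer definition.
# "PT5M" is an ISO 8601 duration meaning five minutes, the shortest
# interval mentioned above. All names here are hypothetical.
indexer_definition = {
    "name": "hotels-indexer",        # hypothetical indexer name
    "dataSourceName": "hotels-ds",   # hypothetical data source
    "targetIndexName": "hotels",     # hypothetical index
    "schedule": {
        "interval": "PT5M",                    # run every five minutes
        "startTime": "2024-01-01T00:00:00Z",   # first scheduled run (UTC)
    },
}
```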

Approaches for creating and managing indexers

You can create and manage indexers in the portal or programmatically through the REST and .NET APIs.

Initially, a new indexer is announced as a preview feature. Preview features are introduced in the APIs (REST and .NET) and then integrated into the portal after graduating to general availability. If you're evaluating a new indexer, plan on writing code.

Permissions

All operations related to indexers, including GET requests for status or definitions, require an admin api-key.

Supported data sources

Indexers crawl data stores on Azure.

Indexer stages

On an initial run, when the index is empty, an indexer reads in all of the data provided in the table or container. On subsequent runs, the indexer can usually detect and retrieve just the data that has changed. For blob data, change detection is automatic. For other data sources like Azure SQL or Cosmos DB, change detection must be enabled.
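For Azure SQL, enabling change detection means attaching a change detection policy to the data source definition. A minimal sketch, assuming SQL integrated change tracking is enabled on the database (the data source name, container, and connection string below are hypothetical):

```python
# Sketch: an Azure SQL data source definition with change detection enabled,
# so subsequent indexer runs pick up only changed rows. Names other than the
# @odata.type identifier are hypothetical placeholders.
data_source = {
    "name": "hotels-ds",
    "type": "azuresql",
    "credentials": {"connectionString": "<connection string>"},
    "container": {"name": "Hotels"},
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"
    },
}
```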

For each document it ingests, an indexer implements or coordinates multiple steps, from document retrieval to a final search engine "handoff" for indexing. Optionally, an indexer also drives skillset execution and outputs, assuming a skillset is defined.


Stage 1: Document cracking

Document cracking is the process of opening files and extracting content. Depending on the type of data source, the indexer tries different operations to extract potentially indexable content.

Examples:

  • When the document is a record in an Azure SQL data source, the indexer extracts each of the record's fields.
  • When the document is a PDF file in an Azure Blob Storage data source, the indexer extracts the text, images, and metadata of the file.
  • When the document is a record in a Cosmos DB data source, the indexer extracts the fields and subfields from the Cosmos DB document.

Stage 2: Field mappings

An indexer extracts text from a source field and sends it to a destination field in an index or knowledge store. When field names and types coincide, the path is clear. However, you might want different names or types in the output, in which case you need to tell the indexer how to map the field. This step occurs after document cracking, but before transformations, when the indexer is reading from the source documents. When you define a field mapping, the value of the source field is sent as-is to the destination field with no modifications. Field mappings are optional.
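A sketch of what the "fieldMappings" section of an indexer definition can look like. The field names are illustrative; the second entry uses the built-in base64Encode mapping function, commonly applied to make blob paths safe for use as document keys:

```python
# Sketch of a "fieldMappings" section. The first mapping simply renames a
# source field; the second also applies the built-in base64Encode mapping
# function. All field names here are hypothetical.
field_mappings = [
    {"sourceFieldName": "_id", "targetFieldName": "HotelId"},
    {
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "Url",
        "mappingFunction": {"name": "base64Encode"},
    },
]
```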

Stage 3: Skillset execution

Skillset execution is an optional step that invokes built-in or custom AI processing. You might need it for optical character recognition (OCR) in the form of image analysis, or for language translation. Whatever the transformation, skillset execution is where enrichment occurs. If an indexer is a pipeline, you can think of a skillset as a "pipeline within the pipeline". A skillset has its own sequence of steps, called skills.
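As one example of built-in processing, a minimal skillset with a single OCR skill might be sketched like this. The skillset name and output names are hypothetical; the @odata.type string is the documented identifier for the built-in OCR skill:

```python
# Sketch of a minimal skillset containing one built-in OCR skill that runs
# over images produced during document cracking. The skillset name and the
# "ocrText" output name are hypothetical.
skillset = {
    "name": "demo-skillset",
    "description": "Run OCR over images found during document cracking",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "ocrText"}],
        }
    ],
}
```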

Stage 4: Output field mappings

The output of a skillset is really a tree of information called the enriched document. Output field mappings allow you to select which parts of this tree to map into fields in your index. Learn how to define output field mappings.

Just like field mappings, which associate verbatim values from source to destination fields, output field mappings tell the indexer how to associate the transformed values in the enriched document with destination fields in the index. Unlike field mappings, which are considered optional, you must always define an output field mapping for any transformed content that needs to reside in an index.
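A sketch of an "outputFieldMappings" section, mapping a node in the enriched document tree to an index field. The path assumes a skill that emitted a hypothetical "ocrText" output; both the path and the target field name are illustrative:

```python
# Sketch of an "outputFieldMappings" section of an indexer definition.
# It maps a node in the enriched document tree (here, a hypothetical
# "ocrText" skill output) to a field in the index.
output_field_mappings = [
    {
        "sourceFieldName": "/document/normalized_images/*/ocrText",
        "targetFieldName": "extractedText",
    }
]
```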

Sample debug session

Basic configuration steps

Indexers can offer features that are unique to a data source, so some aspects of indexer or data source configuration vary by indexer type. However, all indexers share the same basic composition and requirements. The steps common to all indexers are covered below.

Step 1: Create a data source

An indexer obtains its data source connection from a data source object. The data source definition provides a connection string and possibly credentials. Call the Create Data Source REST API or the DataSource class to create the resource.
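A minimal sketch of this call using only the standard library. The service URL keeps the article's placeholder, and the data source name, container, and connection string are hypothetical:

```python
import json
import urllib.request

# Sketch: creating a data source with the Create Data Source REST API.
# The service URL keeps the article's placeholder; all other names and
# the connection string are hypothetical.
service = "https://[service name].search.azure.cn"
api_key = "<admin api-key>"

data_source = {
    "name": "hotels-ds",
    "type": "azuresql",  # or: cosmosdb, azureblob, azuretable
    "credentials": {"connectionString": "<connection string>"},
    "container": {"name": "Hotels"},  # table or view to crawl
}

request = urllib.request.Request(
    f"{service}/datasources?api-version=2020-06-30",
    data=json.dumps(data_source).encode("utf-8"),
    headers={"Content-Type": "application/json", "api-key": api_key},
    method="POST",
)
# urllib.request.urlopen(request)  # uncomment to run against a real service
```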

Data sources are configured and managed independently of the indexers that use them, which means a data source can be used by multiple indexers to load more than one index at a time.

Step 2: Create an index

An indexer automates some tasks related to data ingestion, but creating an index is generally not one of them. As a prerequisite, you must have a predefined index whose fields match those in your external data source. Fields need to match by name and data type. For more information about structuring an index, see Create an Index (Azure Cognitive Search REST API) or the Index class. For help with field associations, see Field mappings in Azure Cognitive Search indexers.
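A sketch of such a predefined index schema, whose fields would match the source by name and data type. The index and field names are hypothetical:

```python
# Sketch of a predefined index schema. Fields must match the external data
# source by name and data type; all names here are hypothetical.
index_schema = {
    "name": "hotels",
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "HotelName", "type": "Edm.String", "searchable": True},
        {"name": "Rating", "type": "Edm.Double", "filterable": True, "sortable": True},
    ],
}
```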

Tip

Although indexers cannot generate an index for you, the Import data wizard in the portal can help. In most cases, the wizard can infer an index schema from existing metadata in the source, presenting a preliminary index schema that you can edit inline while the wizard is active. Once the index is created on the service, further edits in the portal are mostly limited to adding new fields. Think of the wizard as a tool for creating, but not revising, an index. For hands-on learning, step through the portal walkthrough.

Step 3: Create and schedule the indexer

The indexer definition is a construct that brings together all of the elements related to data ingestion. Required elements include a data source and an index. Optional elements include a schedule and field mappings. Field mappings are optional only if source fields and index fields clearly correspond. For more information about structuring an indexer, see Create Indexer (Azure Cognitive Search REST API).
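A sketch of a complete indexer definition tying together the required elements (data source, index) and the optional ones (schedule, field mappings). All names are hypothetical:

```python
# Sketch of a complete indexer definition. Required elements: data source
# and target index. Optional elements: schedule and field mappings.
# All names are hypothetical.
indexer = {
    "name": "hotels-indexer",
    "dataSourceName": "hotels-ds",     # required: where data comes from
    "targetIndexName": "hotels",       # required: where data goes
    "schedule": {"interval": "PT2H"},  # optional: run every two hours
    "fieldMappings": [                 # optional: only when names differ
        {"sourceFieldName": "_id", "targetFieldName": "HotelId"}
    ],
}
```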

Run indexers on demand

While it's common to schedule indexing, an indexer can also be invoked on demand using the Run command:

POST https://[service name].search.azure.cn/indexers/[indexer name]/run?api-version=2020-06-30
api-key: [Search service admin key]

Note

When the Run API returns successfully, the indexer invocation has been scheduled, but the actual processing happens asynchronously.

You can monitor indexer status in the portal or through the Get Indexer Status API.

Get indexer status

You can retrieve the status and execution history of an indexer through the Get Indexer Status command:

GET https://[service name].search.azure.cn/indexers/[indexer name]/status?api-version=2020-06-30
api-key: [Search service admin key]

The response contains the overall indexer status, the last (or in-progress) indexer invocation, and the history of recent indexer invocations.
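A short sketch of consuming such a response with the standard library, using an abbreviated version of the sample body shown below:

```python
import json

# Abbreviated sample "Get Indexer Status" response body, based on the
# example in this article.
status_json = """
{
  "status": "running",
  "lastResult": {"status": "success", "itemsProcessed": 11, "itemsFailed": 0},
  "executionHistory": [
    {"status": "success", "itemsProcessed": 11, "itemsFailed": 0}
  ]
}
"""

status = json.loads(status_json)
last = status["lastResult"]
print(f'indexer: {status["status"]}, last run: {last["status"]}, '
      f'{last["itemsProcessed"]} items, {last["itemsFailed"]} failures')
# prints: indexer: running, last run: success, 11 items, 0 failures
```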

{
    "status": "running",
    "lastResult": {
        "status": "success",
        "errorMessage": null,
        "startTime": "2018-11-26T03:37:18.853Z",
        "endTime": "2018-11-26T03:37:19.012Z",
        "errors": [],
        "itemsProcessed": 11,
        "itemsFailed": 0,
        "initialTrackingState": null,
        "finalTrackingState": null
    },
    "executionHistory": [
        {
            "status": "success",
            "errorMessage": null,
            "startTime": "2018-11-26T03:37:18.853Z",
            "endTime": "2018-11-26T03:37:19.012Z",
            "errors": [],
            "itemsProcessed": 11,
            "itemsFailed": 0,
            "initialTrackingState": null,
            "finalTrackingState": null
        }
    ]
}

Execution history contains up to the 50 most recent completed executions, sorted in reverse chronological order (the latest execution appears first in the response).

Next steps

Now that you have the basic idea, the next step is to review the requirements and tasks specific to each data source type.