How to index large data sets in Azure Cognitive Search

Azure Cognitive Search supports two basic approaches for importing data into a search index: pushing your data into the index programmatically, or pointing an Azure Cognitive Search indexer at a supported data source to pull in the data.

As data volumes grow or processing needs change, you might find that simple or default indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.

The same techniques also apply to long-running processes. In particular, the steps outlined in parallel indexing are helpful for computationally intensive indexing, such as image analysis or natural language processing in an AI enrichment pipeline.

The following sections explore techniques for indexing large amounts of data using both the push API and indexers.

Push API

When pushing data into an index, there are several key considerations that impact indexing speeds for the push API. These factors are outlined in the following sections.

In addition to the information in this article, you can also take advantage of the code samples in the optimizing indexing speeds tutorial to learn more.

Service tier and number of partitions/replicas

Adding partitions or increasing the tier of your search service will both increase indexing speeds.

Adding replicas may also increase indexing speeds, but this isn't guaranteed. On the other hand, additional replicas will increase the query volume your search service can handle. Replicas are also a key component for getting an SLA.

Before adding partitions/replicas or upgrading to a higher tier, consider the monetary cost and allocation time. Adding partitions can significantly increase indexing speed, but adding or removing them can take anywhere from 15 minutes to several hours. For more information, see the documentation on adjusting capacity.

Index Schema

The schema of your index plays an important role in indexing data. Adding fields and adding additional properties to those fields (such as searchable, facetable, or filterable) both reduce indexing speeds.

In general, we recommend only adding additional properties to fields if you intend to use them.

Note

To keep document size down, avoid adding non-queryable data to an index. Images and other binary data are not directly searchable and shouldn't be stored in the index. To integrate non-queryable data into search results, you should define a non-searchable field that stores a URL reference to the resource.
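
The following is a minimal sketch of this guidance using the Azure.Search.Documents .NET SDK (version 11). The HotelDocument model and its field names are illustrative, not part of this article: properties are enabled only where they will be used, and the image itself stays outside the index, represented only by a URL reference in a non-searchable field.

    using Azure.Search.Documents.Indexes;

    // Illustrative model: enable attributes only for behaviors you intend to use.
    public class HotelDocument
    {
        [SimpleField(IsKey = true)]
        public string HotelId { get; set; }

        // Full-text searchable, but not facetable, filterable, or sortable.
        [SearchableField]
        public string Description { get; set; }

        // Filterable and facetable only; not analyzed for full-text search.
        [SimpleField(IsFilterable = true, IsFacetable = true)]
        public string Category { get; set; }

        // Binary content (the image itself) is not stored in the index;
        // only a URL reference to the resource, in a non-searchable field.
        [SimpleField]
        public string ThumbnailUrl { get; set; }
    }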

Batch Size

One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the Add Documents REST API or the Index method in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
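
As a rough sketch of that request shape, the following uploads one batch through the Azure.Search.Documents .NET SDK. The service URL, admin key, index name, and the LoadNextThousandDocuments helper are placeholders, and HotelDocument is the illustrative model from the schema section above.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Azure;
    using Azure.Search.Documents;
    using Azure.Search.Documents.Models;

    // Placeholder service endpoint, admin key, and index name.
    var client = new SearchClient(
        new Uri("https://<service-name>.search.windows.net"),
        "hotels-index",
        new AzureKeyCredential("<admin-api-key>"));

    // Package up to 1,000 documents (and under 16 MB total) into one request.
    List<HotelDocument> documents = LoadNextThousandDocuments(); // hypothetical helper
    IndexDocumentsBatch<HotelDocument> batch = IndexDocumentsBatch.Upload(documents);
    IndexDocumentsResult result = await client.IndexDocumentsAsync(batch);

    // Each document in the batch reports its own success or failure.
    int failed = result.Results.Count(r => !r.Succeeded);
    Console.WriteLine($"Indexed {documents.Count - failed} documents; {failed} failed.");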

Using batches to index documents will significantly improve indexing performance. Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:

  • The schema of your index
  • The size of your data

Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. The optimizing indexing speeds tutorial provides sample code for testing batch sizes using the .NET SDK.
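
As a simplified illustration of that kind of test (not the tutorial's code), the loop below times uploads at a few candidate batch sizes. It assumes the client from the earlier sketch and a testDocuments list large enough to draw from, and it measures only the wall-clock time of a single request per size.

    // Assumes 'client' and a 'testDocuments' list as in the earlier sketch.
    foreach (int batchSize in new[] { 100, 500, 1000 })
    {
        var batch = IndexDocumentsBatch.Upload(testDocuments.Take(batchSize));

        var stopwatch = System.Diagnostics.Stopwatch.StartNew();
        await client.IndexDocumentsAsync(batch);
        stopwatch.Stop();

        Console.WriteLine(
            $"Batch size {batchSize}: {stopwatch.ElapsedMilliseconds} ms total, " +
            $"{(double)stopwatch.ElapsedMilliseconds / batchSize:F2} ms per document");
    }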

Number of threads/workers

To take full advantage of Azure Cognitive Search's indexing speeds, you'll likely need to use multiple threads to send batch indexing requests concurrently to the service.

The optimal number of threads is determined by:

  • The tier of your search service
  • The number of partitions
  • The size of your batches
  • The schema of your index

You can modify the sample code referenced earlier and test with different thread counts to determine the optimal thread count for your scenario. However, as long as you have several threads running concurrently, you should be able to take advantage of most of the efficiency gains.
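
One way to fan batches out across concurrent workers with a bounded degree of parallelism is sketched below. The batch size and concurrency values are illustrative starting points rather than recommendations, and the client and HotelDocument model are the same assumed ones used in the earlier sketches.

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;
    using Azure.Search.Documents;
    using Azure.Search.Documents.Models;

    async Task IndexInParallelAsync(
        SearchClient client, IReadOnlyList<HotelDocument> documents,
        int batchSize = 1000, int maxConcurrency = 8)
    {
        // Split the documents into batches of at most 'batchSize'.
        var batches = new List<List<HotelDocument>>();
        for (int i = 0; i < documents.Count; i += batchSize)
            batches.Add(documents.Skip(i).Take(batchSize).ToList());

        // Bound the number of in-flight requests with a semaphore.
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var tasks = batches.Select(async batch =>
        {
            await throttle.WaitAsync();
            try
            {
                await client.IndexDocumentsAsync(IndexDocumentsBatch.Upload(batch));
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(tasks);
    }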

Note

As you increase the tier of your search service or increase the partitions, you should also increase the number of concurrent threads.

As you ramp up the requests hitting the search service, you may encounter HTTP status codes indicating the request didn't fully succeed. During indexing, two common HTTP status codes are:

  • 503 Service Unavailable - This error means that the system is under heavy load and your request can't be processed at this time.
  • 207 Multi-Status - This response means that some documents succeeded, but at least one failed.

Retry strategy

If a failure happens, requests should be retried using an exponential backoff retry strategy.

Azure Cognitive Search's .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as Polly can also be used to implement a retry strategy.
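
A minimal sketch of that 207 handling follows: it re-submits only the documents whose keys came back as failed, waiting exponentially longer between attempts. The maxAttempts value, backoff schedule, and HotelId key field are illustrative; the SDK's built-in retry policy still covers 503s and other request-level failures.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Azure.Search.Documents;
    using Azure.Search.Documents.Models;

    async Task UploadWithRetriesAsync(
        SearchClient client, List<HotelDocument> documents, int maxAttempts = 5)
    {
        var remaining = documents;
        for (int attempt = 1; attempt <= maxAttempts && remaining.Count > 0; attempt++)
        {
            IndexDocumentsResult result =
                await client.IndexDocumentsAsync(IndexDocumentsBatch.Upload(remaining));

            // On a 207 Multi-Status response, keep only the documents that failed.
            var failedKeys = new HashSet<string>(
                result.Results.Where(r => !r.Succeeded).Select(r => r.Key));
            remaining = remaining.Where(d => failedKeys.Contains(d.HotelId)).ToList();

            if (remaining.Count > 0)
            {
                // Exponential backoff: wait 2, 4, 8, ... seconds before retrying.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }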

Network data transfer speeds

Network data transfer speeds can be a limiting factor when indexing data. Indexing data from within your Azure environment is an easy way to speed up indexing.

Indexers

Indexers are used to crawl supported Azure data sources for searchable content. While not specifically intended for large-scale indexing, several indexer capabilities are particularly useful for accommodating larger data sets:

  • Schedulers allow you to parcel out indexing at regular intervals so that you can spread it out over time.
  • Scheduled indexing can resume at the last known stopping point. If a data source is not fully crawled within a 24-hour window, the indexer will resume indexing on day two wherever it left off.
  • Partitioning data into smaller individual data sources enables parallel processing. You can break up source data into smaller components, such as multiple containers in Azure Blob storage, and then create multiple corresponding data source objects in Azure Cognitive Search that can be indexed in parallel.

Note

Indexers are data-source-specific, so using an indexer approach is only viable for selected data sources on Azure: SQL Database, Blob storage, Table storage, and Cosmos DB.

Batch Size

As with the push API, indexers allow you to configure the number of items per batch. For indexers based on the Create Indexer REST API, you can set the batchSize argument to customize this setting to better match the characteristics of your data.

Default batch sizes are data source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
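
If you define indexers through the .NET SDK rather than calling the REST API directly, the equivalent knob is IndexingParameters.BatchSize. The sketch below sets it explicitly; the service, data source, and index names are placeholders, and the data source and target index are assumed to already exist.

    using System;
    using Azure;
    using Azure.Search.Documents.Indexes;
    using Azure.Search.Documents.Indexes.Models;

    // Placeholder service endpoint and admin key.
    var indexerClient = new SearchIndexerClient(
        new Uri("https://<service-name>.search.windows.net"),
        new AzureKeyCredential("<admin-api-key>"));

    var indexer = new SearchIndexer(
        name: "hotels-sql-indexer",
        dataSourceName: "hotels-sql-datasource",   // assumed to exist
        targetIndexName: "hotels-index")           // assumed to exist
    {
        // Counterpart of the batchSize parameter on the Create Indexer REST API.
        Parameters = new IndexingParameters { BatchSize = 500 }
    };

    await indexerClient.CreateOrUpdateIndexerAsync(indexer);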

Scheduled indexing

Indexer scheduling is an important mechanism for processing large data sets, as well as slow-running processes like image analysis in a cognitive search pipeline. Indexer processing operates within a 24-hour window. If processing fails to finish within 24 hours, the behaviors of indexer scheduling can work to your advantage.

By design, scheduled indexing starts at specific intervals, with a job typically completing before resuming at the next scheduled interval. However, if processing does not complete within the interval, the indexer stops (because it ran out of time). At the next interval, processing resumes where it last left off, with the system keeping track of where that occurs.

In practical terms, for index loads spanning several days, you can put the indexer on a 24-hour schedule. When indexing resumes for the next 24-hour cycle, it restarts at the last known good document. In this way, an indexer can work its way through a document backlog over a series of days until all unprocessed documents are processed. For more information about this approach, see Indexing large datasets in Azure Blob storage. For more information about setting schedules in general, see Create Indexer REST API or How to schedule indexers for Azure Cognitive Search.
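
Continuing the illustrative indexer and indexerClient from the batch-size sketch above, a 24-hour schedule set through the .NET SDK looks roughly like this:

    // Run once every 24 hours; a long-running load resumes each day
    // at the last known good document.
    indexer.Schedule = new IndexingSchedule(TimeSpan.FromDays(1))
    {
        StartTime = DateTimeOffset.UtcNow
    };

    await indexerClient.CreateOrUpdateIndexerAsync(indexer);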

Parallel indexing

A parallel indexing strategy is based on indexing multiple data sources in unison, where each data source definition specifies a subset of the data.

For non-routine, computationally intensive indexing requirements - such as OCR on scanned documents in a cognitive search pipeline, image analysis, or natural language processing - a parallel indexing strategy is often the right approach for completing a long-running process in the shortest time. If you can eliminate or reduce query requests, parallel indexing on a service that is not simultaneously handling queries is your best strategy option for working through a large body of slow-processing content.

Parallel processing has these elements:

  • Subdivide source data among multiple containers or multiple virtual folders inside the same container.
  • Map each mini data set to its own data source, paired with its own indexer.
  • For cognitive search, reference the same skillset in each indexer definition.
  • Write into the same target search index.
  • Schedule all indexers to run at the same time.

Note

In Azure Cognitive Search, you cannot assign individual replicas or partitions to indexing or query processing. The system determines how resources are used. To understand the impact on query performance, you might try parallel indexing in a test environment before rolling it into production.

How to configure parallel indexing

For indexers, processing capacity is loosely based on one indexer subsystem for each service unit (SU) used by your search service. Multiple concurrent indexers are possible on Azure Cognitive Search services provisioned on Basic or Standard tiers having at least two replicas.

  1. Azure 门户中,在搜索服务仪表板的“概述”页上,选中“定价层”以确认它能够适应并行索引。 In the Azure portal, on your search service dashboard Overview page, check the Pricing tier to confirm it can accommodate parallel indexing. 基本和标准层提供多个副本。Both Basic and Standard tiers offer multiple replicas.

  2. You can run as many indexers in parallel as the number of search units in your service. In Settings > Scale, increase replicas or partitions for parallel processing: one additional replica or partition for each indexer workload. Leave a sufficient number for existing query volume. Sacrificing query workloads for indexing is not a good tradeoff.

  3. Distribute data into multiple containers at a level that Azure Cognitive Search indexers can reach. This could be multiple tables in Azure SQL Database, multiple containers in Azure Blob storage, or multiple collections. Define one data source object for each table or container.

  4. Create and schedule multiple indexers to run in parallel:

    • Assume a service with six replicas. Configure six indexers, each one mapped to a data source containing one-sixth of the data set for a 6-way split of the entire data set.

    • Point each indexer to the same index. For cognitive search workloads, point each indexer to the same skillset.

    • Within each indexer definition, schedule the same run-time execution pattern. For example, "schedule" : { "interval" : "PT8H", "startTime" : "2018-05-15T00:00:00Z" } creates a schedule starting on 2018-05-15 for all indexers, running at eight-hour intervals.

At the scheduled time, all indexers begin execution, loading data, applying enrichments (if you configured a cognitive search pipeline), and writing to the index. Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, with retry if a particular write does not succeed on first attempt.
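
Put together, a sketch of the six-way split described above might look like the following, reusing the illustrative indexerClient from the earlier sketches. The data source, index, and skillset names are placeholders, and each numbered data source is assumed to hold one-sixth of the content.

    // Six indexers, six data sources, one target index, one shared schedule.
    var schedule = new IndexingSchedule(TimeSpan.FromHours(8))
    {
        StartTime = new DateTimeOffset(2018, 5, 15, 0, 0, 0, TimeSpan.Zero)
    };

    for (int i = 1; i <= 6; i++)
    {
        var parallelIndexer = new SearchIndexer(
            name: $"hotels-indexer-{i}",
            dataSourceName: $"hotels-datasource-{i}",  // one-sixth of the data each
            targetIndexName: "hotels-index")           // all write to the same index
        {
            SkillsetName = "hotels-skillset",          // same skillset for cognitive search
            Schedule = schedule
        };

        await indexerClient.CreateOrUpdateIndexerAsync(parallelIndexer);
    }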

Note

When increasing replicas, consider increasing the partition count if index size is projected to increase significantly. Partitions store slices of indexed content; the more partitions you have, the smaller the slice each one has to store.

See also