Tutorial: Optimize indexing with the push API

Azure Cognitive Search supports two basic approaches for importing data into a search index: pushing your data into the index programmatically, or pointing an Azure Cognitive Search indexer at a supported data source to pull in the data.

This tutorial describes how to efficiently index data using the push model by batching requests and using an exponential backoff retry strategy. You can download and run the application. This article explains the key aspects of the application and factors to consider when indexing data.

This tutorial uses C# and the .NET SDK to perform the following tasks:

  • Create an index
  • Test various batch sizes to determine the most efficient size
  • Index data asynchronously
  • Use multiple threads to increase indexing speeds
  • Use an exponential backoff retry strategy to retry failed items

If you don't have an Azure subscription, create a trial account before you begin.

Prerequisites

The following services and tools are required for this tutorial.

Download files

Source code for this tutorial is in the optimize-data-indexing folder in the Azure-Samples/azure-search-dotnet-samples GitHub repository.

Key considerations

When pushing data into an index, there are several key considerations that impact indexing speeds. You can learn more about these factors in the article on indexing large data sets.

Six key factors to consider are:

  • Service tier and number of partitions/replicas - Adding partitions and increasing your tier will both increase indexing speeds.
  • Index schema - Adding fields and adding additional properties to fields (such as searchable, facetable, or filterable) both reduce indexing speeds.
  • Batch size - The optimal batch size varies based on your index schema and dataset.
  • Number of threads/workers - A single thread won't take full advantage of indexing speeds.
  • Retry strategy - An exponential backoff retry strategy should be used to optimize indexing.
  • Network data transfer speeds - Data transfer speeds can be a limiting factor. Index data from within your Azure environment to increase data transfer speeds.

1 - Create Azure Cognitive Search service

To complete this tutorial, you'll need an Azure Cognitive Search service, which you can create in the portal. We recommend using the same tier you plan to use in production so that you can accurately test and optimize indexing speeds.

API calls require the service URL and an access key. A search service is created with both, so if you added Azure Cognitive Search to your subscription, follow these steps to get the necessary information:

  1. Sign in to the Azure portal, and in your search service Overview page, get the URL. An example endpoint might look like https://mydemo.search.azure.cn.

  2. In Settings > Keys, get an admin key for full rights on the service. There are two interchangeable admin keys, provided for business continuity in case you need to roll one over. You can use either the primary or secondary key on requests for adding, modifying, and deleting objects.

    Screenshot: Get an HTTP endpoint and access key

2 - Set up your environment

  1. Start Visual Studio and open OptimizeDataIndexing.sln.
  2. In Solution Explorer, open appsettings.json to provide connection information.
  3. For searchServiceName, if the full URL is "https://my-demo-service.search.azure.cn", the service name to provide is "my-demo-service".
{
  "SearchServiceName": "<YOUR-SEARCH-SERVICE-NAME>",
  "SearchServiceAdminApiKey": "<YOUR-ADMIN-API-KEY>",
  "SearchIndexName": "optimize-indexing"
}
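
With these connection settings in place, the program can construct its clients. The following is a minimal sketch, assuming the Microsoft.Azure.Search SDK used throughout this tutorial and the Microsoft.Extensions.Configuration package; the sample's actual startup code may differ:

using Microsoft.Azure.Search;
using Microsoft.Extensions.Configuration;

// Read the connection settings from appsettings.json
IConfigurationRoot configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json")
    .Build();

// SearchServiceClient manages indexes; ISearchIndexClient uploads documents
SearchServiceClient serviceClient = new SearchServiceClient(
    configuration["SearchServiceName"],
    new SearchCredentials(configuration["SearchServiceAdminApiKey"]));
ISearchIndexClient indexClient = serviceClient.Indexes.GetClient(configuration["SearchIndexName"]);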

3 - Explore the code

Once you update appsettings.json, the sample program in OptimizeDataIndexing.sln should be ready to build and run.

This code is derived from the C# Quickstart. You can find more detailed information on the basics of working with the .NET SDK in that article.

This simple C#/.NET console app performs the following tasks:

  • Creates a new index based on the data structure of the C# Hotel class (which also references the Address class)
  • Tests various batch sizes to determine the most efficient size
  • Indexes data asynchronously
    • Using multiple threads to increase indexing speeds
    • Using an exponential backoff retry strategy to retry failed items

Before running the program, take a minute to study the code and the index definitions for this sample. The relevant code is in several files:

  • Hotel.cs and Address.cs contain the schema that defines the index
  • DataGenerator.cs contains a simple class that makes it easy to create large amounts of hotel data
  • ExponentialBackoff.cs contains code to optimize the indexing process as described below
  • Program.cs contains functions that create and delete the Azure Cognitive Search index, index batches of data, and test different batch sizes

Creating the index

This sample program uses the .NET SDK to define and create an Azure Cognitive Search index. It takes advantage of the FieldBuilder class to generate an index structure from a C# data model class.

The data model is defined by the Hotel class, which also contains a reference to the Address class. FieldBuilder drills down through multiple class definitions to generate a complex data structure for the index. Metadata tags are used to define the attributes of each field, such as whether it's searchable or sortable.

The following snippets from the Hotel.cs file show how a single field, and a reference to another data model class, can be specified.

. . .
[IsSearchable, IsSortable]
public string HotelName { get; set; }
. . .
public Address Address { get; set; }
. . .

In the Program.cs file, the index is defined with a name and a field collection generated by the FieldBuilder.BuildForType<Hotel>() method, and then created as follows:

private static async Task CreateIndex(string indexName, SearchServiceClient searchService)
{
    // Create a new search index structure that matches the properties of the Hotel class.
    // The Address class is referenced from the Hotel class. The FieldBuilder
    // will enumerate these to create a complex data structure for the index.
    var definition = new Index()
    {
        Name = indexName,
        Fields = FieldBuilder.BuildForType<Hotel>()
    };
    await searchService.Indexes.CreateAsync(definition);
}

Generating data

A simple class is implemented in the DataGenerator.cs file to generate data for testing. The sole purpose of this class is to make it easy to generate a large number of documents with a unique ID for indexing.

To get a list of 100,000 hotels with unique IDs, run the following two lines of code:

DataGenerator dg = new DataGenerator();
List<Hotel> hotels = dg.GetHotels(100000, "large");

There are two sizes of hotels available for testing in this sample: small and large.

The schema of your index can have a significant impact on indexing speeds. Because of this impact, it makes sense to convert this class to generate data that matches your intended index schema after you run through this tutorial.
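
For illustration, a generator of this kind can be as simple as the following sketch, which assigns each hotel a unique key. This is hypothetical code, not the sample's actual implementation; see DataGenerator.cs for that.

public List<Hotel> GetHotels(int count, string size)
{
    List<Hotel> hotels = new List<Hotel>();
    for (int i = 0; i < count; i++)
    {
        hotels.Add(new Hotel
        {
            // Every document needs a unique key to be indexed
            HotelId = Guid.NewGuid().ToString(),
            // A "large" hotel would carry longer field values than a "small" one
            HotelName = (size == "large" ? "Large hotel " : "Small hotel ") + i
        });
    }
    return hotels;
}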

4 - Test batch sizes

Azure Cognitive Search supports APIs for loading single or multiple documents into an index.

Indexing documents in batches will significantly improve indexing performance. These batches can be up to 1,000 documents, or up to about 16 MB per batch.
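
The TestBatchSizes function below sends each batch through an UploadDocuments helper. Here's a minimal sketch of such a helper, assuming the IndexBatch and IndexAsync APIs used elsewhere in this tutorial; the sample's actual implementation may differ:

private static async Task UploadDocuments(ISearchIndexClient indexClient, List<Hotel> hotels)
{
    // Wrap the documents in a single upload batch and send it to the service
    IndexBatch<Hotel> batch = IndexBatch.Upload(hotels);
    try
    {
        await indexClient.Documents.IndexAsync(batch);
    }
    catch (IndexBatchException ex)
    {
        // A 207 response: at least one document in the batch failed to index
        Console.WriteLine("Failed to index some documents: {0}",
            string.Join(", ", ex.IndexingResults.Where(r => !r.Succeeded).Select(r => r.Key)));
    }
}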

Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:

  • The schema of your index
  • The size of your data

Because the optimal batch size is dependent on your index and your data, the best approach is to test different batch sizes to determine what results in the fastest indexing speeds for your scenario.

The following function demonstrates a simple approach to testing batch sizes.

public static async Task TestBatchSizes(ISearchIndexClient indexClient, int min = 100, int max = 1000, int step = 100, int numTries = 3)
{
    DataGenerator dg = new DataGenerator();

    Console.WriteLine("Batch Size \t Size in MB \t MB / Doc \t Time (ms) \t MB / Second");
    for (int numDocs = min; numDocs <= max; numDocs += step)
    {
        List<TimeSpan> durations = new List<TimeSpan>();
        double sizeInMb = 0.0;
        for (int x = 0; x < numTries; x++)
        {
            List<Hotel> hotels = dg.GetHotels(numDocs, "large");

            DateTime startTime = DateTime.Now;
            await UploadDocuments(indexClient, hotels);
            DateTime endTime = DateTime.Now;
            durations.Add(endTime - startTime);

            sizeInMb = EstimateObjectSize(hotels);
        }

        var avgDuration = durations.Average(timeSpan => timeSpan.TotalMilliseconds);
        var avgDurationInSeconds = avgDuration / 1000;
        var mbPerSecond = sizeInMb / avgDurationInSeconds;

        Console.WriteLine("{0} \t\t {1} \t\t {2} \t\t {3} \t {4}", numDocs, Math.Round(sizeInMb, 3), Math.Round(sizeInMb / numDocs, 3), Math.Round(avgDuration, 3), Math.Round(mbPerSecond, 3));

        // Pausing 2 seconds to let the search service catch its breath
        Thread.Sleep(2000);
    }
}

Because not all documents are the same size (although they are in this sample), we estimate the size of the data we're sending to the search service. We do this using the function below, which first converts the object to JSON and then determines its size in bytes. This technique allows us to determine which batch sizes are most efficient in terms of MB/s indexing speeds.

public static double EstimateObjectSize(object data)
{
    // Convert the data to JSON for more accurate sizing
    var json = JsonConvert.SerializeObject(data);

    // Serialize the JSON string to a byte array to determine the size of the data
    BinaryFormatter bf = new BinaryFormatter();
    MemoryStream ms = new MemoryStream();
    bf.Serialize(ms, json);
    byte[] bytes = ms.ToArray();

    // Convert from bytes to megabytes
    double sizeInMb = (double)bytes.Length / 1000000;

    return sizeInMb;
}

The function requires an ISearchIndexClient as well as the number of tries you'd like to test for each batch size. Because there may be some variability in indexing times for each batch, we try each batch three times by default to make the results more statistically significant.

await TestBatchSizes(indexClient, numTries: 3);

When you run the function, you should see output like the following in your console:

Output of test batch size function

Identify which batch size is most efficient, and then use that batch size in the next step of the tutorial. You may see a plateau in MB/s across different batch sizes.

5 - Index data

Now that we've identified the batch size we intend to use, the next step is to begin to index the data. To index data efficiently, this sample:

  • Uses multiple threads/workers.
  • Implements an exponential backoff retry strategy.

Use multiple threads/workers

To take full advantage of Azure Cognitive Search's indexing speeds, you'll likely need to use multiple threads to send batch indexing requests concurrently to the service.

Several of the key considerations mentioned above impact the optimal number of threads. You can modify this sample and test with different thread counts to determine the optimal thread count for your scenario. However, as long as you have several threads running concurrently, you should be able to take advantage of most of the efficiency gains.
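
As a simplified illustration of the pattern (the sample's ExponentialBackoff.cs handles this more completely), the following sketch keeps a fixed number of batch uploads in flight at once. The cap of eight concurrent requests and the reuse of the UploadDocuments helper are assumptions for this example:

// Split the hotels into batches and keep up to 8 uploads in flight at a time
List<Task> tasks = new List<Task>();
int batchSize = 1000;
for (int i = 0; i < hotels.Count; i += batchSize)
{
    List<Hotel> batch = hotels.GetRange(i, Math.Min(batchSize, hotels.Count - i));
    tasks.Add(UploadDocuments(indexClient, batch));

    if (tasks.Count >= 8)
    {
        // Wait for one upload to finish before starting another
        Task completed = await Task.WhenAny(tasks);
        tasks.Remove(completed);
    }
}
// Wait for the remaining uploads to finish
await Task.WhenAll(tasks);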

As you ramp up the requests hitting the search service, you may encounter HTTP status codes indicating that the request didn't fully succeed. During indexing, two common HTTP status codes are:

  • 503 Service Unavailable - This error means that the system is under heavy load and your request can't be processed at this time.
  • 207 Multi-Status - This error means that some documents succeeded, but at least one failed.

Implement an exponential backoff retry strategy

If a failure happens, requests should be retried using an exponential backoff retry strategy.

Azure Cognitive Search's .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as Polly can also be used to implement a retry strategy.

In this sample, we implement our own exponential backoff retry strategy. To implement this strategy, we start by defining some variables, including the maxRetryAttempts and the initial delay for a failed request:

// Create batch of documents for indexing
IndexBatch<Hotel> batch = IndexBatch.Upload(hotels);

// Define parameters for exponential backoff
int attempts = 0;
TimeSpan delay = TimeSpan.FromSeconds(2);
int maxRetryAttempts = 5;

It's important to catch IndexBatchException because these exceptions indicate that the indexing operation only partially succeeded (207s). Failed items should be retried using the FindFailedActionsToRetry method, which makes it easy to create a new batch containing only the failed items.

Exceptions other than IndexBatchException should also be caught; they indicate that the request failed completely. These exceptions are less common, particularly with the .NET SDK, because it retries 503s automatically.

// Implement exponential backoff
do
{
    try
    {
        attempts++;
        var response = await indexClient.Documents.IndexAsync(batch);
        break;
    }
    catch (IndexBatchException ex)
    {
        Console.WriteLine("[Attempt: {0} of {1} Failed] - Error: {2}", attempts, maxRetryAttempts, ex.Message);

        if (attempts == maxRetryAttempts)
            break;

        // Find the failed items and create a new batch to retry
        batch = ex.FindFailedActionsToRetry(batch, x => x.HotelId);
        Console.WriteLine("Retrying failed documents using exponential backoff...\n");

        await Task.Delay(delay);
        delay = delay * 2;
    }
    catch (Exception ex)
    {
        Console.WriteLine("[Attempt: {0} of {1} Failed] - Error: {2} \n", attempts, maxRetryAttempts, ex.Message);

        if (attempts == maxRetryAttempts)
            break;

        await Task.Delay(delay);
        delay = delay * 2;
    }
} while (true);

From here, we wrap the exponential backoff code into a function so that it can be easily called.

Another function is then created to manage the active threads. For simplicity, that function isn't included here, but you can find it in ExponentialBackoff.cs. The function can be called with the following command, where hotels is the data we want to upload, 1000 is the batch size, and 8 is the number of concurrent threads:

ExponentialBackoff.IndexData(indexClient, hotels, 1000, 8).Wait();

When you run the function, you should see output like the following:

Output of index data function

When a batch of documents fails, an error is printed out indicating the failure and that the batch is being retried:

Error from index data function

After the function finishes running, you can verify that all of the documents were added to the index.

6 - Explore the index

After the program has run, you can explore the populated search index programmatically or by using the Search explorer in the portal.

Programmatically

There are two main options for checking the number of documents in an index: the Count Documents API and the Get Index Statistics API. Both paths may need some additional time to update, so don't be alarmed if the number of documents returned is initially lower than you expect.

Count Documents

The Count Documents operation retrieves a count of the number of documents in a search index:

long indexDocCount = indexClient.Documents.Count();

Get Index Statistics

The Get Index Statistics operation returns a document count for the current index, plus storage usage. Index statistics take longer than the document count to update.

IndexGetStatisticsResult indexStats = serviceClient.Indexes.GetStatistics(configuration["SearchIndexName"]);

Azure portal

In the Azure portal, open the search service Overview page, and find the optimize-indexing index in the Indexes list.

List of Azure Cognitive Search indexes

The Document Count and Storage Size are based on the Get Index Statistics API and may take several minutes to update.

Reset and rerun

In the early experimental stages of development, the most practical approach for design iteration is to delete the objects from Azure Cognitive Search and allow your code to rebuild them. Resource names are unique. Deleting an object lets you recreate it using the same name.

The sample code for this tutorial checks for existing indexes and deletes them so that you can rerun your code.
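
That check might look like the following minimal sketch, assuming the same Indexes operations used by CreateIndex earlier; the sample's actual code may differ:

private static async Task DeleteIndexIfExists(string indexName, SearchServiceClient searchService)
{
    // Delete the index if it already exists so it can be rebuilt from scratch
    if (await searchService.Indexes.ExistsAsync(indexName))
    {
        await searchService.Indexes.DeleteAsync(indexName);
    }
}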

You can also use the portal to delete indexes.

Clean up resources

When you're working in your own subscription, it's a good idea at the end of a project to remove the resources that you no longer need. Resources left running can cost you money. You can delete resources individually, or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left navigation pane.

Next steps

Now that you're familiar with the concept of ingesting data efficiently, let's take a closer look at Lucene query architecture and how full-text search works in Azure Cognitive Search.