Monitor query requests in Azure Cognitive Search

This article explains how to measure query performance and volume using metrics and resource logging. It also explains how to collect the input terms used in queries, which is necessary information when you need to assess the utility and effectiveness of your search corpus.

Historical data that feeds into metrics is preserved for 30 days. For longer retention, or to report on operational data and query strings, be sure to enable a diagnostic setting that specifies a storage option for persisting logged events and metrics.

Conditions that maximize the integrity of data measurement include:

  • Use a billable service (a service created at either the Basic or a Standard tier). The free service is shared by multiple subscribers, which introduces a certain amount of volatility as loads shift.

  • Use a single replica and partition, if possible, to create a contained and isolated environment. If you use multiple replicas, query metrics are averaged across multiple nodes, which can lower the precision of results. Similarly, multiple partitions mean that data is divided, with the potential that some partitions might have different data if indexing is also underway. When tuning query performance, a single node and partition gives a more stable environment for testing.

Tip

With additional client-side code and Application Insights, you can also capture clickthrough data for deeper insight into what attracts the interest of your application users. For more information, see Search traffic analytics.

Query volume (QPS)

Volume is measured as Search Queries Per Second (QPS), a built-in metric that can be reported as an average, count, minimum, or maximum value of queries that execute within a one-minute window. The one-minute interval (TimeGrain = "PT1M") for metrics is fixed within the system.

It's common for queries to execute in milliseconds, so only queries that measure as seconds will appear in metrics.

Aggregation Type    Description
Average             The average number of seconds within a minute during which query execution occurred.
Count               The number of metrics emitted to the log within the one-minute interval.
Maximum             The highest number of search queries per second registered during a minute.
Minimum             The lowest number of search queries per second registered during a minute.
Sum                 The sum of all queries executed within the minute.

For example, within one minute, you might have a pattern like this: one second of high load that is the maximum for SearchQueriesPerSecond, followed by 58 seconds of average load, and finally one second with only one query, which is the minimum.

Another example: if a node emits 100 metrics, where the value of each metric is 40, then "Count" is 100, "Sum" is 4000, "Average" is 40, and "Max" is 40.
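
As a rough check on the second example, the aggregations can be reproduced in a few lines of Python. This is only a sketch: the `aggregate` helper and the list of samples are made up for illustration and are not part of any Azure SDK.

```python
from statistics import mean

def aggregate(samples):
    """Summarize metric values emitted within a single one-minute interval."""
    return {
        "Count": len(samples),
        "Sum": sum(samples),
        "Average": mean(samples),
        "Minimum": min(samples),
        "Maximum": max(samples),
    }

# Hypothetical input: one node emits 100 metrics, each with a value of 40.
result = aggregate([40] * 100)
print(result)
```

Because every sample has the same value, the average, minimum, and maximum all coincide at 40, while the count and sum reflect how many metrics were emitted.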

Query performance

Service-wide, query performance is measured as search latency (how long a query takes to complete) and throttled queries that were dropped as a result of resource contention.

Search latency

Aggregation Type    Latency
Average             Average query duration in milliseconds.
Count               The number of metrics emitted to the log within the one-minute interval.
Maximum             Longest running query in the sample.
Minimum             Shortest running query in the sample.
Total               Total execution time of all queries in the sample, executing within the interval (one minute).

Consider the following example of Search Latency metrics: 86 queries were sampled, with an average duration of 23.26 milliseconds. A minimum of 0 indicates some queries were dropped. The longest running query took 1000 milliseconds to complete. Total execution time was 2 seconds.

[Screenshot: Latency aggregations]
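
The figures in the Search Latency example fit together arithmetically: the total is roughly the average duration multiplied by the number of sampled queries. A minimal sketch of that check, using only the numbers quoted in the example (the variable names are illustrative):

```python
# Figures quoted in the Search Latency example.
query_count = 86
average_ms = 23.26

# Total execution time is the average duration multiplied by the sample size.
total_ms = query_count * average_ms
total_seconds = total_ms / 1000

print(f"{total_ms:.2f} ms is about {total_seconds:.0f} seconds")
```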

Throttled queries

Throttled queries are queries that are dropped instead of processed. In most cases, throttling is a normal part of running the service. It is not necessarily an indication that there is something wrong.

Throttling occurs when the number of requests currently being processed exceeds the available resources. You might see an increase in throttled requests when a replica is taken out of rotation or during indexing. Both query and indexing requests are handled by the same set of resources.

The service determines whether to drop requests based on resource consumption. The percentage of resources consumed across memory, CPU, and disk IO is averaged over a period of time. If this percentage exceeds a threshold, all requests to the index are throttled until the volume of requests is reduced.

Depending on your client, a throttled request can be indicated in these ways:

  • The service returns the error "You are sending too many requests. Please try again later."
  • The service returns a 503 error code indicating the service is currently unavailable.
  • If you are using the portal (for example, Search Explorer), the query is dropped silently and you will need to click Search again.
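
A client that sees a 503 response can retry after a pause rather than fail immediately. The following is a minimal sketch of exponential backoff, not a prescribed pattern for the service. `send_with_retry`, its parameters, and the `send_query` callable (assumed to return a status code and a body) are hypothetical names introduced here for illustration.

```python
import time

def send_with_retry(send_query, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send_query() and retry with exponential backoff while it returns 503.

    send_query is a hypothetical callable that returns (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send_query()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    # Still throttled after the last attempt: hand the 503 back to the caller.
    return status, body

# Stand-in for a real client call: throttled twice, then succeeds.
responses = iter([(503, None), (503, None), (200, {"value": []})])
delays = []
status, body = send_with_retry(lambda: next(responses), sleep=delays.append)
print(status, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In production code, the default `time.sleep` would apply the delays.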

To confirm throttled queries, use the Throttled search queries metric. You can explore metrics in the portal or create an alert metric as described in this article. For queries that were dropped within the sampling interval, use Total to get the percentage of queries that did not execute.

Aggregation Type    Throttling
Average             Percentage of queries dropped within the interval.
Count               The number of metrics emitted to the log within the one-minute interval.
Maximum             Percentage of queries dropped within the interval.
Minimum             Percentage of queries dropped within the interval.
Total               Percentage of queries dropped within the interval.

For Throttled Search Queries Percentage, the minimum, maximum, average, and total all have the same value: the percentage of search queries that were throttled, out of the total number of search queries during one minute.
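
The percentage described here is a simple ratio. As an illustration only, with a hypothetical helper name and made-up counts:

```python
def throttled_percentage(throttled, total):
    """Return the share of queries dropped in one interval, as a percentage."""
    if total == 0:
        return 0.0  # No queries ran, so nothing was throttled.
    return 100.0 * throttled / total

# Hypothetical minute: 3 of 120 queries were dropped.
print(throttled_percentage(3, 120))
```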

In the following screenshot, the first number is the count (or number of metrics sent to the log). Additional aggregations, which appear at the top or when hovering over the metric, include average, maximum, and total. In this sample, no requests were dropped.

[Screenshot: Throttled aggregations]

Explore metrics in the portal

For a quick look at the current numbers, the Monitoring tab on the service Overview page shows three metrics (Search latency, Search queries per second (per search unit), Throttled Search Queries Percentage) over fixed intervals measured in hours, days, and weeks, with the option of changing the aggregation type.

For deeper exploration, open metrics explorer from the Monitoring menu so that you can layer, zoom in, and visualize data to explore trends or anomalies. Learn more about metrics explorer by completing this tutorial on creating a metrics chart.

  1. Under the Monitoring section, select Metrics to open the metrics explorer with the scope set to your search service.

  2. Under Metric, choose one from the dropdown list and review the list of available aggregations for a preferred type. The aggregation defines how the collected values will be sampled over each time interval.

    [Screenshot: Metrics explorer for QPS metric]

  3. In the top-right corner, set the time interval.

  4. Choose a visualization. The default is a line chart.

  5. Layer additional aggregations by choosing Add metric and selecting different aggregations.

  6. Zoom into an area of interest on the line chart. Put the mouse pointer at the beginning of the area, click and hold the left mouse button, drag to the other side of the area, and release the button. The chart will zoom in on that time range.

Identify strings used in queries

When you enable resource logging, the system captures query requests in the AzureDiagnostics table. As a prerequisite, you must have already enabled resource logging, specifying a Log Analytics workspace or another storage option.

  1. Under the Monitoring section, select Logs to open up an empty query window in Log Analytics.

  2. Run the following expression to search Query.Search operations, returning a tabular result set consisting of the operation name, query string, the index queried, and the number of documents found. The last two statements exclude empty or unspecified searches and queries over a sample index, which cuts down the noise in your results.

    AzureDiagnostics
    | project OperationName, Query_s, IndexName_s, Documents_d
    | where OperationName == "Query.Search"
    | where Query_s != "?api-version=2020-06-30&search=*"
    | where IndexName_s != "realestate-us-sample-index"
    
  3. Optionally, set a Column filter on Query_s to search over a specific syntax or string. For example, you could filter over is equal to ?api-version=2020-06-30&search=*&%24filter=HotelName.

    [Screenshot: Logged query strings]

While this technique works for ad hoc investigation, building a report lets you consolidate and present the query strings in a layout more conducive to analysis.

Identify long-running queries

Add the duration column to get the numbers for all queries, not just those that are picked up as a metric. Sorting this data shows you which queries take the longest to complete.

  1. Under the Monitoring section, select Logs to query for log information.

  2. Run the following query to return queries, sorted by duration in milliseconds. The longest-running queries are at the top.

    AzureDiagnostics
    | project OperationName, resultSignature_d, DurationMs, Query_s, Documents_d, IndexName_s
    | where OperationName == "Query.Search"
    | sort by DurationMs
    

    [Screenshot: Sort queries by duration]
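
If the result set is exported for offline analysis, the same ordering can be applied in client code. The sketch below assumes the rows have already been loaded as dictionaries keyed by the column names used in the query. The sample rows, the `slowest` helper, and the 40 ms threshold are invented for illustration.

```python
# Hypothetical rows, shaped like the columns returned by the log query.
rows = [
    {"Query_s": "?api-version=2020-06-30&search=beach", "DurationMs": 18},
    {"Query_s": "?api-version=2020-06-30&search=spa", "DurationMs": 950},
    {"Query_s": "?api-version=2020-06-30&search=pool", "DurationMs": 42},
]

def slowest(rows, threshold_ms=0):
    """Return rows at or above threshold_ms, longest-running first."""
    kept = [r for r in rows if r["DurationMs"] >= threshold_ms]
    return sorted(kept, key=lambda r: r["DurationMs"], reverse=True)

for row in slowest(rows, threshold_ms=40):
    print(row["DurationMs"], row["Query_s"])
```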

Create a metric alert

A metric alert establishes a threshold at which you will either receive a notification or trigger a corrective action that you define in advance.

For a search service, it's common to create a metric alert for search latency and throttled queries. If you know when queries are dropped, you can look for remedies that reduce load or increase capacity. For example, if throttled queries increase during indexing, you could postpone indexing until query activity subsides.

When pushing the limits of a particular replica-partition configuration, setting up alerts for query volume thresholds (QPS) is also helpful.

  1. Under the Monitoring section, select Alerts and then click + New alert rule. Make sure your search service is selected as the resource.

  2. Under Condition, click Add.

  3. Configure signal logic. For signal type, choose Metrics and then select the signal.

  4. After selecting the signal, you can use a chart to visualize historical data for an informed decision on how to proceed with setting up conditions.

  5. Next, scroll down to Alert logic. For proof-of-concept, you could specify an artificially low value for testing purposes.

    [Screenshot: Alert logic]

  6. Next, specify or create an Action Group. This is the response to invoke when the threshold is met. It might be a push notification or an automated response.

  7. Last, specify Alert details. Name and describe the alert, assign a severity value, and specify whether to create the rule in an enabled or disabled state.

    [Screenshot: Alert details]

If you specified an email notification, you will receive an email from "Microsoft Azure" with a subject line of "Azure: Activated Severity: 3 <your rule name>".

Next steps

If you haven't done so already, review the fundamentals of search service monitoring to learn about the full range of oversight capabilities.