排查使用 Azure Cosmos DB API for MongoDB 时遇到的查询问题Troubleshoot query issues when using the Azure Cosmos DB API for MongoDB

适用于: Azure Cosmos DB API for MongoDB

本文逐步说明排查 Azure Cosmos DB 中的查询问题的一般建议方法。This article walks through a general recommended approach for troubleshooting queries in Azure Cosmos DB. 虽然不应将本文中所述的步骤视为针对潜在查询问题的完全防御方法,但我们在其中包含了最常见的性能提示。Although you shouldn't consider the steps outlined in this article a complete defense against potential query issues, we've included the most common performance tips here. 你应该从本文着手,排查 Azure Cosmos DB API for MongoDB 中速度较慢或成本昂贵的查询。You should use this article as a starting place for troubleshooting slow or expensive queries in Azure Cosmos DB's API for MongoDB. 如果使用的是 Azure Cosmos DB 核心 (SQL) API,请参阅 SQL API 查询故障排除指南一文。If you are using the Azure Cosmos DB core (SQL) API, see the SQL API query troubleshooting guide article.

Azure Cosmos DB 中的查询优化大致分为以下类别:Query optimizations in Azure Cosmos DB are broadly categorized as follows:

  • 可降低查询请求单位 (RU) 费用的优化Optimizations that reduce the Request Unit (RU) charge of the query
  • 仅降低延迟的优化Optimizations that just reduce latency

如果降低查询的 RU 费用,通常还会降低延迟。If you reduce the RU charge of a query, you'll typically decrease latency as well.

本文提供可使用 nutrition 数据集重新创建的示例。This article provides examples that you can re-create by using the nutrition dataset.

备注

本文假设你使用的是 3.6 版的 Azure Cosmos DB API for MongoDB。This article assumes you are using version 3.6 of Azure Cosmos DB"s API for MongoDB. 在版本 3.2 中性能较差的某些查询在版本 3.6 中有了显著改进。Some queries that perform poorly in version 3.2 have significant improvements in version 3.6. 可以通过提交支持请求升级到版本 3.6。Upgrade to version 3.6 by filing a support request.

使用 $explain 命令获取指标Use $explain command to get metrics

在 Azure Cosmos DB 中优化查询时,第一步始终是获取查询的 RU 费用When you optimize a query in Azure Cosmos DB, the first step is always to obtain the RU charge for your query. 对于费用高于 50 RU 的查询,你应当始终想方设法降低 RU 费用,这是一个粗略的指导原则。As a rough guideline, you should explore ways to lower the RU charge for queries with charges greater than 50 RUs.

除了获取 RU 费用之外,还应使用 $explain 命令来获取查询和索引使用指标。In addition to obtaining the RU charge, you should use the $explain command to obtain the query and index usage metrics. 以下示例运行查询并使用 $explain 命令来显示查询和索引使用指标:Here is an example that runs a query and uses the $explain command to show query and index usage metrics:

$explain 命令:$explain command:

db.coll.find({foodGroup: "Baby Foods"}).explain({"executionStatistics": true })

输出:Output:

{
    "stages" : [ 
        {
            "stage" : "$query",
            "timeInclusiveMS" : 905.2888,
            "timeExclusiveMS" : 905.2888,
            "in" : 362,
            "out" : 362,
            "details" : {
                "database" : "db-test",
                "collection" : "collection-test",
                "query" : {
                    "foodGroup" : {
                        "$eq" : "Baby Foods"
                    }
                },
                "pathsIndexed" : [],
                "pathsNotIndexed" : [ 
                    "foodGroup"
                ],
                "shardInformation" : [ 
                    {
                        "activityId" : "e68e6bdd-5e89-4ec5-b053-3dbbc2428140",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 788.5867,
                        "preemptions" : 1,
                        "outputDocumentCount" : 362,
                        "retrievedDocumentCount" : 8618
                    }
                ],
                "queryMetrics" : {
                    "retrievedDocumentCount" : 8618,
                    "retrievedDocumentSizeBytes" : 104963042,
                    "outputDocumentCount" : 362,
                    "outputDocumentSizeBytes" : 2553535,
                    "indexHitRatio" : 0.0016802042237178,
                    "totalQueryExecutionTimeMS" : 777.72,
                    "queryPreparationTimes" : {
                        "queryCompilationTimeMS" : 0.19,
                        "logicalPlanBuildTimeMS" : 0.14,
                        "physicalPlanBuildTimeMS" : 0.09,
                        "queryOptimizationTimeMS" : 0.03
                    },
                    "indexLookupTimeMS" : 0,
                    "documentLoadTimeMS" : 687.22,
                    "vmExecutionTimeMS" : 774.09,
                    "runtimeExecutionTimes" : {
                        "queryEngineExecutionTimeMS" : 37.45,
                        "systemFunctionExecutionTimeMS" : 10.82,
                        "userDefinedFunctionExecutionTimeMS" : 0
                    },
                    "documentWriteTimeMS" : 49.42
                }
            }
        }
    ],
    "estimatedDelayFromRateLimitingInMilliseconds" : 0.0,
    "continuation" : {
        "hasMore" : false
    },
    "ok" : 1.0
}

$explain 命令输出很长,包含有关查询执行情况的详细信息。The $explain command output is lengthy and has detailed information about query execution. 但是一般情况下,在优化查询性能时应重点关注几个部分:In general, though, there are a few sections where you should focus when optimizing query performance:

指标Metric 说明Description
timeInclusiveMS 后端查询延迟Backend query latency
pathsIndexed 显示查询已使用的索引Shows indexes that the query used
pathsNotIndexed 显示查询可能已使用的索引(如果有)Shows indexes that the query could have used, if available
shardInformation 特定物理分区的查询性能摘要Summary of query performance for a particular physical partition
retrievedDocumentCount 查询引擎加载的文档数Number of documents loaded by the query engine
outputDocumentCount 查询结果中返回的文档数Number of documents returned in the query results
estimatedDelayFromRateLimitingInMilliseconds 由于速率限制而导致的额外查询延迟估计值Estimated additional query latency due to rate limiting

获取查询指标后,请将 retrievedDocumentCount 与你的查询的 outputDocumentCount 进行比较。After you get the query metrics, compare the retrievedDocumentCount with the outputDocumentCount for your query. 使用这种比较可以确定要在本文中查看的相关部分。Use this comparison to identify the relevant sections to review in this article. retrievedDocumentCount 是查询引擎需要加载的文档数。The retrievedDocumentCount is the number of documents that the query engine needs to load. outputDocumentCount 是查询结果所需的文档数。The outputDocumentCount is the number of documents that were needed for the results of the query. 如果 retrievedDocumentCount 明显大于 outputDocumentCount,则表明查询中至少有一个部分无法使用索引并需要执行扫描。If the retrievedDocumentCount is significantly higher than the outputDocumentCount, there was at least one part of your query that was unable to use an index and needed to do a scan.

请参阅以下部分,了解适用于你的方案的相关查询优化。Refer to the following sections to understand the relevant query optimizations for your scenario.

查询的 RU 费用过高Query's RU charge is too high

“已检索文档计数”明显大于“输出文档计数”Retrieved Document Count is significantly higher than Output Document Count

“已检索文档计数”约等于“输出文档计数”Retrieved Document Count is approximately equal to Output Document Count

查询的 RU 费用可接受,但延迟仍然过高Query's RU charge is acceptable but latency is still too high

“已检索文档计数”超过“输出文档计数”的查询Queries where retrieved document count exceeds output document count

retrievedDocumentCount 是查询引擎需要加载的文档数。The retrievedDocumentCount is the number of documents that the query engine needed to load. outputDocumentCount 是查询返回的文档数。The outputDocumentCount is the number of documents returned by the query. 如果 retrievedDocumentCount 明显大于 outputDocumentCount,则表明查询中至少有一个部分无法使用索引并需要执行扫描。If the retrievedDocumentCount is significantly higher than the outputDocumentCount, there was at least one part of your query that was unable to use an index and needed to do a scan.

下面是一个不完全由索引提供服务的扫描查询示例:Here's an example of scan query that wasn't entirely served by the index:

$explain 命令:$explain command:

db.coll.find(
  {
    $and : [
            { "foodGroup" : "Cereal Grains and Pasta"}, 
            { "description" : "Oat bran, cooked"}
        ]
  }
).explain({"executionStatistics": true })

输出:Output:

{
    "stages" : [ 
        {
            "stage" : "$query",
            "timeInclusiveMS" : 436.5716,
            "timeExclusiveMS" : 436.5716,
            "in" : 1,
            "out" : 1,
            "details" : {
                "database" : "db-test",
                "collection" : "indexing-test",
                "query" : {
                    "$and" : [ 
                        {
                            "foodGroup" : {
                                "$eq" : "Cereal Grains and Pasta"
                            }
                        }, 
                        {
                            "description" : {
                                "$eq" : "Oat bran, cooked"
                            }
                        }
                    ]
                },
                "pathsIndexed" : [],
                "pathsNotIndexed" : [ 
                    "foodGroup", 
                    "description"
                ],
                "shardInformation" : [ 
                    {
                        "activityId" : "13a5977e-a10a-4329-b68e-87e4f0081cac",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 435.4867,
                        "preemptions" : 1,
                        "outputDocumentCount" : 1,
                        "retrievedDocumentCount" : 8618
                    }
                ],
                "queryMetrics" : {
                    "retrievedDocumentCount" : 8618,
                    "retrievedDocumentSizeBytes" : 104963042,
                    "outputDocumentCount" : 1,
                    "outputDocumentSizeBytes" : 6064,
                    "indexHitRatio" : 0.0,
                    "totalQueryExecutionTimeMS" : 433.64,
                    "queryPreparationTimes" : {
                        "queryCompilationTimeMS" : 0.12,
                        "logicalPlanBuildTimeMS" : 0.09,
                        "physicalPlanBuildTimeMS" : 0.1,
                        "queryOptimizationTimeMS" : 0.02
                    },
                    "indexLookupTimeMS" : 0,
                    "documentLoadTimeMS" : 387.44,
                    "vmExecutionTimeMS" : 432.93,
                    "runtimeExecutionTimes" : {
                        "queryEngineExecutionTimeMS" : 45.36,
                        "systemFunctionExecutionTimeMS" : 16.86,
                        "userDefinedFunctionExecutionTimeMS" : 0
                    },
                    "documentWriteTimeMS" : 0.13
                }
            }
        }
    ],
    "estimatedDelayFromRateLimitingInMilliseconds" : 0.0,
    "continuation" : {
        "hasMore" : false
    },
    "ok" : 1.0
}

retrievedDocumentCount (8618) 明显高于 outputDocumentCount (1),表示此查询需要文档扫描。The retrievedDocumentCount (8618) is significantly higher than the outputDocumentCount (1), implying that this query required a document scan.

包括必要的索引Include necessary indexes

你应检查 pathsNotIndexed 数组并添加这些索引。You should check the pathsNotIndexed array and add these indexes. 在此示例中,应当为路径 foodGroupdescription 编制索引。In this example, the paths foodGroup and description should be indexed.

"pathsNotIndexed" : [ 
                    "foodGroup", 
                    "description"
                ]

Azure Cosmos DB API for MongoDB 中的索引编制最佳做法不同于 MongoDB。Indexing best practices in Azure Cosmos DB's API for MongoDB are different from MongoDB. 在 Azure Cosmos DB API for MongoDB 中,复合索引仅用于需要按多个属性有效排序的查询中。In Azure Cosmos DB's API for MongoDB, compound indexes are only used in queries that need to efficiently sort by multiple properties. 如果你的查询具有基于多个属性的筛选器,你应当为这些属性中的每一个创建单字段索引。If you have queries with filters on multiple properties, you should create single field indexes for each of these properties. 查询谓词可以使用多个单字段索引。Query predicates can use multiple single field indexes.

通配符索引可以简化索引编制。Wildcard indexes can simplify indexing. 与在 MongoDB 中不同,通配符索引可以在查询谓词中支持多个字段。Unlike in MongoDB, wildcard indexes can support multiple fields in query predicates. 如果使用一个通配符索引,而不是为每个属性创建单独的索引,查询性能不会有差异。There will not be a difference in query performance if you use one single wildcard index instead of creating a separate index for each property. 为所有属性添加通配符索引是优化所有查询的最简单方法。Adding a wildcard index for all properties is the easiest way to optimize all of your queries.

你可以随时添加新索引,而不会影响写入或读取可用性。You can add new indexes at any time, with no effect on write or read availability. 你可以跟踪索引转换进度You can track index transformation progress.

了解哪些聚合操作使用索引Understand which aggregation operations use the index

大多数情况下,Azure Cosmos DB API for MongoDB 中的聚合操作会部分使用索引。In most cases, aggregation operations in Azure Cosmos DB's API for MongoDB will partially use indexes. 通常情况下,查询引擎会首先应用等式和范围筛选器,并使用索引。Typically, the query engine will apply equality and range filters first and use indexes. 应用这些筛选器后,查询引擎可以评估其他筛选器,并根据需要加载其余文档以计算聚合。After applying these filters, the query engine can evaluate additional filters and resort to loading remaining documents to compute the aggregate, if needed.

下面是一个示例:Here's an example:

db.coll.aggregate( [
   { $match: { foodGroup: 'Fruits and Fruit Juices' } },
   {
     $group: {
        _id: "$foodGroup",
        total: { $max: "$version" }
     }
   }
] )

在这种情况下,索引可以优化 $match 阶段。In this case, indexes can optimize the $match stage. foodGroup 添加索引将极大地提高查询性能。Adding an index for foodGroup will significantly improve query performance. 与在 MongoDB 中一样,应尽早在聚合管道中放置 $match,以最大程度地利用索引。Like in MongoDB, you should place $match as early in the aggregation pipeline as possible to maximize usage of indexes.

在 Azure Cosmos DB API for MongoDB 中,索引不用于实际聚合(在本例中为 $max)。In Azure Cosmos DB's API for MongoDB, indexes are not used for the actual aggregation, which in this case is $max. version 上添加索引不会改进查询性能。Adding an index on version will not improve query performance.

“已检索文档计数”等于“输出文档计数”的查询Queries where retrieved document count is equal to Output Document Count

如果 retrievedDocumentCount 约等于 outputDocumentCount,则查询引擎无需扫描许多不必要的文档。If the retrievedDocumentCount is approximately equal to the outputDocumentCount, the query engine didn't have to scan many unnecessary documents.

最大限度地减少跨分区查询Minimize cross partition queries

当请求单位和数据存储需求增加时,Azure Cosmos DB 将使用分区来扩展单个容器。Azure Cosmos DB uses partitioning to scale individual containers as Request Unit and data storage needs increase. 每个物理分区具有不同的独立索引。Each physical partition has a separate and independent index. 如果查询具有匹配容器分区键的相等性筛选器,则你只需检查相关分区的索引。If your query has an equality filter that matches your container's partition key, you'll need to check only the relevant partition's index. 这种优化可以减少查询所需的 RU 总数。This optimization reduces the total number of RUs that the query requires. 详细了解分区内查询与跨分区查询之间的差异Learn more about the differences between in-partition queries and cross-partition queries.

如果有大量预配的 RU(超过 30,000)或存储了大量数据(约超过 100 GB),那么你可能需要有一个足够大的容器,以查看查询 RU 费用的显著降低。If you have a large number of provisioned RUs (more than 30,000) or a large amount of data stored (more than approximately 100 GB), you probably have a large enough container to see a significant reduction in query RU charges.

你可以检查 shardInformation 数组,以了解每个单独物理分区的查询指标。You can check the shardInformation array to understand the query metrics for each individual physical partition. 唯一 shardKeyRangeId 值的数目是需要在其中执行查询的物理分区的数目。The number of unique shardKeyRangeId values is the number of physical partitions where the query needed to be executed. 在此示例中,查询是在四个物理分区上执行的。In this example, the query was executed on four physical partitions. 请务必明白,执行与索引利用完全无关。It's important to understand that execution is completely independent from index utilization. 换句话说,跨分区查询仍可使用索引。In other words, cross-partition queries can still use indexes.

  "shardInformation" : [ 
                    {
                        "activityId" : "42f670a8-a201-4c58-8023-363ac18d9e18",
                        "shardKeyRangeId" : "5",
                        "durationMS" : 24.3859,
                        "preemptions" : 1,
                        "outputDocumentCount" : 463,
                        "retrievedDocumentCount" : 463
                    }, 
                    {
                        "activityId" : "a8bf762a-37b9-4c07-8ed4-ae49961373c0",
                        "shardKeyRangeId" : "2",
                        "durationMS" : 35.8328,
                        "preemptions" : 1,
                        "outputDocumentCount" : 905,
                        "retrievedDocumentCount" : 905
                    }, 
                    {
                        "activityId" : "3754e36b-4258-49a6-8d4d-010555628395",
                        "shardKeyRangeId" : "1",
                        "durationMS" : 67.3969,
                        "preemptions" : 1,
                        "outputDocumentCount" : 1479,
                        "retrievedDocumentCount" : 1479
                    }, 
                    {
                        "activityId" : "a69a44ee-db97-4fe9-b489-3791f3d52878",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 185.1523,
                        "preemptions" : 1,
                        "outputDocumentCount" : 867,
                        "retrievedDocumentCount" : 867
                    }
                ]

可降低查询延迟的优化Optimizations that reduce query latency

在许多情况下,当查询延迟仍然过高时,RU 费用是可接受的。In many cases, the RU charge might be acceptable when query latency is still too high. 以下部分概述了降低查询延迟的提示。The following sections give an overview of tips for reducing query latency. 如果对同一个数据集多次运行同一个查询,该查询通常每次都会产生相同的 RU 开销。If you run the same query multiple times on the same dataset, it will typically have the same RU charge each time. 但是,每次执行查询时,查询延迟可能各不相同。But query latency might vary between query executions.

提高邻近度Improve proximity

在非 Azure Cosmos DB 帐户的区域中运行的查询,比在同一区域中运行的查询的延迟更高。Queries that are run from a different region than the Azure Cosmos DB account will have higher latency than if they were run inside the same region. 例如,如果在台式机上运行代码,则延迟比从 Azure Cosmos DB 所在的同一 Azure 区域中的某个虚拟机上运行查询要高出几十或几百毫秒(或更高)。For example, if you're running code on your desktop computer, you should expect latency to be tens or hundreds of milliseconds higher (or more) than if the query came from a virtual machine within the same Azure region as Azure Cosmos DB. 可以轻松地在 Azure Cosmos DB 中以多区域方式分布数据,以确保将数据放在更靠近应用的位置。It's simple to multiple-regionally distribute data in Azure Cosmos DB to ensure you can bring your data closer to your app.

增大预配吞吐量Increase provisioned throughput

在 Azure Cosmos DB 中,预配的吞吐量以请求单位 (RU) 计量。In Azure Cosmos DB, your provisioned throughput is measured in Request Units (RUs). 假设某个查询消耗 5 RU 吞吐量。Imagine you have a query that consumes 5 RUs of throughput. 如果预配 1000 RU,则每秒可以运行该查询 200 次。For example, if you provision 1,000 RUs, you would be able to run that query 200 times per second. 如果在没有足够的可用吞吐量的情况下尝试运行查询,Azure Cosmos DB 会对请求进行速率限制。If you tried to run the query when there wasn't enough throughput available, Azure Cosmos DB will rate limit the requests. Azure Cosmos DB API for MongoDB 在等待一小段时间后会自动重试该查询。Azure Cosmos DB's API for MongoDB will automatically retry this query after waiting for a short time. 受限制的请求需要花费更长的时间,因此增加预配的吞吐量可以改进查询延迟。Throttled requests take longer, so increasing provisioned throughput can improve query latency.

如果提高吞吐量,则 estimatedDelayFromRateLimitingInMilliseconds 值具有潜在的延迟优势。The value estimatedDelayFromRateLimitingInMilliseconds gives a sense of the potential latency benefits if you increase throughput.

后续步骤Next steps