Monitor and debug with metrics in Azure Cosmos DB

Azure Cosmos DB provides metrics for throughput, storage, consistency, availability, and latency. The Azure portal provides an aggregated view of these metrics. For more granular metrics, both the client SDK and the diagnostic logs are available.

This article walks through common use cases and shows how Azure Cosmos DB metrics can be used to analyze and debug the underlying issues. Metrics are collected every five minutes and are kept for seven days.

View metrics from the Azure portal

  1. Sign in to the Azure portal.

  2. Open the Metrics pane. By default, the Metrics pane shows the storage, index, and request unit metrics for all the databases in your Azure Cosmos account. You can filter these metrics by database, container, or region, and you can view them at a specific time granularity. More details on the throughput, storage, availability, latency, and consistency metrics are provided on separate tabs.

    [Figure: Cosmos DB performance metrics in the Azure portal]

The following metrics are available from the Metrics pane:

  • Throughput metrics - This metric shows the number of requests that were consumed, or that failed (429 response code) because the throughput or storage capacity provisioned for the container was exceeded.

  • Storage metrics - This metric shows the size of data and index usage.

  • Availability metrics - This metric shows the percentage of successful requests over the total requests per hour. The success rate is defined by the Azure Cosmos DB SLAs.

  • Latency metrics - This metric shows the read and write latency observed by Azure Cosmos DB in the region where your account is operating. You can visualize latency across regions for a geo-replicated account. This metric doesn't represent the end-to-end request latency.

  • Consistency metrics - This metric shows how eventual the consistency is for the consistency model you choose. For multi-region accounts, it also shows the replication latency between the regions you have selected.

  • System metrics - This metric shows how many metadata requests are served by the master partition. It also helps to identify throttled requests.

The following sections explain common scenarios where you can use Azure Cosmos DB metrics.

Understand how many requests are succeeding or causing errors

To get started, go to the Azure portal and navigate to the Metrics blade. In the blade, find the "Number of requests exceeded capacity per 1 minute" chart. This chart shows the minute-by-minute total of requests, segmented by status code. For more information about HTTP status codes, see HTTP status codes for Azure Cosmos DB.

The most common error status code is 429 (rate limiting/throttling). This error means that the requests sent to Azure Cosmos DB exceed the provisioned throughput. The most common solution to this problem is to scale up the RUs for the given collection.

[Figure: Number of requests per minute]
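The per-status-code breakdown in this chart can be summarized as a single throttling rate. The following is a minimal sketch, not part of the Azure Cosmos DB SDK: the `RequestStats` class and the `ThrottledFraction` method are hypothetical names, and the sketch assumes you have already collected the request counts for one time bucket, keyed by HTTP status code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an SDK type): summarizes one time bucket of
// request counts keyed by HTTP status code.
public static class RequestStats
{
    // Returns the fraction of requests in the bucket that were
    // rate limited (HTTP 429). Returns 0 when the bucket is empty.
    public static double ThrottledFraction(IReadOnlyDictionary<int, long> countsByStatus)
    {
        long total = countsByStatus.Values.Sum();
        if (total == 0)
        {
            return 0.0;
        }

        // TryGetValue leaves throttled at 0 when no 429 responses were recorded.
        countsByStatus.TryGetValue(429, out long throttled);
        return (double)throttled / total;
    }
}
```

For example, a bucket with 900 responses with status 200 and 100 responses with status 429 gives a throttled fraction of 0.1. A fraction that stays well above zero over many buckets is one signal that the provisioned RUs may be too low for the workload.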

Determine the throughput distribution across partitions

Having a good cardinality of your partition keys is essential for any scalable application. To determine the throughput distribution of any partitioned container, broken down by partition, navigate to the Metrics blade in the Azure portal. In the Throughput tab, the throughput breakdown is shown in the "Max consumed RU/second by each physical partition" chart. The following graphic illustrates a poor distribution, as shown by the skewed partition on the far left.

[Figure: A single partition with high usage at 3:05 PM]

An uneven throughput distribution may cause hot partitions, which can result in throttled requests and may require repartitioning. For more information about partitioning in Azure Cosmos DB, see Partition and scale in Azure Cosmos DB.
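One way to put a number on the skew visible in the chart is to compare the busiest partition against the average. The sketch below is illustrative only: `PartitionSkew` and `HotPartitionRatio` are hypothetical names, and the input is assumed to be the per-partition maximum RU/second values read off the chart or exported by your own tooling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an SDK type): quantifies throughput skew.
public static class PartitionSkew
{
    // Ratio of the busiest partition's RU/second to the mean across all
    // partitions. 1.0 means a perfectly even distribution; larger values
    // mean one partition carries a disproportionate share of the load.
    public static double HotPartitionRatio(IReadOnlyList<double> maxRuPerPartition)
    {
        if (maxRuPerPartition.Count == 0)
        {
            throw new ArgumentException("At least one partition is required.", nameof(maxRuPerPartition));
        }

        double mean = maxRuPerPartition.Average();
        if (mean == 0)
        {
            return 1.0; // no load anywhere, so nothing is skewed
        }

        return maxRuPerPartition.Max() / mean;
    }
}
```

For example, four partitions consuming 700, 100, 100, and 100 RU/second have a mean of 250, so the ratio is 2.8. What ratio counts as a hot partition depends on the workload; this sketch does not suggest a threshold.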

Determine the storage distribution across partitions

Having a good cardinality of your partition keys is essential for any scalable application. To determine the storage distribution of any partitioned container, broken down by partition, go to the Metrics blade in the Azure portal. In the Storage tab, the storage breakdown is shown in the "Data + Index storage consumed by top partition keys" chart. The following graphic illustrates a poor distribution of data storage, as shown by the skewed partition on the far left.

[Figure: Example of poor data distribution]

You can identify which partition key is skewing the distribution by clicking the partition in the chart.

[Figure: Partition key skewing the distribution]

After identifying which partition key is causing the skew in distribution, you may have to repartition your container with a more evenly distributed partition key. For more information about partitioning in Azure Cosmos DB, see Partition and scale in Azure Cosmos DB.

Compare data size against index size

In Azure Cosmos DB, the total consumed storage is the sum of the data size and the index size. Typically, the index size is a fraction of the data size. In the Metrics blade in the Azure portal, the Storage tab shows the breakdown of storage consumption by data and index.

// Measure the document size usage (which includes the index size).
// Assumes `client` is an initialized DocumentClient.
ResourceResponse<DocumentCollection> collectionInfo = await client.ReadDocumentCollectionAsync(
    UriFactory.CreateDocumentCollectionUri("db", "coll"));
Console.WriteLine("Document size quota: {0}, usage: {1}", collectionInfo.DocumentQuota, collectionInfo.DocumentUsage);

If you would like to conserve index space, you can adjust the indexing policy.

Debug why queries are running slow

In the SQL API SDKs, Azure Cosmos DB provides query execution statistics.

// Assumes `client` is an initialized DocumentClient.
IDocumentQuery<dynamic> query = client.CreateDocumentQuery(
    UriFactory.CreateDocumentCollectionUri(DatabaseName, CollectionName),
    "SELECT * FROM c WHERE c.city = 'Seattle'",
    new FeedOptions
    {
        PopulateQueryMetrics = true,      // request query execution statistics
        MaxItemCount = -1,
        MaxDegreeOfParallelism = -1,
        EnableCrossPartitionQuery = true
    }).AsDocumentQuery();

// Fetches the first page of results only
FeedResponse<dynamic> result = await query.ExecuteNextAsync();

// Returns metrics by partition key range Id
IReadOnlyDictionary<string, QueryMetrics> metrics = result.QueryMetrics;

QueryMetrics provides details on how long each component of the query took to execute. The most common root cause of long-running queries is scans, meaning the query was unable to use the indexes. This problem can be resolved with a better filter condition.
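A rough way to spot a scan is to compare how many documents the query engine loaded with how many the query returned. The sketch below is a heuristic under stated assumptions, not an SDK feature: `ScanHeuristic`, `LooksLikeScan`, and the default threshold of 10 are hypothetical, and the two counts are assumed to have been read from the query metrics by the caller (for example, from the retrieved and output document counts reported per partition key range).

```csharp
using System;

// Hypothetical helper (not an SDK type): flags queries whose metrics
// suggest that many more documents were loaded than were returned.
public static class ScanHeuristic
{
    // Returns true when the ratio of retrieved to output documents is at
    // least `threshold`. The threshold is an assumption; tune it for your
    // workload. Aggregate queries can return very few documents even when
    // the index is used, so treat a positive result as a hint only.
    public static bool LooksLikeScan(long retrievedDocumentCount, long outputDocumentCount, double threshold = 10.0)
    {
        if (retrievedDocumentCount <= 0)
        {
            return false; // nothing was loaded, so nothing was scanned
        }

        if (outputDocumentCount <= 0)
        {
            return true; // documents were loaded but none were returned
        }

        return (double)retrievedDocumentCount / outputDocumentCount >= threshold;
    }
}
```

For example, a query that loads 10,000 documents to return 10 is flagged, while a query that loads 10 documents to return 10 is not. A flagged query is a candidate for a more selective filter or a change to the indexing policy.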

Next steps

You've now learned how to monitor and debug issues using the metrics provided in the Azure portal. You may want to learn more about improving database performance by reading the following articles: