Monitor Azure Cosmos DB data by using Azure Monitor aggregated diagnostics logs (preview)

This article describes how to use the CDBDataPlaneRequests5M table to diagnose data in Azure Cosmos DB for NoSQL. This table is part of the aggregated diagnostics logs feature.

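The aggregated diagnostics logs feature summarizes diagnostic data into 5-minute and 15-minute intervals, which substantially reduces cost and improves troubleshooting. Aggregated logs are written to resource-specific tables, which improves schema discoverability, ingestion latency, and overall query efficiency.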

Tip

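Logging a detailed trace for every request at scale can be very expensive. Aggregated diagnostics offer a streamlined, efficient alternative that can reduce logging costs by up to 95%.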

Prerequisites

  • An Azure subscription

    • If you don't have an Azure subscription, create a trial account before you begin
  • An existing Azure Cosmos DB for NoSQL account

  • An existing Azure Monitor Log Analytics workspace

Configure diagnostic settings

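First, enable diagnostic settings. This step is required before you can use the CDBDataPlaneRequests5M table. Configure diagnostics with the following settings and values: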

Setting         Value
Destination     Select the target Log Analytics workspace
Table format    Resource-specific
Categories      DataPlaneRequests5M, DataPlaneRequests15M (aggregated versions only, not per-request)

Warning

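Unless you specifically need detailed per-request logs or query analysis, avoid selecting the classic DataPlaneRequests category. The aggregated tables (CDBDataPlaneRequests5M and CDBDataPlaneRequests15M) offer significant cost savings.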

Query the data

The following queries show what you can do with the aggregated diagnostics logs feature. They address common troubleshooting scenarios.

//1. Are you experiencing spikes in server-side latency?
//2. Was the latency on a particular Operation?
CDBDataPlaneRequests5M
//| where TimeGenerated > now(-6h)
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalDurationInMs=sum(TotalDurationMs), MaxDurationInMs=max(MaxDurationMs), MaxAvgDurationInMs=max(AvgDurationMs) by OperationName, TimeGenerated//, bin(TimeGenerated, 1d)
| render timechart

//3. Was the latency on a particular partition or many partitions?
CDBDataPlaneRequests5M
//| where TimeGenerated > now(-6h)
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalDurationInMs=sum(TotalDurationMs), MaxDurationInMs=max(MaxDurationMs), MaxAvgDurationInMs=max(AvgDurationMs) by PartitionId, TimeGenerated//, bin(TimeGenerated, 1d)
| render timechart

//4. Were you also experiencing throttling? If throttled percentage is above 5% and you are experiencing high latency this is a sign to continue troubleshooting.
CDBDataPlaneRequests5M
//| where TimeGenerated > now(-6h)
//| where OperationName == "Insert from previous step if latency was on a particular operation"
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize throttledOperations=sumif(SampleCount, StatusCode == 429), totalOperations=sum(SampleCount) by TimeGenerated, OperationName
| extend throttledPercentage = 100.0 * throttledOperations / totalOperations // multiply by 100.0 first to avoid integer division
//| summarize count() by  TimeGenerated
//| render timechart
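
The 5% guidance in step 4 can be turned into a filter. The following is a minimal sketch, not an official query: it reuses only the table, columns, and sample database and container names from the queries in this article, and assumes a 6-hour lookback window purely for illustration. It lists the 5-minute intervals in which more than 5% of requests for an operation were throttled.

```kusto
// Sketch: intervals where more than 5% of requests were throttled (HTTP 429).
// The 6-hour window and the sample names are illustrative assumptions.
CDBDataPlaneRequests5M
| where TimeGenerated > ago(6h)
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize throttledOperations = sumif(SampleCount, StatusCode == 429),
            totalOperations = sum(SampleCount)
    by TimeGenerated, OperationName
| where totalOperations > 0
// 100.0 forces floating-point arithmetic so the ratio is not truncated to 0.
| extend throttledPercentage = 100.0 * throttledOperations / totalOperations
| where throttledPercentage > 5
| order by throttledPercentage desc
```

If this returns rows for the same intervals that show high latency in steps 1 through 3, throttling is a likely contributor and worth investigating further.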

//5. Did transaction volume drastically increase/decrease recently?
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalRequests=sum(SampleCount) by TimeGenerated // each row is an aggregate, so sum SampleCount rather than counting rows
| render timechart

//6. Did RU/s per operation increase?
//7. Did RU/s per partition increase?
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalRequestCharge=sum(TotalRequestCharge), MaxRequestCharge=max(MaxRequestCharge), MaxAvgRequestCharge=max(AvgRequestCharge) by OperationName, bin(TimeGenerated, 1d)//, PartitionId
| order by TotalRequestCharge desc
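
Question 7 is covered in the query above only by the commented-out PartitionId grouping. As a hedged sketch of the per-partition variant, the following groups by PartitionId instead of OperationName. It uses only columns that already appear in this article, and the daily bin is carried over from the query above rather than being a recommendation.

```kusto
// Sketch: request charge per partition per day.
// A partition whose total stays well above the others may indicate a hot partition.
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalRequestCharge = sum(TotalRequestCharge),
            MaxRequestCharge = max(MaxRequestCharge)
    by PartitionId, bin(TimeGenerated, 1d)
| order by TotalRequestCharge desc
```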

//8. Was there an increase in payload size for write operations?
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| where OperationName in ("Create", "Upsert", "Delete", "Execute")
| summarize sum(TotalRequestLength) by TimeGenerated, OperationName
| render timechart

//9. Was there an increase in response size for read operations?
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| where OperationName in ("Read", "Query")
| summarize sum(TotalResponseLength) by TimeGenerated, OperationName
| render timechart

//10. Was there an increase in server-side timeouts?
CDBDataPlaneRequests5M
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| where StatusCode == 408
| summarize sum(SampleCount) by TimeGenerated
| render timechart

//11. Was the latency on a particular client or app?
CDBDataPlaneRequests5M
//| where TimeGenerated > now(-6h)
| where DatabaseName == "ContosoDemo" and CollectionName == "Transactions"
| summarize TotalDurationInMs=sum(TotalDurationMs) by UserAgent, ClientIpAddress, TimeGenerated
| render timechart