Troubleshooting API throttling errors

Azure Compute requests may be throttled per subscription and per region to help maintain the overall performance of the service. We ensure that calls to the Azure Compute Resource Provider (CRP), which manages resources under the Microsoft.Compute namespace, don't exceed the maximum allowed API request rate. This document describes API throttling, how to troubleshoot throttling issues, and best practices to avoid being throttled.

Throttling by Azure Resource Manager vs Resource Providers

As the front door to Azure, Azure Resource Manager (ARM) performs authentication, first-order validation, and throttling of all incoming API requests. Azure Resource Manager call rate limits and the related diagnostic response HTTP headers are described in the Azure Resource Manager documentation.

When an Azure API client gets a throttling error, the HTTP status is 429 Too Many Requests. To determine whether the request was throttled by Azure Resource Manager or by an underlying resource provider such as CRP, inspect the x-ms-ratelimit-remaining-subscription-reads response header for GET requests, or the x-ms-ratelimit-remaining-subscription-writes response header for non-GET requests. If the remaining call count is approaching 0, the subscription's general call limit defined by Azure Resource Manager has been reached; activity by all of the subscription's clients counts toward this limit. Otherwise, the throttling comes from the target resource provider (the one addressed by the /providers/<RP> segment of the request URL).
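The decision rule above can be sketched as a small Python helper. This is a minimal illustration, not part of any Azure SDK: the function name classify_throttling is hypothetical, and the threshold of 10 is an assumed cut-off for "approaching 0" that you would tune for your workload. It also assumes that a missing ARM header means the provider did the throttling.

```python
from typing import Mapping


def classify_throttling(method: str, status: int, headers: Mapping[str, str],
                        threshold: int = 10) -> str:
    """Return 'arm', 'resource-provider', or 'not-throttled' for a response.

    `threshold` is an assumed cut-off for "approaching 0", not an Azure value.
    """
    if status != 429:
        return "not-throttled"
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    # ARM tracks separate subscription budgets for reads (GET) and writes (non-GET).
    name = ("x-ms-ratelimit-remaining-subscription-reads"
            if method.upper() == "GET"
            else "x-ms-ratelimit-remaining-subscription-writes")
    remaining = lowered.get(name)
    if remaining is not None and int(remaining) <= threshold:
        return "arm"
    # Assumption: otherwise the target provider (e.g. Microsoft.Compute) throttled it.
    return "resource-provider"
```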

Call rate informational response headers

Header | Value format | Example | Description
x-ms-ratelimit-remaining-resource | <source RP>/<policy or bucket>;<count> | Microsoft.Compute/HighCostGet3Min;159 | Remaining API call count for the throttling policy covering the resource bucket or operation group that includes the target of this request.
x-ms-request-charge | <count> | 1 | Number of calls "charged" for this HTTP request toward the applicable policy's limit. This is most typically 1. Batch requests, such as scaling a virtual machine scale set, can charge multiple counts.
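The value format in the table can be parsed with a few lines of Python. This is a sketch under stated assumptions: parse_remaining_resource and calls_left are hypothetical helper names, and the charge argument stands for the x-ms-request-charge value of the request you intend to repeat.

```python
from typing import NamedTuple


class RemainingResource(NamedTuple):
    provider: str   # source resource provider, e.g. Microsoft.Compute
    policy: str     # throttling policy or bucket name
    remaining: int  # remaining call count for that policy


def parse_remaining_resource(value: str) -> RemainingResource:
    """Parse a '<source RP>/<policy or bucket>;<count>' header value."""
    name, count = value.strip().rsplit(";", 1)
    provider, policy = name.rsplit("/", 1)
    return RemainingResource(provider, policy, int(count))


def calls_left(value: str, charge: int = 1) -> int:
    """Estimate how many more requests with the given charge fit in the policy."""
    if charge < 1:
        raise ValueError("charge must be at least 1")
    return parse_remaining_resource(value).remaining // charge
```

For example, a batch request charged 10 counts against a policy with 3704 remaining leaves room for roughly 370 more such requests in the current window.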

An API request can be subject to multiple throttling policies; the response then carries a separate x-ms-ratelimit-remaining-resource header for each policy.

Here are the throttling headers from a sample response to a request to delete a virtual machine scale set:

x-ms-ratelimit-remaining-resource: Microsoft.Compute/DeleteVMScaleSet3Min;107 
x-ms-ratelimit-remaining-resource: Microsoft.Compute/DeleteVMScaleSet30Min;587 
x-ms-ratelimit-remaining-resource: Microsoft.Compute/VMScaleSetBatchedVMRequests5Min;3704 
x-ms-ratelimit-remaining-resource: Microsoft.Compute/VmssQueuedVMOperations;4720 
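When several of these headers are present, the policy with the smallest remaining count is the one closest to rejecting further calls. A minimal Python sketch, assuming the repeated header values have already been collected into a list (most HTTP client libraries can return repeated headers, but the accessor varies by library); most_constrained_policy is a hypothetical name.

```python
from typing import Iterable, Tuple


def most_constrained_policy(values: Iterable[str]) -> Tuple[str, int]:
    """Return (policy name, remaining) for the policy with the lowest count.

    Each value has the form '<source RP>/<policy or bucket>;<count>'.
    """
    parsed = []
    for value in values:
        name, count = value.strip().rsplit(";", 1)
        parsed.append((name, int(count)))
    if not parsed:
        raise ValueError("no x-ms-ratelimit-remaining-resource headers")
    return min(parsed, key=lambda item: item[1])


headers = [
    "Microsoft.Compute/DeleteVMScaleSet3Min;107",
    "Microsoft.Compute/DeleteVMScaleSet30Min;587",
    "Microsoft.Compute/VMScaleSetBatchedVMRequests5Min;3704",
    "Microsoft.Compute/VmssQueuedVMOperations;4720",
]
print(most_constrained_policy(headers))
# ('Microsoft.Compute/DeleteVMScaleSet3Min', 107)
```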

Throttling error details

The 429 HTTP status is commonly used to reject a request because a call rate limit has been reached. A typical throttling error response from the Compute Resource Provider looks like the following example (only relevant headers are shown):

HTTP/1.1 429 Too Many Requests
x-ms-ratelimit-remaining-resource: Microsoft.Compute/HighCostGet3Min;46
x-ms-ratelimit-remaining-resource: Microsoft.Compute/HighCostGet30Min;0
Retry-After: 1200
Content-Type: application/json; charset=utf-8
{
  "code": "OperationNotAllowed",
  "message": "The server rejected the request because too many requests have been received for this subscription.",
  "details": [
    {
      "code": "TooManyRequests",
      "target": "HighCostGet30Min",
      "message": "{\"operationGroup\":\"HighCostGet30Min\",\"startTime\":\"2018-06-29T19:54:21.0914017+00:00\",\"endTime\":\"2018-06-29T20:14:21.0914017+00:00\",\"allowedRequestCount\":800,\"measuredRequestCount\":1238}"
    }
  ]
}

The policy whose remaining call count is 0 is the one that caused the throttling error; in this case, HighCostGet30Min. The overall format of the response body is the general Azure Resource Manager API error format (conformant with OData). The main error code, OperationNotAllowed, is the one the Compute Resource Provider uses to report throttling errors (among other types of client errors). The message property of each inner error contains a serialized JSON structure with the details of the throttling violation.
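The nested structure described above can be unpacked with Python's standard json module. This sketch assumes the body matches the sample shown earlier; it also tolerates an {"error": {...}} envelope in case a response wraps the error object. throttling_violations is a hypothetical helper name, and inner errors whose message is not a JSON object are skipped.

```python
import json
from typing import Any, Dict, List


def throttling_violations(body: str) -> List[Dict[str, Any]]:
    """Extract throttling details from a Compute Resource Provider error body."""
    payload = json.loads(body)
    # Accept a bare error object (as in the sample) or an {"error": {...}} envelope.
    error = payload.get("error", payload)
    violations = []
    for detail in error.get("details", []):
        if detail.get("code") != "TooManyRequests":
            continue
        try:
            # The inner message is itself a serialized JSON object.
            info = json.loads(detail.get("message", ""))
        except (json.JSONDecodeError, TypeError):
            continue
        if not isinstance(info, dict):
            continue
        violations.append({
            "operationGroup": info.get("operationGroup", detail.get("target")),
            "allowed": info.get("allowedRequestCount"),
            "measured": info.get("measuredRequestCount"),
            "endTime": info.get("endTime"),
        })
    return violations
```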

As illustrated above, every throttling error includes the Retry-After header, which gives the minimum number of seconds the client should wait before retrying the request.
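A retry helper that honors Retry-After might look like the following Python sketch. It assumes Retry-After carries a number of seconds, as in the sample above (HTTP also allows an HTTP-date form, which this sketch does not parse), and falls back to capped exponential backoff with full jitter when the header is absent. retry_delay and its base and cap defaults are hypothetical choices, not Azure-prescribed values.

```python
import random
from typing import Mapping, Optional


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 2.0, cap: float = 300.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (counting from 0)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("retry-after")
    if value is not None:
        try:
            # Retry-After is the minimum wait; never retry sooner than this.
            return max(0.0, float(int(value)))
        except ValueError:
            pass  # e.g. an HTTP-date; fall through to backoff
    rng = rng or random.Random()
    # Full-jitter exponential backoff, capped at `cap` seconds.
    return rng.uniform(0, min(cap, base * (2 ** attempt)))
```

In a retry loop, pass the result to time.sleep() and retry only errors that are actually retryable, such as 429.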

API call rate and throttling error analyzer

A preview version of a troubleshooting feature is available for the Compute Resource Provider's API. It consists of PowerShell cmdlets that report the API request rate per time interval per operation, and throttling violations per operation group (policy).

These API call statistics provide insight into the behavior of a subscription's clients and make it easy to identify call patterns that cause throttling.

A current limitation of the analyzer is that it does not count requests for disk and snapshot resource types (which support managed disks). Because it gathers data from CRP's telemetry, it also cannot help identify throttling errors from ARM. However, those errors are easy to identify from the distinctive ARM response headers, as discussed earlier.

The PowerShell cmdlets use a REST service API that clients can also call directly (though it is not yet formally supported). To see the HTTP request format, run the cmdlets with the -Debug switch or capture their traffic with Fiddler.

Best practices

  • Do not retry Azure service API errors unconditionally or immediately. A common failure mode is client code entering a rapid retry loop when it encounters an error that is not retryable. Such retries eventually exhaust the allowed call limit for the target operation's group and affect other clients of the subscription.
  • In high-volume API automation scenarios, consider implementing proactive client-side self-throttling when the available call count for a target operation group drops below a low threshold.
  • When tracking async operations, respect the Retry-After header hints.
  • If the client code needs information about a particular virtual machine, query that VM directly instead of listing all VMs in the containing resource group or the entire subscription and then picking the needed VM on the client side.
  • If client code needs VMs, disks, and snapshots from a specific Azure location, use the location-based form of the query instead of querying all subscription VMs and then filtering by location on the client side: GET /subscriptions/<subId>/providers/Microsoft.Compute/locations/<location>/virtualMachines?api-version=2017-03-30, which is served by the Compute Resource Provider's regional endpoints.
  • When creating or updating API resources, in particular VMs and virtual machine scale sets, it is far more efficient to track the returned async operation to completion than to poll the resource URL itself (based on provisioningState).
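The self-throttling recommendation can be sketched as a small gate that records the latest x-ms-ratelimit-remaining-resource count per policy and pauses before the next call when any count falls below a threshold. This is an illustrative Python sketch only: SelfThrottle, the threshold of 20, and the fixed 30-second pause are assumptions to tune, not Azure guidance, and a production client would also account for each policy's window length.

```python
import time
from typing import Callable, Dict, Iterable


class SelfThrottle:
    """Pause outgoing calls when a throttling policy's remaining count runs low."""

    def __init__(self, threshold: int = 20, pause_seconds: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self._sleep = sleep
        self.remaining: Dict[str, int] = {}

    def observe(self, values: Iterable[str]) -> None:
        """Record x-ms-ratelimit-remaining-resource values from a response."""
        for value in values:
            policy, count = value.strip().rsplit(";", 1)
            self.remaining[policy] = int(count)

    def before_request(self) -> bool:
        """Sleep if any tracked policy is below the threshold; return True if paused."""
        low = [p for p, n in self.remaining.items() if n < self.threshold]
        if not low:
            return False
        self._sleep(self.pause_seconds)
        # Drop the stale counts; the next response will refresh them.
        for policy in low:
            del self.remaining[policy]
        return True
```

Call observe() with the headers of every response and before_request() before each new call; injecting the sleep function keeps the gate easy to test.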