Troubleshooting Azure autoscale

Azure Monitor autoscale helps you run the right amount of resources to handle the load on your application. It lets you add resources to handle increases in load and save money by removing resources that are sitting idle. You can scale based on a schedule, a fixed date-time, or a resource metric you choose. For more information, see Autoscale Overview.

The autoscale service provides metrics and logs that help you understand which scale actions occurred and how the conditions that led to those actions were evaluated. You can find answers to questions such as:

  • Why did my service scale out or in?
  • Why did my service not scale?
  • Why did an autoscale action fail?
  • Why is an autoscale action taking a long time to scale?

Autoscale metrics

Autoscale provides four metrics to help you understand its operation.

  • Observed Metric Value - The value of the metric you chose to take the scale action on, as seen or computed by the autoscale engine. Because a single autoscale setting can have multiple rules and therefore multiple metric sources, you can filter using "metric source" as a dimension.
  • Metric Threshold - The threshold you set to take the scale action. Because a single autoscale setting can have multiple rules and therefore multiple metric sources, you can filter using "metric rule" as a dimension.
  • Observed Capacity - The number of active instances of the target resource as seen by the autoscale engine.
  • Scale Actions Initiated - The number of scale-out and scale-in actions initiated by the autoscale engine. You can filter by scale-out versus scale-in actions.

You can use Metrics Explorer to chart all of the above metrics in one place. The chart should show:

  • the actual metric
  • the metric as seen or computed by the autoscale engine
  • the threshold for a scale action
  • the change in capacity

Example 1 - Analyzing a simple autoscale rule

We have a simple autoscale setting for a virtual machine scale set that:

  • scales out when the average CPU percentage of the set is greater than 70% for 10 minutes
  • scales in when the CPU percentage of the set is less than 5% for more than 10 minutes
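
The rule pair above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the autoscale engine's actual algorithm: the function name `decide_scale_action`, the one-sample-per-minute input, and the plain average over the window are all hypothetical choices made for the example.

```python
from statistics import mean

SCALE_OUT_THRESHOLD = 70.0  # percent CPU, from the scale-out rule
SCALE_IN_THRESHOLD = 5.0    # percent CPU, from the scale-in rule
WINDOW_MINUTES = 10         # duration both rules look at


def decide_scale_action(cpu_samples):
    """Return "scale-out", "scale-in", or "none".

    cpu_samples is a hypothetical list of per-minute average CPU
    percentages for the scale set, oldest first.
    """
    if len(cpu_samples) < WINDOW_MINUTES:
        return "none"  # not enough data to cover the window
    window_average = mean(cpu_samples[-WINDOW_MINUTES:])
    if window_average > SCALE_OUT_THRESHOLD:
        return "scale-out"
    if window_average < SCALE_IN_THRESHOLD:
        return "scale-in"
    return "none"


print(decide_scale_action([85.0] * 10))  # scale-out
print(decide_scale_action([2.0] * 10))   # scale-in
print(decide_scale_action([40.0] * 10))  # none
```

Comparing the value this kind of check would see against the charted Observed Metric Value is a quick way to confirm whether a rule should have fired.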

Let's review the metrics from the autoscale service.

Figure 1a - Percentage CPU metric for the virtual machine scale set and the Observed Metric Value metric for the autoscale setting

Figure 1b - Metric Threshold and Observed Capacity

In figure 1b, the Metric Threshold (light blue line) for the scale-out rule is 70. The Observed Capacity (dark blue line) shows the number of active instances, which is currently 3.

Note

You need to filter the Metric Threshold by the metric trigger rule dimension: select the scale-out (increase) rule to see the scale-out threshold, and the scale-in (decrease) rule to see the scale-in threshold.

Example 2 - Advanced autoscaling for a virtual machine scale set

We have an autoscale setting that allows a virtual machine scale set resource to scale out based on its own metric, Outbound Flows. Notice that the "divide metric by instance count" option for the metric threshold is checked.

The scale action rule is:

If the value of Outbound Flows per instance is greater than 10, then the autoscale service should scale out by 1 instance.

In this case, the autoscale engine's observed metric value is calculated as the actual metric value divided by the number of instances. If the observed metric value is less than the threshold, no scale-out action is initiated.

Figure 2 - Virtual machine scale set autoscale metrics charts example

In figure 2, you can see two metric charts.

The chart on top shows the actual value of the Outbound Flows metric. The actual value is 6.

The chart on the bottom shows several values:

  • The Observed Metric Value (light blue) is 3 because there are 2 active instances and 6 divided by 2 is 3.
  • The Observed Capacity (purple) shows the instance count seen by the autoscale engine.
  • The Metric Threshold (light green) is set to 10.
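
The arithmetic behind these chart values can be reproduced with a short Python sketch. The function name `observed_metric_value` is hypothetical and only mirrors the "divide metric by instance count" behavior described above; the numbers are the ones read from figure 2.

```python
def observed_metric_value(actual_value, instance_count):
    """Per-instance value when "divide metric by instance count" is checked."""
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    return actual_value / instance_count


actual_outbound_flows = 6  # actual metric value, top chart in figure 2
active_instances = 2       # Observed Capacity
threshold = 10             # Metric Threshold of the scale-out rule

observed = observed_metric_value(actual_outbound_flows, active_instances)
should_scale_out = observed > threshold

print(observed, should_scale_out)  # 3.0 False
```

Because 3 is below the threshold of 10, no scale-out action is expected in this situation.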

If there are multiple scale action rules, you can use splitting or the add filter option in the Metrics Explorer chart to look at a metric by a specific source or rule. For more information on splitting a metric chart, see Advanced features of metric charts - splitting.

Example 3 - Understanding autoscale events

In the autoscale setting screen, go to the Run history tab to see the most recent scale actions. The tab also shows the change in Observed Capacity over time. To find more details about all autoscale actions, including operations such as updating or deleting autoscale settings, view the activity log and filter by autoscale operations.

Autoscale setting - Run history

Autoscale resource logs

Like any other Azure resource, the autoscale service provides resource logs. There are two categories of logs.

  • Autoscale Evaluations - The autoscale engine records a log entry for every condition evaluation each time it does a check. The entry includes details on the observed values of the metrics, the rules evaluated, and whether the evaluation resulted in a scale action.

  • Autoscale Scale Actions - The engine records scale action events initiated by the autoscale service and the results of those scale actions (success, failure, and how much scaling occurred as seen by the autoscale service).

As with any service supported by Azure Monitor, you can use Diagnostic Settings to route these logs:

  • to your Log Analytics workspace for detailed analytics
  • to Event Hubs and then to non-Azure tools
  • to your Azure storage account for archival

Autoscale diagnostic settings

The previous picture shows the autoscale diagnostic settings in the Azure portal. There you can select the Diagnostic/Resource Logs tab and enable log collection and routing. You can also do the same using the REST API, CLI, PowerShell, or Resource Manager templates for Diagnostic Settings by choosing Microsoft.Insights/AutoscaleSettings as the resource type.
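
As a rough illustration of the non-portal route, the Python sketch below builds a request body of the kind a diagnostic setting might use to send both log categories to a Log Analytics workspace. It rests on assumptions: the helper name `build_diagnostic_setting` is hypothetical, the category identifiers `AutoscaleEvaluations` and `AutoscaleScaleActions` are assumed to match the two categories described above, and the workspace resource ID is a placeholder. Check the Diagnostic Settings reference for the exact schema before using it.

```python
import json


def build_diagnostic_setting(workspace_resource_id):
    """Build a diagnostic-setting body that routes both autoscale log categories."""
    return {
        "properties": {
            # Destination workspace (placeholder value supplied by the caller).
            "workspaceId": workspace_resource_id,
            "logs": [
                # Category names are assumptions based on the list above.
                {"category": "AutoscaleEvaluations", "enabled": True},
                {"category": "AutoscaleScaleActions", "enabled": True},
            ],
        }
    }


workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
)
body = build_diagnostic_setting(workspace_id)
print(json.dumps(body, indent=2))
```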

Troubleshooting using autoscale logs

For the best troubleshooting experience, we recommend routing your logs to Azure Monitor Logs (Log Analytics) through a workspace when you create the autoscale setting. This process is shown in the picture in the previous section. Log Analytics makes it easier to validate the evaluations and scale actions.

Once you have configured your autoscale logs to be sent to the Log Analytics workspace, you can run the following queries to check the logs.

To get started, try this query to view the most recent autoscale evaluation logs:

AutoscaleEvaluationsLog
| top 50 by TimeGenerated desc

Or try the following query to view the most recent scale action logs:

AutoscaleScaleActionsLog
| top 50 by TimeGenerated desc

Use the following sections to answer these questions.

A scale action occurred that I didn't expect

First, run the scale action query to find the scale action you are interested in. If it is the latest scale action, use the following query:

AutoscaleScaleActionsLog
| top 1 by TimeGenerated desc

Select the CorrelationId field from the scale actions log. Use the CorrelationId to find the matching evaluation log. The following query displays all the rules and conditions that were evaluated leading up to that scale action.

AutoscaleEvaluationsLog
| where CorrelationId == "<correlationId>"
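
The correlation step can also be pictured outside of Log Analytics. The Python sketch below works on rows that have already been exported as dictionaries; the helper name `evaluations_for_action` and the sample rows (including their IDs and the `Succeeded` result value) are hypothetical and hand-written for illustration only.

```python
def evaluations_for_action(scale_action, evaluation_records):
    """Return the evaluation records that share the scale action's CorrelationId."""
    correlation_id = scale_action["CorrelationId"]
    return [
        record for record in evaluation_records
        if record["CorrelationId"] == correlation_id
    ]


# Hypothetical sample rows; real log rows have many more columns.
scale_action = {"CorrelationId": "aaaa-1111", "ResultType": "Succeeded"}
evaluations = [
    {"CorrelationId": "aaaa-1111", "OperationName": "MetricEvaluation"},
    {"CorrelationId": "bbbb-2222", "OperationName": "MetricEvaluation"},
    {"CorrelationId": "aaaa-1111", "OperationName": "ScaleRuleEvaluation"},
]

matching = evaluations_for_action(scale_action, evaluations)
print([row["OperationName"] for row in matching])
# ['MetricEvaluation', 'ScaleRuleEvaluation']
```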

What profile caused a scale action?

A scale action occurred, but you have overlapping rules and profiles and need to track down which one caused the action.

Find the CorrelationId of the scale action (as explained in the previous section), and then run the following query on the evaluation logs to learn more about the profile.

AutoscaleEvaluationsLog
| where CorrelationId == "<correlationId_Guid>"
| where ProfileSelected == true
| project ProfileEvaluationTime, Profile, ProfileSelected, EvaluationResult

You can also get a better picture of the whole profile evaluation with the following query:

AutoscaleEvaluationsLog
| where TimeGenerated > ago(2h)
| where OperationName contains "profileEvaluation"
| project OperationName, Profile, ProfileEvaluationTime, ProfileSelected, EvaluationResult

A scale action did not occur

I expected a scale action and it did not occur. There may be no scale action events or logs.

Review the autoscale metrics if you are using a metric-based scale rule. The Observed Metric Value or Observed Capacity might not be what you expected, in which case the scale rule did not fire. You would still see evaluations, but not a scale-out rule being triggered. The cool-down time might also have kept a scale action from occurring.
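
The cool-down effect can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the engine's actual logic: the helper name `blocked_by_cooldown`, the timestamps, and the 5-minute cool-down are hypothetical.

```python
from datetime import datetime, timedelta


def blocked_by_cooldown(last_scale_action, now, cooldown):
    """Return True while the cool-down window after the last scale action is open."""
    return now - last_scale_action < cooldown


last_action = datetime(2024, 1, 1, 12, 0)  # hypothetical time of the last scale action
cooldown = timedelta(minutes=5)            # hypothetical cool-down setting

print(blocked_by_cooldown(last_action, datetime(2024, 1, 1, 12, 3), cooldown))  # True
print(blocked_by_cooldown(last_action, datetime(2024, 1, 1, 12, 6), cooldown))  # False
```

If the time you expected a scale action falls inside such a window, the rule can evaluate as true without a new scale action being started.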

Review the autoscale evaluation logs for the time period in which you expected the scale action to occur. Look at all the evaluations the autoscale service performed and why it decided not to trigger a scale action.

AutoscaleEvaluationsLog
| where TimeGenerated > ago(2h)
| where OperationName == "MetricEvaluation" or OperationName == "ScaleRuleEvaluation"
| project OperationName, MetricData, ObservedValue, Threshold, EstimateScaleResult

Scale action failed

There may be cases where the autoscale service took the scale action, but the system decided not to scale or failed to complete the scale action. Use this query to find the failed scale actions.

AutoscaleScaleActionsLog
| where ResultType == "Failed"
| project ResultDescription

Create alert rules to get notified of autoscale actions or failures. You can also create alert rules to get notified of autoscale events.

Schema of autoscale resource logs

For more information, see autoscale resource logs.

Next steps

Read about autoscale best practices.