排查和识别 Azure Database for PostgreSQL 灵活服务器中运行缓慢的查询的问题

2025/07/07

适用于： Azure Database for PostgreSQL 灵活服务器

本文介绍如何识别和诊断查询运行缓慢的根本原因。

本文介绍以下内容：

如何识别运行缓慢的查询。
如何识别运行速度缓慢的过程。在属于同一运行缓慢的存储过程的查询列表中识别运行缓慢的查询。

先决条件

按照使用故障排除指南中所述的步骤启用故障排除指南。
通过将 auto_explain 扩展列入允许列表并加载该扩展来对其进行配置。
配置 auto_explain 扩展后，更改以下服务器参数，这些参数可控制扩展的行为：
- auto_explain.log_analyze 至 ON。
- auto_explain.log_buffers 至 ON。
- 设置适合你的方案的 auto_explain.log_min_duration。
- auto_explain.log_timing 至 ON。
- auto_explain.log_verbose 至 ON。

备注

如果将 auto_explain.log_min_duration 设置为 0，扩展会开始记录服务器上执行的所有查询。这可能会影响数据库的性能。必须进行适当的尽职调查，才能分析出怎样的值才被视为服务器运行速度缓慢。例如，如果所有查询在 30 秒内完成，并且应用程序可以接受，则建议将参数更新为 30000 毫秒。这样就可以记录完成时间超过 30 秒的所有查询。

识别速度缓慢的查询

有了故障排除指南和 auto_explain 扩展，我们可以通过一个示例来描述该场景。

我们遇到 CPU 使用率飙升到 90% 的情况，想要确定造成飙升的原因。要调试方案，请执行以下步骤：

一旦收到有关 CPU 的警报，请在受影响的 Azure Database for PostgreSQL 灵活服务器实例的资源菜单中，在‘监视’部分下选择‘故障排除指南’。
选择 CPU 选项卡。故障排除指南 “优化高 CPU 使用率” 随即打开。
在“分析时段(当地时间)”下拉列表中，选择要要重点分析的时间范围。
选择“查询”选项卡。其中显示了在 CPU 使用率达到 90% 的时间间隔内运行的所有查询的详细信息。从快照来看，该时间间隔内平均执行时间最慢的查询约为 2.6 分钟，该查询在该时间间隔内运行了 22 次。该查询很可能是 CPU 飙升的原因。
若要检索实际查询文本，请连接到 azure_sys 数据库并执行以下查询。

    psql -h <server>.postgres.database.chinacloudapi.cn -U <user> -d azure_sys

     SELECT query_sql_text
     FROM query_store.query_texts_view
     WHERE query_text_id = <query_id>;

在所考虑的示例中，发现运行缓慢的查询是：

SELECT  c_id, SUM(c_balance) AS total_balance
FROM customer
GROUP BY c_w_id, c_id
ORDER BY c_w_id;

若要了解生成的确切说明计划，请使用 Azure Database for PostgreSQL 灵活服务器日志。每次查询在该时间窗口内完成执行时，auto_explain 扩展就会在日志中写入一个条目。在资源菜单中的“监视”部分下，选择“日志”。
选择 CPU 利用率达到 90% 的时间范围。
执行以下查询，以检索所识别查询的 EXPLAIN ANALYZE 输出。

AzureDiagnostics
| where Category contains  'PostgreSQLLogs'
| where Message contains "<snippet of SQL text identified or name of any of the tables referenced in the query>"
| project TimeGenerated, Message

消息列存储执行计划，如以下输出所示：

2024-11-10 19:56:46 UTC-6525a8e7.2e3d-LOG: duration: 150692.864 ms plan:

Query Text: SELECT c_id, SUM(c_balance) AS total_balance
FROM customer
GROUP BY c_w_id,c_id
order by c_w_id;
GroupAggregate (cost=1906820.83..2131820.83 rows=10000000 width=40) (actual time=70930.833..129231.950 rows=10000000 loops=1)
Output: c_id, sum(c_balance), c_w_id
Group Key: customer.c_w_id, customer.c_id
Buffers: shared hit=44639 read=355362, temp read=77521 written=77701
-> Sort (cost=1906820.83..1931820.83 rows=10000000 width=13) (actual time=70930.821..81898.430 rows=10000000 loops=1)
Output: c_id, c_w_id, c_balance
Sort Key: customer.c_w_id, customer.c_id
Sort Method: external merge Disk: 225104kB
Buffers: shared hit=44639 read=355362, temp read=77521 written=77701
-> Seq Scan on public.customer (cost=0.00..500001.00 rows=10000000 width=13) (actual time=0.021..22997.697 rows=10000000 loops=1)
Output: c_id, c_w_id, c_balance

查询运行了大约 2.5 分钟，如故障排除指南中所示。提取的执行计划输出中的 duration 值为 150692.864 毫秒，这证实了问题所在。使用 EXPLAIN ANALYZE 的输出进一步排除故障并优化查询。

备注

观察查询在间隔期间运行了 22 次，并且显示的日志包含间隔期间捕获的此类条目。

识别存储过程中运行速度缓慢的查询

有了故障排除指南和 auto_explain 扩展，我们将借助示例来解释该场景。

我们遇到 CPU 使用率飙升到 90% 的情况，想要确定造成飙升的原因。要调试方案，请执行以下步骤：

一旦收到有关 CPU 的警报，请在受影响的 Azure Database for PostgreSQL 灵活服务器实例的资源菜单中，在‘监视’部分下选择‘故障排除指南’。
选择 CPU 选项卡。故障排除指南 “优化高 CPU 使用率” 随即打开。
在“分析时段(当地时间)”下拉列表中，选择要要重点分析的时间范围。
选择“查询”选项卡。其中显示了在 CPU 使用率达到 90% 的时间间隔内运行的所有查询的详细信息。从快照来看，该时间间隔内平均执行时间最慢的查询约为 2.6 分钟，该查询在该时间间隔内运行了 22 次。该查询很可能是 CPU 飙升的原因。
连接到 azure_sys 数据库并执行查询，以使用以下脚本检索实际查询文本。

    psql -h <server>.postgres.database.chinacloudapi.cn -U <user> -d azure_sys

     SELECT query_sql_text
     FROM query_store.query_texts_view
     WHERE query_text_id = <query_id>;

在所考虑的示例中，发现速度缓慢的查询是一个存储过程：

    call autoexplain_test ();

若要了解生成的确切说明计划，请使用 Azure Database for PostgreSQL 灵活服务器日志。每次查询在该时间窗口内完成执行时，auto_explain 扩展就会在日志中写入一个条目。在资源菜单中的“监视”部分下，选择“日志”。然后，在“时间范围:”中，选择要集中分析的时间窗口。

执行以下查询，以检索所标识查询的说明分析输出。

AzureDiagnostics
| where Category contains  'PostgreSQLLogs'
| where Message contains "<snippet of SQL text identified or name of any of the tables referenced in the query>"
| project TimeGenerated, Message

该过程有多个查询，下面突出显示了这些查询。记录存储过程中使用的每个查询的解释分析输出，以进一步进行分析和排除故障。记录的查询的执行时间可用于识别存储过程中最慢的查询。

2024-11-10 17:52:45 UTC-6526d7f0.7f67-LOG: duration: 38459.176 ms plan:

Query Text: insert into customer_balance SELECT c_id, SUM(c_balance) AS total_balance FROM customer GROUP BY c_w_id,c_id order by c_w_id
Insert on public.customer_balance (cost=1906820.83..2231820.83 rows=0 width=0) (actual time=38459.173..38459.174 rows=0 loops=1)Buffers: shared hit=10108203 read=454055 dirtied=54058, temp read=77521 written=77701 WAL: records=10000000 fpi=1 bytes=640002197
    -> Subquery Scan on "*SELECT*" (cost=1906820.83..2231820.83 rows=10000000 width=36) (actual time=20415.738..29514.742 rows=10000000 loops=1)
        Output: "*SELECT*".c_id, "*SELECT*".total_balance Buffers: shared hit=1 read=400000, temp read=77521 written=77701
            -> GroupAggregate (cost=1906820.83..2131820.83 rows=10000000 width=40) (actual time=20415.737..28574.266 rows=10000000 loops=1)
                Output: customer.c_id, sum(customer.c_balance), customer.c_w_id Group Key: customer.c_w_id, customer.c_id Buffers: shared hit=1 read=400000, temp read=77521 written=77701
                -> Sort (cost=1906820.83..1931820.83 rows=10000000 width=13) (actual time=20415.723..22023.515 rows=10000000 loops=1)
                    Output: customer.c_id, customer.c_w_id, customer.c_balance Sort Key: customer.c_w_id, customer.c_id Sort Method: external merge Disk: 225104kB Buffers: shared hit=1 read=400000, temp read=77521 written=77701
                     -> Seq Scan on public.customer (cost=0.00..500001.00 rows=10000000 width=13) (actual time=0.310..15061.471 rows=10000000 loops=1) Output: customer.c_id, customer.c_w_id, customer.c_balance Buffers: shared hit=1 read=400000

2024-11-10 17:52:07 UTC-6526d7f0.7f67-LOG: duration: 61939.529 ms plan:
Query Text: delete from customer_balance
Delete on public.customer_balance (cost=0.00..799173.51 rows=0 width=0) (actual time=61939.525..61939.526 rows=0 loops=1) Buffers: shared hit=50027707 read=620942 dirtied=295462 written=71785 WAL: records=50026565 fpi=34 bytes=2711252160
    -> Seq Scan on public.customer_balance (cost=0.00..799173.51 rows=15052451 width=6) (actual time=3185.519..35234.061 rows=50000000 loops=1)
        Output: ctid Buffers: shared hit=27707 read=620942 dirtied=26565 written=71785 WAL: records=26565 fpi=34 bytes=11252160


2024-11-10 17:51:05 UTC-6526d7f0.7f67-LOG: duration: 10387.322 ms plan:
Query Text: select max(c_id) from customer
Finalize Aggregate (cost=180185.84..180185.85 rows=1 width=4) (actual time=10387.220..10387.319 rows=1 loops=1) Output: max(c_id) Buffers: shared hit=37148 read=1204 written=69
    -> Gather (cost=180185.63..180185.84 rows=2 width=4) (actual time=10387.214..10387.314 rows=1 loops=1)
        Output: (PARTIAL max(c_id)) Workers Planned: 2 Workers Launched: 0 Buffers: shared hit=37148 read=1204 written=69
        -> Partial Aggregate (cost=179185.63..179185.64 rows=1 width=4) (actual time=10387.012..10387.013 rows=1 loops=1) Output: PARTIAL max(c_id) Buffers: shared hit=37148 read=1204 written=69
            -> Parallel Index Only Scan using customer_i1 on public.customer (cost=0.43..168768.96 rows=4166667 width=4) (actual time=0.446..7676.356 rows=10000000 loops=1)
               Output: c_w_id, c_d_id, c_id Heap Fetches: 24 Buffers: shared hit=37148 read=1204 written=69

备注

为了便于演示，仅显示该过程中使用的少数查询的 EXPLAIN ANALYZE 输出。思路是可以收集日志中所有查询的 EXPLAIN ANALYZE 输出，识别其中最慢的查询，并尝试优化它们。

通过

排查和识别 Azure Database for PostgreSQL 灵活服务器中运行缓慢的查询的问题

先决条件

识别速度缓慢的查询

识别存储过程中运行速度缓慢的查询

相关内容

其他资源