This article describes how to identify the root cause of high CPU utilization. It also provides possible remedial actions to control CPU utilization when using Elastic clusters in Azure Database for PostgreSQL.
In this article, you learn about:
- How to use tools like Azure Metrics, pg_stat_statements, citus_stat_activity, and pg_stat_activity to identify high CPU utilization.
- How to identify root causes, such as long-running queries and total connections.
- How to resolve high CPU utilization by using EXPLAIN ANALYZE and vacuuming tables.
Tools to identify high CPU utilization
Use the following tools to identify high CPU utilization:
Azure Metrics
Azure Metrics is a good starting point for checking CPU utilization over a specific period. Metrics provide information about the resources used during the period you're monitoring. Use the Apply splitting option with Split by Server Name to view details for each individual node in your elastic cluster. You can then compare Write IOPS, Read IOPS, Read Throughput Bytes/Sec, and Write Throughput Bytes/Sec against CPU percent to see how individual nodes perform when your workload consumes high CPU.
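As an alternative to the portal, you can retrieve the same metric from the Azure CLI. The following sketch assumes the az CLI is installed and signed in; the resource ID is a placeholder that you replace with your elastic cluster's actual ID.

```shell
# Placeholder resource ID; substitute your subscription, resource group, and cluster name.
RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<cluster-name>"

# List average CPU percent in 5-minute intervals (defaults to the most recent hour).
az monitor metrics list \
  --resource "$RESOURCE_ID" \
  --metric cpu_percent \
  --interval PT5M \
  --aggregation Average
```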
After you identify a particular node (or nodes) with higher than expected CPU utilization, you can connect directly to one or more nodes in question and perform a more in-depth analysis by using the following Postgres tools:
pg_stat_statements
The pg_stat_statements extension helps identify queries that consume time on the server. For more information about this extension, see the detailed documentation.
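If the extension isn't yet enabled on the database you're inspecting, you can typically create it and reset its counters before a measurement window. This assumes pg_stat_statements is listed in shared_preload_libraries on the server; if it isn't, the extension can't collect statistics.

```sql
-- Enable the extension on the current database (no-op if it already exists).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally reset the counters so the next reading reflects a clean window.
SELECT pg_stat_statements_reset();
```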
Calls/Mean and total execution time
The following query returns the top five SQL statements by highest total execution time:
SELECT userid::regrole, dbid, query, total_exec_time, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
pg_stat_activity
The pg_stat_activity view shows the queries that are currently running on the specific node. Use it to monitor active queries, sessions, and states on that node.
SELECT *, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
citus_stat_activity
The citus_stat_activity view is a superset of pg_stat_activity. It shows the distributed queries that are running on all nodes. It also shows tasks specific to subqueries dispatched to workers, task state, and worker nodes.
SELECT *, now() - xact_start AS duration
FROM citus_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
Identify root causes
If CPU consumption levels are high, the following scenarios might be the root causes:
Long-running transactions on a specific node
Long-running transactions consume CPU resources and lead to high CPU utilization.
The following query provides information on long-running transactions:
SELECT
pid,
datname,
usename,
application_name,
client_addr,
backend_start,
query_start,
now() - query_start AS duration,
state,
wait_event,
wait_event_type,
query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY now() - query_start DESC;
Long-running transactions on all nodes
Long-running transactions consume CPU resources and lead to high CPU utilization.
The following query provides information on long-running transactions across all nodes:
SELECT
global_pid, pid,
nodeid,
datname,
usename,
application_name,
client_addr,
backend_start,
query_start,
now() - query_start AS duration,
state,
wait_event,
wait_event_type,
query
FROM citus_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY now() - query_start DESC;
Slow query
Slow queries consume CPU resources and cause high CPU utilization.
The following query helps you identify the queries with the longest run times:
SELECT
query,
calls,
mean_exec_time,
total_exec_time,
rows,
shared_blks_hit,
shared_blks_read,
shared_blks_dirtied,
shared_blks_written,
temp_blks_read,
temp_blks_written,
wal_records,
wal_fpi,
wal_bytes
FROM pg_stat_statements
WHERE query ILIKE '%select%' OR query ILIKE '%insert%' OR query ILIKE '%update%' OR query ILIKE '%delete%' OR queryid = <queryid>
ORDER BY total_exec_time DESC;
Total number of connections and number of connections by state on a node
Many connections to the database lead to increased CPU utilization.
The following query provides information about the number of connections by state on a single node:
SELECT state, COUNT(*)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
Total number of connections and number of connections by state on all nodes
Many connections to the database lead to increased CPU utilization.
The following query gives information about the number of connections by state across all nodes:
SELECT state, COUNT(*)
FROM citus_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
Vacuum and table stats
Keeping table statistics up to date helps improve query performance. Monitor whether regular autovacuuming is happening.
The following query helps you identify the tables that need vacuuming:
SELECT *
FROM run_command_on_all_nodes($$
SELECT json_agg(t)
FROM (
SELECT schemaname, relname
,n_live_tup, n_dead_tup
,n_dead_tup::numeric / n_live_tup AS bloat
,last_autovacuum, last_autoanalyze
,last_vacuum, last_analyze
FROM pg_stat_user_tables
WHERE n_live_tup > 0 AND relname LIKE '%orders%'
ORDER BY n_dead_tup DESC
) t
$$);
In the output of the preceding query, the result column is a JSON value that contains the table stats for each node.
The last_autovacuum and last_autoanalyze columns provide the date and time when the table was last autovacuumed or analyzed. If the tables aren't autovacuumed regularly, take steps to tune autovacuum.
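One way to tune autovacuum for a specific table is to lower its scale factors, so that vacuum and analyze trigger after a smaller fraction of rows change. The table name and thresholds below are illustrative:

```sql
-- Hypothetical table; vacuum after ~5% of rows are dead instead of the 20% default.
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);

-- Analyze more eagerly as well, so the planner's statistics stay fresh.
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.02);
```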
The following query provides information about the amount of bloat at the schema level:
SELECT *
FROM run_command_on_all_nodes($$
SELECT json_agg(t) FROM (
SELECT schemaname, sum(n_live_tup) AS live_tuples
, sum(n_dead_tup) AS dead_tuples
FROM pg_stat_user_tables
WHERE n_live_tup > 0
GROUP BY schemaname
ORDER BY sum(n_dead_tup) DESC
) t
$$);
Resolve high CPU utilization
Use EXPLAIN ANALYZE to examine slow queries, and terminate transactions that run longer than expected. To resolve high CPU utilization, also consider using the built-in PgBouncer connection pooler and clearing excessive bloat.
Use EXPLAIN ANALYZE
After you identify the queries that consume the most CPU, use EXPLAIN ANALYZE to investigate and tune them further.
For more information about the EXPLAIN ANALYZE command, see its documentation.
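For example, a slow statement surfaced by pg_stat_statements can be rerun under EXPLAIN to see its actual plan and timings. The table and filter here are illustrative:

```sql
-- BUFFERS shows shared-buffer hits vs. disk reads alongside per-node timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE order_date > now() - interval '7 days';
```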
Terminate long-running transactions on a node
Consider terminating a long-running transaction if it runs longer than expected.
To terminate a session, first find its PID by using the following query:
SELECT
pid,
datname,
usename,
application_name,
client_addr,
backend_start,
query_start,
now() - query_start AS duration,
state,
wait_event,
wait_event_type,
query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY now() - query_start DESC;
You can also filter by other properties like usename (user name), datname (database name), and more.
After you get the session's PID, terminate it by using the following query:
SELECT pg_terminate_backend(pid);
Terminating the PID ends that session on the node.
Terminate long-running transactions on all nodes
Consider terminating a long-running transaction if it runs longer than expected.
To terminate a session, find its pid and global_pid by using the following query:
SELECT
global_pid,
pid,
nodeid,
datname,
usename,
application_name,
client_addr,
backend_start,
query_start,
now() - query_start AS duration,
state,
wait_event,
wait_event_type,
query
FROM citus_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY now() - query_start DESC;
You can also filter by other properties like usename (user name), datname (database name), and more.
After you get the session's PID, terminate it by using the following query:
SELECT pg_terminate_backend(pid);
Terminating the pid ends that session on the worker node.
The same distributed query running on different worker nodes shares the same global_pid. In that case, you can terminate the long-running transaction on all worker nodes by using the global_pid:
SELECT pg_terminate_backend(global_pid);
Note
To automatically terminate long-running transactions, set the server parameters statement_timeout and idle_in_transaction_session_timeout.
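For example, the following settings cancel any statement that runs longer than 60 seconds and close sessions that sit idle in a transaction for more than 10 minutes. The database name and values are illustrative, and the parameters can also be set at the server or role level:

```sql
-- Apply to a specific database; new connections pick up the settings.
ALTER DATABASE mydb SET statement_timeout = '60s';
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '10min';
```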
Clearing bloat
A short-term solution is to manually vacuum and then analyze the tables where slow queries appear:
VACUUM ANALYZE <table>;
Managing connections
If your application uses many short-lived connections or many connections that stay idle for most of their life, consider using a connection pooler like PgBouncer.
PgBouncer, a built-in connection pooler
Elastic clusters in Azure Database for PostgreSQL offer PgBouncer as a built-in connection pooling solution. For more information, see the connection pooler documentation and connection handling best practices with PostgreSQL.
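Connecting through the built-in PgBouncer typically only requires changing the port from 5432 to 6432. The host, database, and user below are placeholders:

```shell
# Pooled connection through the built-in PgBouncer (port 6432 instead of 5432).
psql "host=<cluster-name>.postgres.database.azure.com port=6432 dbname=<database> user=<user> sslmode=require"
```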