使用 DMV 监视 Azure Synapse Analytics SQL 池工作负载Monitor your Azure Synapse Analytics SQL pool workload using DMVs

本文介绍如何使用动态管理视图 (DMV) 监视工作负载,包括调查 SQL 池中的查询执行情况。This article describes how to use Dynamic Management Views (DMVs) to monitor your workload including investigating query execution in SQL pool.

权限Permissions

若要查询本文中的 DMV,需要 VIEW DATABASE STATE 或 CONTROL 权限。To query the DMVs in this article, you need either VIEW DATABASE STATE or CONTROL permission. 通常情况下,首选授予 VIEW DATABASE STATE 权限,因为该权限的限制要大得多。Usually, granting VIEW DATABASE STATE is the preferred permission as it is much more restrictive.

GRANT VIEW DATABASE STATE TO myuser;

监视连接Monitor connections

所有到数据仓库的登录都会记录到 sys.dm_pdw_exec_sessions 中。All logins to your data warehouse are logged to sys.dm_pdw_exec_sessions. 此 DMV 包含最后 10,000 个登录。This DMV contains the last 10,000 logins. session_id 是主键,每次进行新的登录时按顺序分配。The session_id is the primary key and is assigned sequentially for each new logon.

-- Other Active Connections
SELECT * FROM sys.dm_pdw_exec_sessions where status <> 'Closed' and session_id <> session_id();

监视查询执行Monitor query execution

所有在 SQL 池上执行的查询都会记录到 sys.dm_pdw_exec_requests 中。All queries executed on SQL pool are logged to sys.dm_pdw_exec_requests. 此 DMV 包含最后 10,000 个执行的查询。This DMV contains the last 10,000 queries executed. request_id 对每个查询进行唯一标识,并且为此 DMV 的主键。The request_id uniquely identifies each query and is the primary key for this DMV. request_id 在每次进行新的查询时按顺序分配,并会加上前缀 QID,代表查询 ID。The request_id is assigned sequentially for each new query and is prefixed with QID, which stands for query ID. 针对给定 session_id 查询此 DMV 将显示给定登录的所有查询。Querying this DMV for a given session_id shows all queries for a given logon.

Note

存储过程使用多个请求 ID。Stored procedures use multiple Request IDs. 按先后顺序分配请求 ID。Request IDs are assigned in sequential order.

以下是调查特定查询的查询执行计划和时间所要遵循的步骤。Here are steps to follow to investigate query execution plans and times for a particular query.

步骤 1:确定想要调查的查询STEP 1: Identify the query you wish to investigate

-- Monitor active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE status not in ('Completed','Failed','Cancelled')
  AND session_id <> session_id()
ORDER BY submit_time DESC;

-- Find top 10 queries longest running queries
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

从前面的查询结果中,记下想要调查的查询的 请求 IDFrom the preceding query results, note the Request ID of the query that you would like to investigate.

由于存在大量活动的运行查询,因此处于“挂起” 状态的查询可以排队。Queries in the Suspended state can be queued due to a large number of active running queries. 这些查询也出现在类型为 UserConcurrencyResourceType 的 sys.dm_pdw_waits 等待查询中。These queries also appear in the sys.dm_pdw_waits waits query with a type of UserConcurrencyResourceType. 有关并发限制的信息,请参阅内存和并发限制用于工作负载管理的资源类For information on concurrency limits, see Memory and concurrency limits or Resource classes for workload management. 查询也可能因其他原因(如对象锁定)处于等待状态。Queries can also wait for other reasons such as for object locks. 如果查询正在等待资源,请参阅本文后面的调查等待资源的查询If your query is waiting for a resource, see Investigating queries waiting for resources further down in this article.

为了简化查询在 sys.dm_pdw_exec_requests 表中的查找,请使用 LABEL 为查询分配一条注释,可以在 sys.dm_pdw_exec_requests 视图中查询到该注释。To simplify the lookup of a query in the sys.dm_pdw_exec_requests table, use LABEL to assign a comment to your query, which can be looked up in the sys.dm_pdw_exec_requests view.

-- Query with Label
SELECT *
FROM sys.tables
OPTION (LABEL = 'My Query')
;

-- Find a query with the Label 'My Query'
-- Use brackets when querying the label column, as it it a key word
SELECT  *
FROM    sys.dm_pdw_exec_requests
WHERE   [label] = 'My Query';

步骤 2:调查查询计划STEP 2: Investigate the query plan

使用请求 ID 从 sys.dm_pdw_request_steps 检索查询的分布式 SQL (DSQL) 计划Use the Request ID to retrieve the query's distributed SQL (DSQL) plan from sys.dm_pdw_request_steps

-- Find the distributed query plan steps for a specific query.
-- Replace request_id with value from Step 1.

SELECT * FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID####'
ORDER BY step_index;

当 DSQL 计划的执行时间超出预期时,原因可能是计划很复杂,包含许多 DSQL 步骤,也可能是一个步骤占用很长的时间。When a DSQL plan is taking longer than expected, the cause can be a complex plan with many DSQL steps or just one step taking a long time. 如果计划有很多步骤,包含多个移动操作,可考虑优化表分布,减少数据移动。If the plan is many steps with several move operations, consider optimizing your table distributions to reduce data movement. 表分布一文说明了为何必须移动数据才能解决查询。The Table distribution article explains why data must be moved to solve a query. 该文还介绍了一些用于最大程度减少数据移动的分布策略。The article also explains some distribution strategies to minimize data movement.

若要进一步调查单个步骤的详细信息,可检查长时间运行的查询步骤的 operation_type 列并记下步骤索引To investigate further details about a single step, the operation_type column of the long-running query step and note the Step Index:

  • 针对 SQL 操作 继续执行步骤 3a:OnOperation、RemoteOperation、ReturnOperation。Proceed with Step 3a for SQL operations: OnOperation, RemoteOperation, ReturnOperation.
  • 针对数据移动操作 继续执行步骤 3b:ShuffleMoveOperation、BroadcastMoveOperation、TrimMoveOperation、PartitionMoveOperation、MoveOperation、CopyOperation。Proceed with Step 3b for Data Movement operations: ShuffleMoveOperation, BroadcastMoveOperation, TrimMoveOperation, PartitionMoveOperation, MoveOperation, CopyOperation.

步骤 3:调查分布式数据库上的 SQLSTEP 3: Investigate SQL on the distributed databases

使用请求 ID 和步骤索引从 sys.dm_pdw_sql_requests 中检索详细信息,其中包含所有分布式数据库上的查询步骤的执行信息。Use the Request ID and the Step Index to retrieve details from sys.dm_pdw_sql_requests, which contains execution information of the query step on all of the distributed databases.

-- Find the distribution run times for a SQL step.
-- Replace request_id and step_index with values from Step 1 and 3.

SELECT * FROM sys.dm_pdw_sql_requests
WHERE request_id = 'QID####' AND step_index = 2;

当查询步骤正在运行时,可以使用 DBCC PDW_SHOWEXECUTIONPLAN 从 SQL Server 计划缓存中检索 SQL Server 估计计划,了解在特定分布基础上运行的步骤。When the query step is running, DBCC PDW_SHOWEXECUTIONPLAN can be used to retrieve the SQL Server estimated plan from the SQL Server plan cache for the step running on a particular distribution.

-- Find the SQL Server execution plan for a query running on a specific SQL pool or control node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(1, 78);

步骤 4:调查在分布式数据库上进行的数据移动STEP 4: Investigate data movement on the distributed databases

使用请求 ID 和步骤索引检索在 sys.dm_pdw_dms_workers 中的每个分布上运行的数据移动步骤的相关信息。Use the Request ID and the Step Index to retrieve information about a data movement step running on each distribution from sys.dm_pdw_dms_workers.

-- Find information about all the workers completing a Data Movement Step.
-- Replace request_id and step_index with values from Step 1 and 3.

SELECT * FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID####' AND step_index = 2;
  • 检查 total_elapsed_time 列,以查看是否有特定分布在数据移动上比其他分布花费了更多时间。Check the total_elapsed_time column to see if a particular distribution is taking significantly longer than others for data movement.
  • 对于长时间运行的分布,请检查 rows_processed 列,以查看从该分布移动的行数是否远远多于其他分布。For the long-running distribution, check the rows_processed column to see if the number of rows being moved from that distribution is significantly larger than others. 如果是这样,此发现可能指示基础数据倾斜。If so, this finding might indicate skew of your underlying data.

如果查询正在运行,则可以使用 DBCC PDW_SHOWEXECUTIONPLAN 从 SQL Server 计划缓存中检索 SQL Server 估计计划,以了解特定分布中当前正在运行的 SQL 步骤。If the query is running, you can use DBCC PDW_SHOWEXECUTIONPLAN to retrieve the SQL Server estimated plan from the SQL Server plan cache for the currently running SQL Step within a particular distribution.

-- Find the SQL Server estimated plan for a query running on a specific SQL pool Compute or control node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(55, 238);

监视正在等待的查询Monitor waiting queries

如果查询未取得进展(因其正在等待资源),下面是显示查询正在等待的所有资源的查询。If you discover that your query is not making progress because it is waiting for a resource, here is a query that shows all the resources a query is waiting for.

-- Find queries
-- Replace request_id with value from Step 1.

SELECT waits.session_id,
      waits.request_id,  
      requests.command,
      requests.status,
      requests.start_time,  
      waits.type,
      waits.state,
      waits.object_type,
      waits.object_name
FROM   sys.dm_pdw_waits waits
   JOIN  sys.dm_pdw_exec_requests requests
   ON waits.request_id=requests.request_id
WHERE waits.request_id = 'QID####'
ORDER BY waits.object_name, waits.object_type, waits.state;

如果查询正在主动等待另一个查询中的资源,则状态将为 AcquireResourcesIf the query is actively waiting on resources from another query, then the state will be AcquireResources. 如果查询具有全部所需资源,则状态为 GrantedIf the query has all the required resources, then the state will be Granted.

监视 tempdbMonitor tempdb

tempdb 用于在查询执行期间保存临时结果。Tempdb is used to hold intermediate results during query execution. 重度使用 tempdb 数据库可能会导致查询性能变慢。High utilization of the tempdb database can lead to slow query performance. 对于配置的每个 DW100c,将分配 399 GB 的 tempdb 空间(DW1000c 的总 tempdb 空间为 3.99 TB)。For every DW100c configured, 399 GB of tempdb space is allocated (DW1000c would have 3.99 TB of total tempdb space). 下面是有关监视 tempdb 用量以及在查询中减少 tempdb 用量的提示。Below are tips for monitoring tempdb usage and for decreasing tempdb usage in your queries.

使用视图监视 tempdbMonitoring tempdb with views

若要监视 tempdb 用量,请先从适用于 SQL 池的 Microsoft 工具包安装 microsoft.vw_sql_requests 视图。To monitor tempdb usage, first install the microsoft.vw_sql_requests view from the Microsoft Toolkit for SQL pool. 然后可执行以下查询,以查看在每个节点中执行的所有查询所消耗的 tempdb 用量:You can then execute the following query to see the tempdb usage per node for all executed queries:

-- Monitor tempdb
SELECT
    sr.request_id,
    ssu.session_id,
    ssu.pdw_node_id,
    sr.command,
    sr.total_elapsed_time,
    es.login_name AS 'LoginName',
    DB_NAME(ssu.database_id) AS 'DatabaseName',
    (es.memory_usage * 8) AS 'MemoryUsage (in KB)',
    (ssu.user_objects_alloc_page_count * 8) AS 'Space Allocated For User Objects (in KB)',
    (ssu.user_objects_dealloc_page_count * 8) AS 'Space Deallocated For User Objects (in KB)',
    (ssu.internal_objects_alloc_page_count * 8) AS 'Space Allocated For Internal Objects (in KB)',
    (ssu.internal_objects_dealloc_page_count * 8) AS 'Space Deallocated For Internal Objects (in KB)',
    CASE es.is_user_process
    WHEN 1 THEN 'User Session'
    WHEN 0 THEN 'System Session'
    END AS 'SessionType',
    es.row_count AS 'RowCount'
FROM sys.dm_pdw_nodes_db_session_space_usage AS ssu
    INNER JOIN sys.dm_pdw_nodes_exec_sessions AS es ON ssu.session_id = es.session_id AND ssu.pdw_node_id = es.pdw_node_id
    INNER JOIN sys.dm_pdw_nodes_exec_connections AS er ON ssu.session_id = er.session_id AND ssu.pdw_node_id = er.pdw_node_id
    INNER JOIN microsoft.vw_sql_requests AS sr ON ssu.session_id = sr.spid AND ssu.pdw_node_id = sr.pdw_node_id
WHERE DB_NAME(ssu.database_id) = 'tempdb'
    AND es.session_id <> @@SPID
    AND es.login_name <> 'sa'
ORDER BY sr.request_id;

如果某个查询消耗了大量内存或收到了与 tempdb 分配相关的错误消息,原因可能是运行的极大规模 CREATE TABLE AS SELECT (CTAS)INSERT SELECT 语句在最终的数据移动操作期间失败。If you have a query that is consuming a large amount of memory or have received an error message related to allocation of tempdb, it could be due to a very large CREATE TABLE AS SELECT (CTAS) or INSERT SELECT statement running that is failing in the final data movement operation. 在分布式查询计划中,紧靠在最后一条 INSERT SELECT 前面的语句会将此操作识别为 ShuffleMove 操作。This can usually be identified as a ShuffleMove operation in the distributed query plan right before the final INSERT SELECT. 使用 sys.dm_pdw_request_steps 监视 ShuffleMove 操作。Use sys.dm_pdw_request_steps to monitor ShuffleMove operations.

最常见的缓解措施是将 CTAS 或 INSERT SELECT 语句分解为多个加载语句,使数据量不会超过每个节点不超过 1 TB tempdb 空间的限制。The most common mitigation is to break your CTAS or INSERT SELECT statement into multiple load statements so the data volume will not exceed the 1TB per node tempdb limit. 也可以将群集扩展到更大的大小,以便将 tempdb 大小分散到更多的节点,从而减少每个节点的 tempdb 空间量。You can also scale your cluster to a larger size which will spread the tempdb size across more nodes reducing the tempdb on each individual node.

除了 CTAS 和 INSERT SELECT 语句外,在内存不足的情况下运行的大型复杂查询可能会溢出到 tempdb 中,导致查询失败。In addition to CTAS and INSERT SELECT statements, large, complex queries running with insufficient memory can spill into tempdb causing queries to fail. 请考虑使用更大的资源类运行,以避免溢出到 tempdb。Consider running with a larger resource class to avoid spilling into tempdb.

监视内存Monitor memory

内存可能是性能缓慢和内存不足问题的根本原因。Memory can be the root cause for slow performance and out of memory issues. 如果发现 SQL Server 内存用量在执行查询的过程中达到其限制,请考虑扩展数据仓库。Consider scaling your data warehouse if you find SQL Server memory usage reaching its limits during query execution.

以下查询返回每个节点的 SQL Server 内存用量和内存压力:The following query returns SQL Server memory usage and memory pressure per node:

-- Memory consumption
SELECT
  pc1.cntr_value as Curr_Mem_KB,
  pc1.cntr_value/1024.0 as Curr_Mem_MB,
  (pc1.cntr_value/1048576.0) as Curr_Mem_GB,
  pc2.cntr_value as Max_Mem_KB,
  pc2.cntr_value/1024.0 as Max_Mem_MB,
  (pc2.cntr_value/1048576.0) as Max_Mem_GB,
  pc1.cntr_value * 100.0/pc2.cntr_value AS Memory_Utilization_Percentage,
  pc1.pdw_node_id
FROM
-- pc1: current memory
sys.dm_pdw_nodes_os_performance_counters AS pc1
-- pc2: total memory allowed for this SQL instance
JOIN sys.dm_pdw_nodes_os_performance_counters AS pc2
ON pc1.object_name = pc2.object_name AND pc1.pdw_node_id = pc2.pdw_node_id
WHERE
pc1.counter_name = 'Total Server Memory (KB)'
AND pc2.counter_name = 'Target Server Memory (KB)'

监视事务日志大小Monitor transaction log size

以下查询返回每个分布区的事务日志大小。The following query returns the transaction log size on each distribution. 如果其中一个日志文件将达到 160 GB,则应考虑纵向扩展实例或限制事务大小。If one of the log files is reaching 160 GB, you should consider scaling up your instance or limiting your transaction size.

-- Transaction log size
SELECT
  instance_name as distribution_db,
  cntr_value*1.0/1048576 as log_file_size_used_GB,
  pdw_node_id
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE
instance_name like 'Distribution_%'
AND counter_name = 'Log File(s) Used Size (KB)'

监视事务日志回滚Monitor transaction log rollback

如果查询失败或需要花费很长时间才能继续,则你可以检查并监视是否发生了任何事务回滚。If your queries are failing or taking a long time to proceed, you can check and monitor if you have any transactions rolling back.

-- Monitor rollback
SELECT
    SUM(CASE WHEN t.database_transaction_next_undo_lsn IS NOT NULL THEN 1 ELSE 0 END),
    t.pdw_node_id,
    nod.[type]
FROM sys.dm_pdw_nodes_tran_database_transactions t
JOIN sys.dm_pdw_nodes nod ON t.pdw_node_id = nod.pdw_node_id
GROUP BY t.pdw_node_id, nod.[type]

监视 PolyBase 负载Monitor PolyBase load

以下查询提供了负载进度的大致估计。The following query provides an approximate estimate of the progress of your load. 查询仅显示当前正在处理的文件。The query only shows files currently being processed.


-- To track bytes and files
SELECT
    r.command,
    s.request_id,
    r.status,
    count(distinct input_name) as nbr_files,
    sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
    sys.dm_pdw_exec_requests r
    inner join sys.dm_pdw_dms_external_work s
        on r.request_id = s.request_id
GROUP BY
    r.command,
    s.request_id,
    r.status
ORDER BY
    nbr_files desc,
    gb_processed desc;

后续步骤Next steps

有关 DMV 的详细信息,请参阅系统视图For more information about DMVs, see System views.