Azure Synapse Analytics(以前称为 SQL DW)的速查表Cheat sheet for Azure Synapse Analytics (formerly SQL DW)

此速查表提供有关生成 Azure Synapse 解决方案的有用提示和最佳做法。This cheat sheet provides helpful tips and best practices for building Azure Synapse solutions.

下图显示了设计数据仓库的过程:The following graphic shows the process of designing a data warehouse:

草图

跨表的查询和操作Queries and operations across tables

如果事先已知道要在数据仓库中运行的主要操作和查询,则可以针对这些操作指定数据仓库体系结构的优先级。When you know in advance the primary operations and queries to be run in your data warehouse, you can prioritize your data warehouse architecture for those operations. 这些查询和操作可能包括:These queries and operations might include:

  • 将一个或两个事实数据表与维度表相联接,筛选合并的表,然后将结果追加到数据市场。Joining one or two fact tables with dimension tables, filtering the combined table, and then appending the results into a data mart.
  • 对事实销售进行较大或较小的更新。Making large or small updates into your fact sales.
  • 仅将数据追加到表。Appending only data to your tables.

提前知道操作类型有助于优化表设计。Knowing the types of operations in advance helps you optimize the design of your tables.

数据迁移Data migration

首先,将数据加载到 Azure Blob 存储中。First, load your data into Azure Blob Storage. 接下来,使用 PolyBase 将数据载入临时表中。Next, use PolyBase to load your data into staging tables. 使用以下配置:Use the following configuration:

设计Design 建议Recommendation
分发Distribution 轮循机制Round Robin
索引Indexing Heap
分区Partitioning None
资源类Resource Class largerc 或 xlargerclargerc or xlargerc

详细了解数据迁移数据加载以及提取、加载和转换 (ELT) 过程Learn more about data migration, data loading, and the Extract, Load, and Transform (ELT) process.

分布式表或复制表Distributed or replicated tables

根据表属性使用以下策略:Use the following strategies, depending on the table properties:

类型Type 非常适合...Great fit for... 谨慎使用Watch out if...
复制Replicated * 星型架构中压缩后(约 5 倍压缩率)存储小于 2 GB 的小型维度表* Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression) * 表上有大量写入事务(例如:插入、更新插入、删除、更新)* Many write transactions are on table (such as insert, upsert, delete, update)

* 经常更改数据仓库单位 (DWU) 预配* You change Data Warehouse Units (DWU) provisioning frequently

* 仅使用 2-3 列,而表有许多列* You only use 2-3 columns but your table has many columns

* 对复制表编制索引* You index a replicated table
轮循机制(默认)Round Robin (default) * 临时表* Temporary/staging table

* 没有明显的联接键或合适的候选列* No obvious joining key or good candidate column
* 因数据移动而性能下降* Performance is slow due to data movement
哈希Hash * 事实数据表* Fact tables

* 大型维度表* Large dimension tables
* 不能更新分布键* The distribution key cannot be updated

提示:Tips:

  • 先使用轮循机制,但以哈希分布策略为最终目标,以便充分利用大规模并行体系结构。Start with Round Robin, but aspire to a hash distribution strategy to take advantage of a massively parallel architecture.
  • 确保常见的哈希键具有相同的数据格式。Make sure that common hash keys have the same data format.
  • 请勿以 varchar 格式进行分发。Don't distribute on varchar format.
  • 可以将具有常见哈希键的维度表哈希分布到具有频繁联接操作的事实数据表。Dimension tables with a common hash key to a fact table with frequent join operations can be hash distributed.
  • 使用 sys.dm_pdw_nodes_db_partition_stats 分析数据中的任何偏斜。Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
  • 使用 sys.dm_pdw_request_steps 分析查询背后的数据移动、监视时间广播以及随机选择操作需要。Use sys.dm_pdw_request_steps to analyze data movements behind queries, monitor the time broadcast, and shuffle operations take. 这有助于查看分布策略。This is helpful to review your distribution strategy.

详细了解复制表分布式表Learn more about replicated tables and distributed tables.

为表编制索引Index your table

索引有助于加速读取表。Indexing is helpful for reading tables quickly. 下面列出了一组独有的技术,可根据需要使用:There is a unique set of technologies that you can use based on your needs:

类型Type 非常适合...Great fit for... 谨慎使用Watch out if...
Heap * 临时表* Staging/temporary table

* 包含小规模查找的小型表* Small tables with small lookups
* 任何查找扫描全表* Any lookup scans the full table
聚集索引Clustered index * 最多包含 1 亿行的表* Tables with up to 100 million rows

* 仅 1-2 列被频繁使用的大型表(超过 1 亿行)* Large tables (more than 100 million rows) with only 1-2 columns heavily used
* 用于复制表* Used on a replicated table

* 涉及多个联接和 Group By 操作的复杂查询* You have complex queries involving multiple join and Group By operations

* 对索引列进行更新:它会占用内存* You make updates on the indexed columns: it takes memory
聚集列存储索引 (CCI)(默认)Clustered columnstore index (CCI) (default) * 大型表(超过 1 亿行)* Large tables (more than 100 million rows) * 用于复制表* Used on a replicated table

* 对表进行大规模更新操作* You make massive update operations on your table

* 对表过度分区:行组未横跨不同的分布节点和分区* You overpartition your table: row groups do not span across different distribution nodes and partitions

提示:Tips:

  • 除了聚集索引,可能还需要向经常用于筛选的列添加非聚集索引。On top of a clustered index, you might want to add a nonclustered index to a column heavily used for filtering.
  • 注意如何使用 CCI 管理表上的内存。Be careful how you manage the memory on a table with CCI. 加载数据时,你希望用户(或查询)受益于大型资源类。When you load data, you want the user (or the query) to benefit from a large resource class. 确保避免剪裁和创建许多经过压缩的小型行组。Make sure to avoid trimming and creating many small compressed row groups.
  • 在第 2 代上,CCI 表在计算节点上本地缓存,以最大限度地提高性能。On Gen2, CCI tables are cached locally on the compute nodes to maximize performance.
  • 对于 CCI,可能因行组压缩不当而出现性能下降的情况。For CCI, slow performance can happen due to poor compression of your row groups. 如果发生此情况,请重新生成或重新整理 CCI。If this occurs, rebuild or reorganize your CCI. 你希望每个压缩后的行组包含至少 10 万行。You want at least 100,000 rows per compressed row groups. 理想状态为一个行组 100 万行。The ideal is 1 million rows in a row group.
  • 基于增量加载频率和大小,你想自动执行索引的重新整理或重新生成操作。Based on the incremental load frequency and size, you want to automate when you reorganize or rebuild your indexes. 彻底清理操作始终有用。Spring cleaning is always helpful.
  • 想要剪裁行组时应更具战略性。Be strategic when you want to trim a row group. 打开的行组有多大?How large are the open row groups? 未来几天希望加载多少数据?How much data do you expect to load in the coming days?

详细了解索引Learn more about indexes.

分区Partitioning

如果拥有一个大型事实数据表(超过 10 亿行的表),可能需要对表分区。You might partition your table when you have a large fact table (greater than 1 billion rows). 在 99% 的情况下,分区键应基于日期。In 99 percent of cases, the partition key should be based on date. 注意不要过度分区,尤其是在具有聚集列存储索引时。Be careful to not overpartition, especially when you have a clustered columnstore index.

对于需要 ELT 的临时表,可从分区中受益。With staging tables that require ELT, you can benefit from partitioning. 它使数据生命周期管理更便捷。It facilitates data lifecycle management. 注意不要对数据过度分区,尤其是对聚集列存储索引。Be careful not to overpartition your data, especially on a clustered columnstore index.

详细了解分区Learn more about partitions.

增量加载Incremental load

若要增量加载数据,请先确保分配更大的资源类来加载数据。If you're going to incrementally load your data, first make sure that you allocate larger resource classes to loading your data. 当加载到具有聚集列存储索引的表中时,这一点尤其重要。This is particularly important when loading into tables with clustered columnstore indexes. 有关更多详细信息,请参阅资源类See resource classes for further details.

建议使用 PolyBase 和 ADF V2 在数据仓库中自动执行 ELT 管道。We recommend using PolyBase and ADF V2 for automating your ELT pipelines into your data warehouse.

对于历史数据中的大量更新,请考虑使用 CTAS 将要保留的数据写入表中,而不是使用 INSERT、UPDATE 和 DELETE。For a large batch of updates in your historical data, consider using a CTAS to write the data you want to keep in a table rather than using INSERT, UPDATE, and DELETE.

维护统计信息Maintain statistics

对数据做重大更改时务必更新统计信息。It's important to update statistics as significant changes happen to your data. 请参阅更新统计信息,确定是否发生重大更改。See update statistics to determine if significant changes have occurred. 更新的统计信息可以优化查询计划。Updated statistics optimize your query plans. 如果发现维护所有统计信息所需时间太长,请更谨慎地选择包含统计信息的列。If you find that it takes too long to maintain all of your statistics, be more selective about which columns have statistics.

还可以定义更新频率。You can also define the frequency of the updates. 例如,可能想要更新每天都要添加新值的日期列。For example, you might want to update date columns, where new values might be added, on a daily basis. 对涉及联接的列、WHERE 子句中使用的列、在 GROUP BY 中找到的列进行信息统计,可以获得最大效益。You gain the most benefit by having statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.

详细了解统计信息Learn more about statistics.

资源类Resource class

资源组用作将内存分配给查询的一种方式。Resource groups are used as a way to allocate memory to queries. 如果需要更多内存来提高查询或加载速度,应分配更大的资源类。If you need more memory to improve query or loading speed, you should allocate higher resource classes. 但另一方面,使用更大的资源类会影响并发。On the flip side, using larger resource classes impacts concurrency. 在将所有用户移动到大型资源类之前,你希望把这点纳入考虑范围。You want to take that into consideration before moving all of your users to a large resource class.

如果发现查询所需时间过长,请确保用户未在大型资源类中运行。If you notice that queries take too long, check that your users do not run in large resource classes. 大型资源类会占用许多并发槽。Large resource classes consume many concurrency slots. 它们可能导致其他查询排队等待。They can cause other queries to queue up.

最后,通过使用第 2 代的 SQL 池,每个资源类可比第 1 代获得多 2.5 倍的内存。Finally, by using Gen2 of SQL pool, each resource class gets 2.5 times more memory than Gen1.

详细了解如何使用资源类和并发Learn more how to work with resource classes and concurrency.

降低成本Lower your cost

Azure Synapse 的一个重要功能是可以管理计算资源A key feature of Azure Synapse is the ability to manage compute resources. 你可在不使用 SQL 池时进行暂停,这会停止计算资源的计费。You can pause SQL pool when you're not using it, which stops the billing of compute resources. 可以缩放资源以满足性能需求。You can scale resources to meet your performance demands. 若要暂停,请使用 Azure 门户PowerShellTo pause, use the Azure portal or PowerShell. 若要缩放,请使用 Azure 门户PowershellT-SQLREST APITo scale, use the Azure portal, Powershell, T-SQL, or a REST API.

优化体系结构以提高性能Optimize your architecture for performance

建议考虑在中心辐射型体系结构中使用 SQL 数据库和 Azure Analysis Services。We recommend considering SQL Database and Azure Analysis Services in a hub-and-spoke architecture. 该解决方案可以在不同的用户组之间提供工作负荷隔离,同时还能使用 SQL 数据库和 Azure Analysis Services 的高级安全功能。This solution can provide workload isolation between different user groups while also using advanced security features from SQL Database and Azure Analysis Services. 这也是一种向用户提供无限并发的方式。This is also a way to provide limitless concurrency to your users.

详细了解利用 Azure Synapse 的典型体系结构Learn more about typical architectures that take advantage of Azure Synapse.