Azure Synapse Analytics 中的专用 SQL 池(以前称为 SQL DW)体系结构Dedicated SQL pool (formerly SQL DW) architecture in Azure Synapse Analytics

Azure Synapse Analytics 是一种分析服务,它将企业数据仓库和大数据分析结合在一起。Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and Big Data analytics. 它使你可以自由地根据你的条件查询数据。It gives you the freedom to query data on your terms.

Synapse SQL 体系结构组件Synapse SQL architecture components

专用 SQL 池(以前称为 SQL DW)利用横向扩展体系结构将数据的计算处理分布在多个节点上。Dedicated SQL pool (formerly SQL DW) leverages a scale-out architecture to distribute computational processing of data across multiple nodes. 缩放单位是计算能力(称为数据仓库单位)的抽象概念。The unit of scale is an abstraction of compute power that is known as a data warehouse unit. 计算与存储分离开来,以便用户能够独立于系统中的数据进行缩放计算。Compute is separate from storage, which enables you to scale compute independently of the data in your system.

专用 SQL 池(以前称为 SQL DW)体系结构

专用 SQL 池(以前称为 SQL DW)使用基于节点的体系结构。Dedicated SQL pool (formerly SQL DW) uses a node-based architecture. 应用程序连接到控制节点并将 T-SQL 命令发送到控制节点。Applications connect and issue T-SQL commands to a Control node. 控制节点托管分布式查询引擎(用于优化并行处理查询),然后将操作传递给计算节点以完成并行工作。The Control node hosts the distributed query engine, which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

计算节点将所有用户数据存储在 Azure 存储中并运行并行查询。The Compute nodes store all user data in Azure Storage and run the parallel queries. 数据移动服务 (DMS) 是一项系统级内部服务,它根据需要在节点间移动数据以并行运行查询和返回准确的结果。The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

使用分离的存储和计算,用户可以在使用专用 SQL 池(以前称为 SQL DW)时执行以下操作:With decoupled storage and compute, when using a dedicated SQL pool (formerly SQL DW) one can:

  • 无论存储需求如何,都可独立计算大小。Independently size compute power irrespective of your storage needs.
  • 在专用 SQL 池(以前称为 SQL DW)中增加或减少计算能力,无需移动数据。Grow or shrink compute power, within a dedicated SQL pool (formerly SQL DW), without moving data.
  • 在保持数据不受影响的情况下暂停计算容量,因此只需为存储付费。Pause compute capacity while leaving data intact, so you only pay for storage.
  • 在操作期间恢复计算容量。Resume compute capacity during operational hours.

Azure 存储Azure Storage

专用 SQL 池 SQL(以前称为 SQL DW)利用 Azure 存储来保持用户数据安全。Dedicated SQL pool SQL (formerly SQL DW) leverages Azure Storage to keep your user data safe. 由于数据通过 Azure 存储进行存储和管理,因此会对存储消耗单独收费。Since your data is stored and managed by Azure Storage, there is a separate charge for your storage consumption. 将数据分片到“分布区”中来优化系统性能。The data is sharded into distributions to optimize the performance of the system. 可选择在定义表时用于分布数据的分片模式。You can choose which sharding pattern to use to distribute the data when you define the table. 支持以下分片模式:These sharding patterns are supported:

  • 哈希Hash
  • 轮循机制Round Robin
  • 复制Replicate

控制节点Control node

控制节点是体系结构的核心。The Control node is the brain of the architecture. 它是与所有应用程序和连接进行交互的前端。It is the front end that interacts with all applications and connections. 分布式查询引擎在控制节点上运行,可优化和协调并行查询。The distributed query engine runs on the Control node to optimize and coordinate parallel queries. 提交 T-SQL 查询时,控制节点会将其转换为可针对每个分布区并行运行的查询。When you submit a T-SQL query, the Control node transforms it into queries that run against each distribution in parallel.

计算节点Compute nodes

计算节点提供计算能力。The Compute nodes provide the computational power. 分布区映射到计算节点以进行处理。Distributions map to Compute nodes for processing. 当你为更多计算资源付费时,分布区将重新映射到可用的计算节点。As you pay for more compute resources, distributions are remapped to available Compute nodes. 计算节点数的范围是 1 到 60,它由 Synapse SQL 的服务级别确定。The number of compute nodes ranges from 1 to 60, and is determined by the service level for Synapse SQL.

每个计算节点均有一个节点 ID,该 ID 会显示在系统视图中。Each Compute node has a node ID that is visible in system views. 在名称以 sys.pdw_nodes 开头的系统视图中找到 node_id 列即可查看计算节点 ID。You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes. 有关这些系统视图的列表,请参阅 Synapse SQL 系统视图For a list of these system views, see Synapse SQL system views.

数据移动服务Data Movement Service

数据移动服务 (DMS) 是一项数据传输技术,它可协调计算节点间的数据移动。Data Movement Service (DMS) is the data transport technology that coordinates data movement between the Compute nodes. 某些查询需要移动数据以确保并行查询返回准确的结果。Some queries require data movement to ensure the parallel queries return accurate results. 需要移动数据时,DMS 可确保正确的数据到达正确的位置。When data movement is required, DMS ensures the right data gets to the right location.

分发Distributions

分布区是存储和处理针对分布式数据运行的并行查询的基本单位。A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. Synapse SQL 运行查询时,工作会被分割成 60 个并行运行的小型查询。When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.

每个小型查询各在一个数据分布区上运行。Each of the 60 smaller queries runs on one of the data distributions. 每个计算节点管理其中一个或多个分布区。Each Compute node manages one or more of the 60 distributions. 具有最多计算资源的专用 SQL 池(以前称为 SQL DW)的每个分布区占 1 个计算节点。A dedicated SQL pool (formerly SQL DW) with maximum compute resources has one distribution per Compute node. 具有最小计算资源的专用 SQL 池(以前称为 SQL DW)的所有分布区都在 1 个计算节点上。A dedicated SQL pool (formerly SQL DW) with minimum compute resources has all the distributions on one compute node.

哈希分布表Hash-distributed tables

哈希分布表可为大型表上的联接和聚合提供最高查询性能。A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.

为了将数据分片到哈希分布式表中,使用哈希函数明确将 1 个行分配到 1 个分布区。To shard data into a hash-distributed table, a hash function is used to deterministically assign each row to one distribution. 在表定义中,可以将一个列指定为分布列。In the table definition, one of the columns is designated as the distribution column. 哈希函数使用分布列中的值将 1 个行分配到 1 个分布区。The hash function uses the values in the distribution column to assign each row to a distribution.

下图说明了如何将完整的非分布式表存储为哈希分布表。The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

分布式表Distributed table

  • 一个行属于一个分布区。Each row belongs to one distribution.
  • 通过确定性哈希算法将一个行分配到一个分布区。A deterministic hash algorithm assigns each row to one distribution.
  • 不同大小的表显示,每个分布区的表行的数目各不相同。The number of table rows per distribution varies as shown by the different sizes of tables.

选择分布列时需考虑到性能,例如特异性、数据倾斜,以及在系统上运行的查询类型。There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.

轮循分布表Round-robin distributed tables

轮循机制表是最简单的表,在被用作负载临时表时,它可创造和提供高速性能。A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

轮循机制分布表在表中均匀分布数据,但不会进行进一步优化。A round-robin distributed table distributes data evenly across the table but without any further optimization. 首先随机选择一个分布区,然后将行的缓冲区按顺序分配给分布区。A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. 将数据加载到轮循机制表速度很快,但就查询性能而言,哈希分布式表的性能更佳。It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. 轮循机制表上的联接要求重新安排数据,这需要花费更多时间。Joins on round-robin tables require reshuffling data, which takes additional time.

复制表Replicated Tables

复制表为小型表提供最快查询性能。A replicated table provides the fastest query performance for small tables.

复制表在每个计算节点上缓存表的完整副本。A table that is replicated caches a full copy of the table on each compute node. 因此复制表以后,无需在执行联接或聚合前在计算节点中间传输数据。Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. 复制表尤为适用于小型表。Replicated tables are best utilized with small tables. 它需要额外存储并且在写入数据时会产生额外开销,因此不适用于大型表。Extra storage is required and there is additional overhead that is incurred when writing data, which make large tables impractical.

下图显示会在每个计算节点的第一个分布区上缓存复制表。The diagram below shows a replicated table that is cached on the first distribution on each compute node.

复制表Replicated table

后续步骤Next steps

对 Azure Synapse 有了初步的认识后,请学习如何快速创建专用 SQL 池(以前称为 SQL DW)加载示例数据Now that you know a bit about Azure Synapse, learn how to quickly create a dedicated SQL pool (formerly SQL DW) and load sample data. 如果不熟悉 Azure,遇到新术语时,Azure 词汇表 可以提供帮助。If you are new to Azure, you may find the Azure glossary helpful as you encounter new terminology. 或者,查看以下一些其他 Azure Synapse 资源。Or look at some of these other Azure Synapse Resources.