Azure SQL 数据仓库 - 大规模并行处理 (MPP) 体系结构Azure SQL Data Warehouse - Massively parallel processing (MPP) architecture

了解 Azure SQL 数据仓库如何将大规模并行处理 (MPP) 与 Azure 存储结合,实现高性能和可伸缩性。Learn how Azure SQL Data Warehouse combines massively parallel processing (MPP) with Azure storage to achieve high performance and scalability.

MPP 体系结构组件MPP architecture components

SQL 数据仓库利用向外扩展体系结构在多个节点间分布数据的计算处理。SQL Data Warehouse leverages a scale out architecture to distribute computational processing of data across multiple nodes. 缩放单位是计算能力(称为数据仓库单位)的抽象概念。The unit of scale is an abstraction of compute power that is known as a data warehouse unit. SQL 数据仓库将计算与存储分离开来,使用户能够独立于系统中的数据进行规模计算。SQL Data Warehouse separates compute from storage which enables you to scale compute independently of the data in your system.

SQL 数据仓库体系结构

SQL 数据仓库使用基于节点的体系结构。SQL Data Warehouse uses a node-based architecture. 应用程序将 T-SQL 命令连接到、发布给控制节点,该节点是数据仓库的单一入口点。Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for the data warehouse. 控制节点运行用于优化并行处理查询的 MPP 引擎,然后将操作传递给计算节点以实现并行工作。The Control node runs the MPP engine which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel. 计算节点将所有用户数据存储在 Azure 存储中并运行并行查询。The Compute nodes store all user data in Azure Storage and run the parallel queries. 数据移动服务 (DMS) 是一项系统级内部服务,它根据需要在节点间移动数据以并行运行查询和返回准确的结果。The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

使用脱偶联的存储和计算,SQL 数据仓库可以:With decoupled storage and compute, SQL Data Warehouse can:

  • 无论存储需求如何,都可独立计算大小。Independently size compute power irrespective of your storage needs.
  • 无需移动数据,即可增加或减少计算能力。Grow or shrink compute power without moving data.
  • 在保持数据不受影响的情况下暂停计算容量,因此只需为存储付费。Pause compute capacity while leaving data intact, so you only pay for storage.
  • 在操作期间恢复计算容量。Resume compute capacity during operational hours.

Azure 存储Azure storage

SQL 数据仓库使用 Azure 存储保护用户数据。SQL Data Warehouse uses Azure storage to keep your user data safe. 由于数据通过 Azure 存储进行存储和管理,因此 SQL 数据仓库对存储消耗单独收费。Since your data is stored and managed by Azure storage, SQL Data Warehouse charges separately for your storage consumption. 将数据本身分片到“分布区”中来优化系统性能 。The data itself is sharded into distributions to optimize the performance of the system. 可选择在定义表时用于分布数据的分片模式。You can choose which sharding pattern to use to distribute the data when you define the table. SQL 数据仓库支持以下分片模式:SQL Data Warehouse supports these sharding patterns:

  • 哈希Hash
  • 轮循机制Round Robin
  • 复制Replicate

控制节点Control node

控制节点是数据仓库的核心。The Control node is the brain of the data warehouse. 它是与所有应用程序和连接进行交互的前端。It is the front end that interacts with all applications and connections. MPP 引擎在控制节点上运行以优化和协调并行查询。The MPP engine runs on the Control node to optimize and coordinate parallel queries. 将 T-SQL 查询提交到 SQL 数据仓库时,控制节点会将其转换为可针对每个分布区并行运行的查询。When you submit a T-SQL query to SQL Data Warehouse, the Control node transforms it into queries that run against each distribution in parallel.

计算节点Compute nodes

计算节点提供计算能力。The Compute nodes provide the computational power. 分布区映射到计算节点以进行处理。Distributions map to Compute nodes for processing. 如果支付更多计算资源费用,SQL 数据仓库会将分布区重新映射到可用的计算节点。As you pay for more compute resources, SQL Data Warehouse re-maps the distributions to the available Compute nodes. 计算节点数的范围是 1 到 60,它由数据仓库的服务级别确定。The number of compute nodes ranges from 1 to 60, and is determined by the service level for the data warehouse.

每个计算节点均有一个节点 ID,该 ID 会显示在系统视图中。Each Compute node has a node ID that is visible in system views. 在名称以 sys.pdw_nodes 开头的系统视图中找到 node_id 列即可查看计算节点 ID。You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes. 有关这些系统视图的列表,请参阅 MPP 系统视图For a list of these system views, see MPP system views.

数据移动服务Data Movement Service

数据移动服务 (DMS) 是一项数据传输技术,它可协调计算节点间的数据移动。Data Movement Service (DMS) is the data transport technology that coordinates data movement between the Compute nodes. 某些查询需要移动数据以确保并行查询返回准确的结果。Some queries require data movement to ensure the parallel queries return accurate results. 需要移动数据时,DMS 可确保正确的数据到达正确的位置。When data movement is required, DMS ensures the right data gets to the right location.


分布区是存储和处理针对分布式数据运行的并行查询的基本单位。A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. SQL 数据仓库运行查询时,工作会被分割成 60 个并行运行的小型查询。When SQL Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. 每个小型查询各在一个数据分布区上运行。Each of the 60 smaller queries runs on one of the data distributions. 每个计算节点管理其中一个或多个分布区。Each Compute node manages one or more of the 60 distributions. 具有最多计算资源的数据仓库的每个分布区占 1 个计算节点。A data warehouse with maximum compute resources has one distribution per Compute node. 具有最少计算资源的数据仓库的所有分布区都在 1 个计算节点上。A data warehouse with minimum compute resources has all the distributions on one compute node.

哈希分布表Hash-distributed tables

哈希分布表可为大型表上的联接和聚合提供最高查询性能。A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.

为了将数据分片到哈希分布式表中,SQL 数据仓库使用哈希函数明确将 1 个行分配到 1 个分布区。To shard data into a hash-distributed table, SQL Data Warehouse uses a hash function to deterministically assign each row to one distribution. 在表定义中,可以将一个列指定为分布列。In the table definition, one of the columns is designated as the distribution column. 哈希函数使用分布列中的值将 1 个行分配到 1 个分布区。The hash function uses the values in the distribution column to assign each row to a distribution.

下图说明了如何将完整的非分布式表存储为哈希分布表。The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

分布式表Distributed table

  • 一个行属于一个分布区。Each row belongs to one distribution.
  • 通过确定性哈希算法将一个行分配到一个分布区。A deterministic hash algorithm assigns each row to one distribution.
  • 不同大小的表显示,每个分布区的表行的数目各不相同。The number of table rows per distribution varies as shown by the different sizes of tables.

选择分布列时需考虑到性能,例如特异性、数据倾斜,以及在系统上运行的查询类型。There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.

轮循分布表Round-robin distributed tables

轮循机制表是最简单的表,在被用作负载临时表时,它可创造和提供高速性能。A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

轮循机制分布表在表中均匀分布数据,但不会进行进一步优化。A round-robin distributed table distributes data evenly across the table but without any further optimization. 首先随机选择一个分布区,然后将行的缓冲区按顺序分配给分布区。A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. 将数据加载到轮循机制表速度很快,但就查询性能而言,哈希分布式表的性能更佳。It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. 轮循机制表上的联接要求重新安排数据,因此这需要花费更多时间。Joins on round-robin tables require reshuffling data and this takes additional time.

复制表Replicated Tables

复制表为小型表提供最快查询性能。A replicated table provides the fastest query performance for small tables.

复制表在每个计算节点上缓存表的完整副本。A table that is replicated caches a full copy of the table on each compute node. 因此复制表以后,无需在执行联接或聚合前在计算节点中间传输数据。Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. 复制表尤为适用于小型表。Replicated tables are best utilized with small tables. 它需要额外存储并且在写入数据时会产生额外负载,因此不适用于大型表。Extra storage is required and there is additional overhead that is incurred when writing data which make large tables impractical.

下图显示了一个复制表。The following diagram shows a replicated table. 在 SQL 数据仓库中,会在每个计算节点的第一个分布区上缓存复制表。For SQL Data Warehouse, the replicated table is cached on the first distribution on each compute node.

Replicated tableReplicated table

后续步骤Next steps

对 SQL 数据仓库有了初步的认识后,请继续了解如何快速创建 SQL 数据仓库加载示例数据Now that you know a bit about SQL Data Warehouse, learn how to quickly create a SQL Data Warehouse and load sample data. 如果不熟悉 Azure,遇到新术语时,Azure 词汇表可以提供帮助。If you are new to Azure, you may find the Azure glossary helpful as you encounter new terminology. 或者,查看一下以下一些其他 SQL 数据仓库资源。Or look at some of these other SQL Data Warehouse Resources.