Azure Synapse Analytics(以前称为 SQL DW)体系结构Azure Synapse Analytics (formerly SQL DW) architecture

Azure Synapse 是一种无限制的分析服务,它将企业数据仓库和大数据分析结合在一起。Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. 借助它可以使用无服务器的按需资源或预配资源,任意执行自己定义的大规模数据查询。It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources-at scale. Azure Synapse 将这两个领域结合在一起,提供统一的体验来引入、准备、管理和处理数据。Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data.

Azure Synapse 包含四个组件:Azure Synapse has four components:

  • SQL Analytics:基于 T-SQL 的完整分析SQL Analytics: Complete T-SQL based analytics

    • SQL 池(按预配的 DWU 付费)- 已公开发布SQL pool (pay per DWU provisioned) - Generally Available
    • SQL 随选(按处理的 TB 付费)-(预览)SQL on-demand (pay per TB processed) - (Preview)
  • Spark:深度集成的 Apache Spark(预览)Spark: Deeply integrated Apache Spark (Preview)

  • 数据集成:混合数据集成(预览)Data Integration: Hybrid data integration (Preview)

  • Studio:统一的用户体验。Studio: unified user experience. (预览版)(Preview)

Synapse SQL MPP 体系结构组件Synapse SQL MPP architecture components

Synapse SQL 利用横向扩展体系结构将数据的计算处理分布在多个节点上。Synapse SQL leverages a scale-out architecture to distribute computational processing of data across multiple nodes. 缩放单位是计算能力(称为数据仓库单位)的抽象概念。The unit of scale is an abstraction of compute power that is known as a data warehouse unit. 计算独立于存储,使你能够独立于系统中的数据来缩放计算。Compute is separate from storage, which enables you to scale compute independently of the data in your system.

Synapse SQL 体系结构

SQL Analytics 使用基于节点的体系结构。SQL Analytics uses a node-based architecture. 应用程序将 T-SQL 命令连接到、发布给控制节点,该节点是 SQL Analytics 的单一入口点。Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for SQL Analytics. 控制节点运行用于优化并行处理查询的 MPP 引擎,然后将操作传递给计算节点以实现并行工作。The Control node runs the MPP engine, which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

计算节点将所有用户数据存储在 Azure 存储中并运行并行查询。The Compute nodes store all user data in Azure Storage and run the parallel queries. 数据移动服务 (DMS) 是一项系统级内部服务,它根据需要在节点间移动数据以并行运行查询和返回准确的结果。The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

使用分离的存储和计算,用户可以在使用 Synapse SQL 池时执行以下操作:With decoupled storage and compute, when using Synapse SQL pool one can:

  • 无论存储需求如何,都可独立计算大小。Independently size compute power irrespective of your storage needs.
  • 无需移动数据,即可在 SQL 池(数据仓库)中增加或减少计算能力。Grow or shrink compute power, within a SQL pool (data warehouse), without moving data.
  • 在保持数据不受影响的情况下暂停计算容量,因此只需为存储付费。Pause compute capacity while leaving data intact, so you only pay for storage.
  • 在操作期间恢复计算容量。Resume compute capacity during operational hours.

Azure 存储Azure Storage

Synapse SQL 利用 Azure 存储来确保用户数据安全。Synapse SQL leverages Azure Storage to keep your user data safe. 由于数据由 Azure 存储进行存储和管理,因此会针对存储使用量单独收费。Since your data is stored and managed by Azure Storage, there is a separate charge for your storage consumption. 数据将分片到分布区,以优化系统性能。The data is sharded into distributions to optimize the performance of the system. 可选择在定义表时用于分布数据的分片模式。You can choose which sharding pattern to use to distribute the data when you define the table. 支持以下分片模式:These sharding patterns are supported:

  • 哈希Hash
  • 轮循机制Round Robin
  • 复制Replicate

控制节点Control node

控制节点是体系结构的核心。The Control node is the brain of the architecture. 它是与所有应用程序和连接进行交互的前端。It is the front end that interacts with all applications and connections. MPP 引擎在控制节点上运行以优化和协调并行查询。The MPP engine runs on the Control node to optimize and coordinate parallel queries. 提交 T-SQL 查询时,控制节点会将其转换为可针对每个分布区并行运行的查询。When you submit a T-SQL query, the Control node transforms it into queries that run against each distribution in parallel.

计算节点Compute nodes

计算节点提供计算能力。The Compute nodes provide the computational power. 分布区映射到计算节点以进行处理。Distributions map to Compute nodes for processing. 当你为更多计算资源付费时,分布区会重新映射到可用的计算节点。As you pay for more compute resources, distributions are remapped to available Compute nodes. 计算节点数的范围为从 1 到 60,它由 Synapse SQL 的服务级别确定。The number of compute nodes ranges from 1 to 60, and is determined by the service level for Synapse SQL.

每个计算节点均有一个节点 ID,该 ID 会显示在系统视图中。Each Compute node has a node ID that is visible in system views. 在名称以 sys.pdw_nodes 开头的系统视图中找到 node_id 列即可查看计算节点 ID。You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes. 有关这些系统视图的列表,请参阅 MPP 系统视图For a list of these system views, see MPP system views.

数据移动服务Data Movement Service

数据移动服务 (DMS) 是一项数据传输技术,它可协调计算节点间的数据移动。Data Movement Service (DMS) is the data transport technology that coordinates data movement between the Compute nodes. 某些查询需要移动数据以确保并行查询返回准确的结果。Some queries require data movement to ensure the parallel queries return accurate results. 需要移动数据时,DMS 可确保正确的数据到达正确的位置。When data movement is required, DMS ensures the right data gets to the right location.

分发Distributions

分布区是存储和处理针对分布式数据运行的并行查询的基本单位。A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. SQL Analytics 运行查询时,工作会被分割成 60 个并行运行的小型查询。When SQL Analytics runs a query, the work is divided into 60 smaller queries that run in parallel.

每个小型查询各在一个数据分布区上运行。Each of the 60 smaller queries runs on one of the data distributions. 每个计算节点管理其中一个或多个分布区。Each Compute node manages one or more of the 60 distributions. 具有最多计算资源的 SQL 池的每个分布区占 1 个计算节点。A SQL pool with maximum compute resources has one distribution per Compute node. 具有最小计算资源的 SQL 池的所有分布区占 1 个计算节点。A SQL pool with minimum compute resources has all the distributions on one compute node.

哈希分布表Hash-distributed tables

哈希分布表可为大型表上的联接和聚合提供最高查询性能。A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.

为了将数据分片到哈希分布式表中,将使用哈希函数明确将 1 个行分配到 1 个分布区。To shard data into a hash-distributed table, a hash function is used to deterministically assign each row to one distribution. 在表定义中,可以将一个列指定为分布列。In the table definition, one of the columns is designated as the distribution column. 哈希函数使用分布列中的值将 1 个行分配到 1 个分布区。The hash function uses the values in the distribution column to assign each row to a distribution.

下图说明了如何将完整的非分布式表存储为哈希分布表。The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

分布式表Distributed table

  • 一个行属于一个分布区。Each row belongs to one distribution.
  • 通过确定性哈希算法将一个行分配到一个分布区。A deterministic hash algorithm assigns each row to one distribution.
  • 不同大小的表显示,每个分布区的表行的数目各不相同。The number of table rows per distribution varies as shown by the different sizes of tables.

选择分布列时需考虑到性能,例如特异性、数据倾斜,以及在系统上运行的查询类型。There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.

轮循分布表Round-robin distributed tables

轮循机制表是最简单的表,在被用作负载临时表时,它可创造和提供高速性能。A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

轮循机制分布表在表中均匀分布数据,但不会进行进一步优化。A round-robin distributed table distributes data evenly across the table but without any further optimization. 首先随机选择一个分布区,然后将行的缓冲区按顺序分配给分布区。A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. 将数据加载到轮循机制表速度很快,但就查询性能而言,哈希分布式表的性能更佳。It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. 联接循环表要求重新组织数据,因此这需要花费更多时间。Joins on round-robin tables require reshuffling data, which takes additional time.

复制表Replicated Tables

复制表为小型表提供最快查询性能。A replicated table provides the fastest query performance for small tables.

复制表在每个计算节点上缓存表的完整副本。A table that is replicated caches a full copy of the table on each compute node. 因此复制表以后,无需在执行联接或聚合前在计算节点中间传输数据。Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. 复制表尤为适用于小型表。Replicated tables are best utilized with small tables. 它需要额外存储并且在写入数据时会产生额外开销,因此不适用于大型表。Extra storage is required and there is additional overhead that is incurred when writing data, which make large tables impractical.

下图显示了在每个计算节点的第一个分布区上缓存的复制表。The diagram below shows a replicated table that is cached on the first distribution on each compute node.

复制表Replicated table

后续步骤Next steps

对 Azure Synapse 有了初步的认识后,请学习如何快速创建 SQL 池加载示例数据Now that you know a bit about Azure Synapse, learn how to quickly create a SQL pool and load sample data. 如果不熟悉 Azure,遇到新术语时,Azure 词汇表 可以提供帮助。If you are new to Azure, you may find the Azure glossary helpful as you encounter new terminology. 或者,查看以下一些其他 Azure Synapse 资源。Or look at some of these other Azure Synapse Resources.