Azure Synapse Analytics

Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based enterprise data warehouse that leverages massively parallel processing (MPP) to quickly run complex queries across petabytes of data. Use it as a key component of a big data solution. Import big data into Azure Synapse with simple PolyBase T-SQL queries or the COPY statement, and then use the power of MPP to run high-performance analytics. As you integrate and analyze, the data warehouse becomes the single version of truth your business can count on for insights.

You can access Azure Synapse from Azure Databricks using the Azure Synapse connector, a data source implementation for Apache Spark that uses Azure Blob storage, together with PolyBase or the COPY statement in Azure Synapse, to transfer large volumes of data efficiently between an Azure Databricks cluster and an Azure Synapse instance.

Both the Azure Databricks cluster and the Azure Synapse instance access a common Blob storage container to exchange data between these two systems. In Azure Databricks, Apache Spark jobs are triggered by the Azure Synapse connector to read data from and write data to the Blob storage container. On the Azure Synapse side, data loading and unloading operations performed by PolyBase are triggered by the Azure Synapse connector through JDBC. In Databricks Runtime 7.0 and above, COPY is used by default to load data into Azure Synapse by the Azure Synapse connector through JDBC.

Note

COPY is available in Databricks Runtime 7.0 and above.

The Azure Synapse connector is better suited to ETL than to interactive queries, because each query execution can extract large amounts of data to Blob storage. If you plan to perform several queries against the same Azure Synapse table, we recommend that you save the extracted data in a format such as Parquet.
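
For example, a minimal sketch of this pattern in Python, assuming df was loaded from Azure Synapse as shown in Usage (Batch) below and that the Parquet path is a placeholder:

# Persist the extract once, then run subsequent queries against the Parquet copy
# instead of re-extracting from Azure Synapse each time.
df.write.mode("overwrite").parquet("/tmp/synapse_extracts/my_table_in_dw")

cached = spark.read.parquet("/tmp/synapse_extracts/my_table_in_dw")
cached.filter("x > 0").count()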

Requirements

A database master key for the Azure Synapse instance.

Authentication

The Azure Synapse connector uses three types of network connections:

  • Spark driver to Azure Synapse
  • Spark driver and executors to Azure storage account
  • Azure Synapse to Azure storage account
                           ┌─────────┐
      ┌───────────────────>│ STORAGE │<──────────────────┐
      │ Storage acc key /  │ ACCOUNT │ Storage acc key / │
      │ Managed Service ID └─────────┘ OAuth 2.0         │
      │                         │                        │
      │                         │ Storage acc key /      │
      │                         │ OAuth 2.0              │
      v                         v                        v
┌──────────┐              ┌──────────┐             ┌───────────┐
│ Synapse  │              │  Spark   │             │   Spark   │
│ Analytics│<────────────>│  Driver  │<───────────>│ Executors │
└──────────┘  JDBC with   └──────────┘  Configured └───────────┘
              username &              in Spark
              password

The following sections describe each connection's authentication configuration options.

Spark driver to Azure Synapse

The Spark driver connects to Azure Synapse using JDBC with a username and password. We recommend that you use the connection string provided by the Azure portal, which enables Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the Azure Synapse instance through the JDBC connection. To verify that SSL encryption is enabled, search for encrypt=true in the connection string. To allow the Spark driver to reach Azure Synapse, we recommend that you set Allow access to Azure services to ON on the firewall pane of the Azure Synapse server through the Azure portal. This setting allows communications from all Azure IP addresses and all Azure subnets, which lets the Spark driver reach the Azure Synapse instance.
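
For illustration only, a connection string of the shape provided by the Azure portal looks roughly like the following sketch; every bracketed value is a placeholder, and encrypt=true is the part to verify:

# Hypothetical JDBC URL of the shape produced by the Azure portal; note encrypt=true.
jdbc_url = (
  "jdbc:sqlserver://<your-server-name>.database.chinacloudapi.cn:1433;"
  "database=<your-database-name>;user=<your-user-name>@<your-server-name>;"
  "password=<your-password>;encrypt=true;trustServerCertificate=false;"
  "hostNameInCertificate=*.database.chinacloudapi.cn;loginTimeout=30;")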

Spark driver and executors to Azure storage account

The Azure storage container acts as an intermediary to store bulk data when reading from or writing to Azure Synapse. Spark connects to the storage container using one of the built-in connectors: Azure Blob storage or Azure Data Lake Storage (ADLS) Gen2. Therefore, the only supported URI schemes are wasbs and abfss.

The credentials used for setting up this connection must be either the storage account access key and secret (both Blob storage and ADLS Gen2) or OAuth 2.0 tokens (ADLS Gen2 only; see Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal). There are two ways of providing these credentials: notebook session configuration and global Hadoop configuration. The following examples illustrate both ways using the storage account access key approach; the same applies to OAuth 2.0 configuration.

Notebook session configuration (preferred)

Using this approach, the account access key is set in the session configuration associated with the notebook that runs the command. This configuration does not affect other notebooks attached to the same cluster. spark is the SparkSession object provided in the notebook.

spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

Global Hadoop configuration

This approach updates the global Hadoop configuration associated with the SparkContext object shared by all notebooks.

Scala
sc.hadoopConfiguration.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")
Python

hadoopConfiguration is not exposed in all versions of PySpark. Although the following command relies on some Spark internals, it should work with all PySpark versions and is unlikely to break or change in the future:

sc._jsc.hadoopConfiguration().set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

Azure Synapse to Azure storage account

Azure Synapse also connects to a storage account during loading and unloading of temporary data. To set up the credentials for a storage account in the connected Azure Synapse instance, you can set forwardSparkAzureStorageCredentials to true, in which case the Azure Synapse connector automatically discovers the account access key set in the notebook session configuration or the global Hadoop configuration and forwards the storage account access key to the connected Azure Synapse instance over JDBC. The forwarded storage access key is represented by a temporary database scoped credential in the Azure Synapse instance. The Azure Synapse connector creates a database scoped credential before asking Azure Synapse to load or unload data, and deletes that credential once the loading or unloading operation is completed.

Alternatively, if you use ADLS Gen2 + OAuth 2.0 authentication, or your Azure Synapse instance is configured with a Managed Service Identity (typically in conjunction with a VNet + Service Endpoints setup), you must set useAzureMSI to true. In this case the connector specifies IDENTITY = 'Managed Service Identity' for the database scoped credential and no SECRET.
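
As a rough sketch (not the only valid configuration), a batch write that relies on the managed service identity instead of forwarding storage credentials could look like the following in Python, assuming df is the DataFrame to write and the URL, table, and tempDir values are placeholders:

# Write to Azure Synapse using the instance's Managed Service Identity; the database
# scoped credential is created with IDENTITY = 'Managed Service Identity' and no SECRET.
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("useAzureMSI", "true") \
  .option("dbTable", "my_table_in_dw_copy") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .save()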

Streaming support

The Azure Synapse connector offers efficient and scalable Structured Streaming write support for Azure Synapse that provides a consistent user experience with batch writes and uses PolyBase for large data transfers between an Azure Databricks cluster and an Azure Synapse instance. Similar to batch writes, streaming is designed largely for ETL, thus providing higher latency that may not be suitable for real-time data processing in some cases.

The Azure Synapse connector supports Append and Complete output modes for record appends and aggregations. See the Structured Streaming guide for more details on output modes and the compatibility matrix.
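
For instance, a streaming aggregation can be written in Complete mode; a minimal sketch in Python, assuming df is a streaming DataFrame (such as the rate source shown in Usage (Streaming) below) and that the URL, table name, and checkpoint path are placeholders:

# Continuously write a streaming aggregation to Azure Synapse using Complete output mode.
counts = df.groupBy().count()  # A simple global row count as the aggregation.

counts.writeStream \
  .format("com.databricks.spark.sqldw") \
  .outputMode("complete") \
  .option("url", "<azure-sqldw-jdbc-url>") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "<table-name>") \
  .option("checkpointLocation", "/tmp_checkpoint_location") \
  .start()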

Fault tolerance semantics

By default, Azure Synapse streaming offers an end-to-end exactly-once guarantee for writing data into an Azure Synapse table by reliably tracking the progress of the query, using a combination of the checkpoint location in DBFS, a checkpoint table in Azure Synapse, and a locking mechanism, so that streaming can handle any type of failure, retry, or query restart. Optionally, you can select less restrictive at-least-once semantics for Azure Synapse streaming by setting the spark.databricks.sqldw.streaming.exactlyOnce.enabled option to false, in which case data duplication could occur in the event of intermittent connection failures to Azure Synapse or unexpected query termination.
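
For example, a one-line sketch of opting into at-least-once semantics, assuming you set it in the notebook session configuration:

# Relax the default end-to-end exactly-once guarantee to at-least-once.
spark.conf.set("spark.databricks.sqldw.streaming.exactlyOnce.enabled", "false")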

Usage (Batch)

You can use this connector via the data source API in Scala, Python, SQL, and R notebooks.

Scala

// Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

// Get some data from an Azure Synapse table.
val df: DataFrame = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "my_table_in_dw")
  .load()

// Load data from an Azure Synapse query.
val df: DataFrame = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("query", "select x, count(*) as cnt from my_table_in_dw group by x")
  .load()

// Apply some transformations to the data, then use the
// Data Source API to write the data back to another table in Azure Synapse.

df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "my_table_in_dw_copy")
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")
  .save()

Python

# Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

# Get some data from an Azure Synapse table.
df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table_in_dw") \
  .load()

# Load data from an Azure Synapse query.
df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("query", "select x, count(*) as cnt from my_table_in_dw group by x") \
  .load()

# Apply some transformations to the data, then use the
# Data Source API to write the data back to another table in Azure Synapse.

df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table_in_dw_copy") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .save()

SQL

-- Set up the Blob storage account access key in the notebook session conf.
SET fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn=<your-storage-account-access-key>;

-- Read data using SQL.
CREATE TABLE my_table_in_spark_read
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw',
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>'
);

-- Write data using SQL.
-- Create a new table, throwing an error if a table with the same name already exists:

CREATE TABLE my_table_in_spark_write
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://<the-rest-of-the-connection-string>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'my_table_in_dw_copy',
  tempDir 'wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>'
)
AS SELECT * FROM table_to_save_in_spark;

R

# Load SparkR
library(SparkR)

# Set up the Blob storage account access key in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn", "<your-storage-account-access-key>")

# Get some data from an Azure Synapse table.
df <- read.df(
   source = "com.databricks.spark.sqldw",
   url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
   forward_spark_azure_storage_credentials = "true",
   dbTable = "my_table_in_dw",
   tempDir = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")

# Load data from an Azure Synapse query.
df <- read.df(
   source = "com.databricks.spark.sqldw",
   url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
   forward_spark_azure_storage_credentials = "true",
   query = "select x, count(*) as cnt from my_table_in_dw group by x",
   tempDir = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")

# Apply some transformations to the data, then use the
# Data Source API to write the data back to another table in Azure Synapse.

write.df(
  df,
  source = "com.databricks.spark.sqldw",
  url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
  forward_spark_azure_storage_credentials = "true",
  dbTable = "my_table_in_dw_copy",
  tempDir = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")

Usage (Streaming)

You can write data using Structured Streaming in Scala and Python notebooks.

Scala

// Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

// Prepare streaming source; this could be Kafka or a simple rate stream.
val df: DataFrame = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100000")
  .option("numPartitions", "16")
  .load()

// Apply some transformations to the data then use
// Structured Streaming API to continuously write the data to a table in Azure Synapse.

df.writeStream
  .format("com.databricks.spark.sqldw")
  .option("url", <azure-sqldw-jdbc-url>)
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", <table-name>)
  .option("checkpointLocation", "/tmp_checkpoint_location")
  .start()

Python

# Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.chinacloudapi.cn",
  "<your-storage-account-access-key>")

# Prepare streaming source; this could be Kafka or a simple rate stream.
df = spark.readStream \
  .format("rate") \
  .option("rowsPerSecond", "100000") \
  .option("numPartitions", "16") \
  .load()

# Apply some transformations to the data then use
# Structured Streaming API to continuously write the data to a table in Azure Synapse.

df.writeStream \
  .format("com.databricks.spark.sqldw") \
  .option("url", <azure-sqldw-jdbc-url>) \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", <table-name>) \
  .option("checkpointLocation", "/tmp_checkpoint_location") \
  .start()

Configuration

This section describes how to configure write semantics for the connector, the required permissions, and miscellaneous configuration parameters.


Write semantics

In addition to PolyBase, the Azure Synapse connector supports the COPY statement. The COPY statement offers a more convenient way of loading data into Azure Synapse without the need to create an external table, requires fewer permissions to load data, and improves performance for high-throughput data ingestion into Azure Synapse.

You can use the following configuration to enforce the write semantics:

Scala

// Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")

Python

# Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")

SQL

-- Configure the write semantics for Azure Synapse connector in the notebook session conf.
SET spark.databricks.sqldw.writeSemantics=<write-semantics>;

R

# Load SparkR
library(SparkR)

# Configure the write semantics for Azure Synapse connector in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.writeSemantics", "<write-semantics>")

where <write-semantics> is:

  • polybase (use PolyBase to load data into Azure Synapse)
  • copy (in Databricks Runtime 7.0 and above, use the COPY statement to load data into Azure Synapse)
  • unspecified (falls back to the default: for ADLS Gen2 on Databricks Runtime 7.0 and above the connector uses copy, otherwise polybase)

Required Azure Synapse permissions for PolyBase

When you use PolyBase, the Azure Synapse connector requires the JDBC connection user to have permission to create database scoped credentials, external data sources, external file formats, and external tables in the connected Azure Synapse instance.

As a prerequisite for creating a database scoped credential, the connector expects that a database master key already exists for the specified Azure Synapse instance. If not, you can create one using the CREATE MASTER KEY command.

Additionally, to read the Azure Synapse table set through dbTable or the tables referred to in query, the JDBC user must have permission to access the needed Azure Synapse tables. To write data back to an Azure Synapse table set through dbTable, the JDBC user must have permission to write to that Azure Synapse table.

The following table summarizes the permissions for all operations with PolyBase:

Operation         Permissions
Batch write       CONTROL
Streaming write   CONTROL
Read              CONTROL

Required Azure Synapse permissions for the COPY statement

Note

Available in Databricks Runtime 7.0 and above.

When you use the COPY statement, the Azure Synapse connector requires the JDBC connection user to have the permissions summarized in the table below in the connected Azure Synapse instance.

If the destination table does not exist in Azure Synapse, additional permissions to create it are required, as shown in the table below.

The following table summarizes the permissions for batch and streaming writes with COPY:

Batch write
  • Insert into an existing table: ADMINISTER DATABASE BULK OPERATIONS, INSERT
  • Insert into a new table: ADMINISTER DATABASE BULK OPERATIONS, INSERT, CREATE TABLE, ALTER ON SCHEMA :: dbo

Streaming write
  • Insert into an existing table: ADMINISTER DATABASE BULK OPERATIONS, INSERT
  • Insert into a new table: ADMINISTER DATABASE BULK OPERATIONS, INSERT, CREATE TABLE, ALTER ON SCHEMA :: dbo

Parameters

The parameter map or OPTIONS provided in Spark SQL support the following settings:

dbTable
  Required: Yes, unless query is specified
  Default: No default
  The table to create or read from in Azure Synapse. This parameter is required when saving data back to Azure Synapse. You can also use {SCHEMA NAME}.{TABLE NAME} to access a table in a given schema. If the schema name is not provided, the default schema associated with the JDBC user is used. The previously supported dbtable variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

query
  Required: Yes, unless dbTable is specified
  Default: No default
  The query to read from in Azure Synapse. For tables referred to in the query, you can also use {SCHEMA NAME}.{TABLE NAME} to access a table in a given schema. If the schema name is not provided, the default schema associated with the JDBC user is used.

user
  Required: No
  Default: No default
  The Azure Synapse username. Must be used in tandem with the password option. Can only be used if the user and password are not passed in the URL; passing both results in an error.

password
  Required: No
  Default: No default
  The Azure Synapse password. Must be used in tandem with the user option. Can only be used if the user and password are not passed in the URL; passing both results in an error.

url
  Required: Yes
  Default: No default
  A JDBC URL with sqlserver set as the subprotocol. It is recommended to use the connection string provided by the Azure portal. Setting encrypt=true is strongly recommended, because it enables SSL encryption of the JDBC connection. If user and password are set separately, you do not need to include them in the URL.

jdbcDriver
  Required: No
  Default: Determined by the JDBC URL’s subprotocol
  The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver class name is automatically determined by the JDBC URL’s subprotocol. The previously supported jdbc_driver variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

tempDir
  Required: Yes
  Default: No default
  A wasbs URI. We recommend that you use a dedicated Blob storage container for Azure Synapse. The previously supported tempdir variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

tempFormat
  Required: No
  Default: PARQUET
  The format in which to save temporary files to the blob store when writing to Azure Synapse. Defaults to PARQUET; no other values are allowed right now.

tempCompression
  Required: No
  Default: SNAPPY
  The compression algorithm used by both Spark and Azure Synapse to encode and decode temporary files. Currently supported values are UNCOMPRESSED, SNAPPY, and GZIP.

forwardSparkAzureStorageCredentials
  Required: No
  Default: false
  If true, the library automatically discovers the credentials that Spark is using to connect to the Blob storage container and forwards those credentials to Azure Synapse over JDBC. These credentials are sent as part of the JDBC query, so it is strongly recommended that you enable SSL encryption of the JDBC connection when you use this option. The current version of the Azure Synapse connector requires exactly one of forwardSparkAzureStorageCredentials or useAzureMSI to be explicitly set to true. The previously supported forward_spark_azure_storage_credentials variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

useAzureMSI
  Required: No
  Default: false
  If true, the library specifies IDENTITY = 'Managed Service Identity' and no SECRET for the database scoped credentials it creates. The current version of the Azure Synapse connector requires exactly one of forwardSparkAzureStorageCredentials or useAzureMSI to be explicitly set to true.

tableOptions
  Required: No
  Default: CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN
  A string used to specify table options when creating the Azure Synapse table set through dbTable. This string is passed literally to the WITH clause of the CREATE TABLE SQL statement that is issued against Azure Synapse. The previously supported table_options variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

preActions
  Required: No
  Default: No default (empty string)
  A ;-separated list of SQL commands to be executed in Azure Synapse before writing data to the Azure Synapse instance. These SQL commands must be valid commands accepted by Azure Synapse. If any of these commands fail, it is treated as an error and the write operation is not executed.

postActions
  Required: No
  Default: No default (empty string)
  A ;-separated list of SQL commands to be executed in Azure Synapse after the connector successfully writes data to the Azure Synapse instance. These SQL commands must be valid commands accepted by Azure Synapse. If any of these commands fail, it is treated as an error and you will get an exception after the data is successfully written to the Azure Synapse instance.

maxStrLength
  Required: No
  Default: 256
  StringType in Spark is mapped to the NVARCHAR(maxStrLength) type in Azure Synapse. You can use maxStrLength to set the string length for all NVARCHAR(maxStrLength) type columns in the table named dbTable in Azure Synapse. The previously supported maxstrlength variant is deprecated and will be ignored in future releases. Use the “camel case” name instead.

checkpointLocation
  Required: Yes
  Default: No default
  Location on DBFS that is used by Structured Streaming to write metadata and checkpoint information. See Recovering from Failures with Checkpointing in the Structured Streaming programming guide.

numStreamingTempDirsToKeep
  Required: No
  Default: 0
  Indicates how many of the latest temporary directories to keep for periodic cleanup of micro batches in streaming. When set to 0, directory deletion is triggered immediately after a micro batch is committed; otherwise, the provided number of latest micro batches is kept and the remaining directories are removed. Use -1 to disable periodic cleanup.

applicationName
  Required: No
  Default: Databricks-User-Query
  The tag of the connection for each query. If not specified, or if the value is an empty string, the default value of the tag is added to the JDBC URL. The default value prevents the Azure DB Monitoring tool from raising spurious SQL injection alerts against queries.

maxbinlength
  Required: No
  Default: No default
  Controls the column length of BinaryType columns. This parameter is translated as VARBINARY(maxbinlength).

Note

  • tableOptions, preActions, postActions, and maxStrLength are relevant only when writing data from Azure Databricks to a new table in Azure Synapse.
  • checkpointLocation and numStreamingTempDirsToKeep are relevant only for streaming writes from Azure Databricks to a new table in Azure Synapse.
  • Even though all data source option names are case-insensitive, we recommend that you specify them in “camel case” for clarity.

Query pushdown into Azure Synapse

The Azure Synapse connector implements a set of optimization rules to push the following operators down into Azure Synapse:

  • Filter
  • Project
  • Limit

ProjectFilter 运算符支持以下表达式:The Project and Filter operators support the following expressions:

  • Most Boolean logic operators
  • Comparisons
  • Basic arithmetic operations
  • Numeric and string casts

For the Limit operator, pushdown is supported only when no ordering is specified. For example, SELECT TOP(10) * FROM table is pushed down, but SELECT TOP(10) * FROM table ORDER BY col is not.

Note

The Azure Synapse connector does not push down expressions operating on strings, dates, or timestamps.

Query pushdown built with the Azure Synapse connector is enabled by default. You can disable it by setting spark.databricks.sqldw.pushdown to false.
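
For example, assuming you set it in the notebook session configuration:

# Disable query pushdown into Azure Synapse for this session.
spark.conf.set("spark.databricks.sqldw.pushdown", "false")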

Temporary data management

The Azure Synapse connector does not delete the temporary files that it creates in the Blob storage container. Therefore, we recommend that you periodically delete temporary files under the user-supplied tempDir location.

To facilitate data cleanup, the Azure Synapse connector does not store data files directly under tempDir, but instead creates a subdirectory of the form <tempDir>/<yyyy-MM-dd>/<HH-mm-ss-SSS>/<randomUUID>/. You can set up periodic jobs (using the Azure Databricks jobs feature or otherwise) to recursively delete any subdirectories that are older than a given threshold (for example, 2 days), on the assumption that no Spark jobs run longer than that threshold.
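
A minimal sketch of such a cleanup job in Python, assuming it runs on Databricks (so dbutils is available), that the tempDir value below is a placeholder for your own container path, and that a 2-day retention threshold is acceptable:

import datetime

# Placeholder: the same tempDir you pass to the Azure Synapse connector.
temp_dir = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.chinacloudapi.cn/<your-directory-name>"
threshold = datetime.date.today() - datetime.timedelta(days=2)

# The connector writes under <tempDir>/<yyyy-MM-dd>/..., so the first-level
# directory name encodes the creation date of the temporary data.
for entry in dbutils.fs.ls(temp_dir):
    name = entry.name.strip("/")
    try:
        created = datetime.datetime.strptime(name, "%Y-%m-%d").date()
    except ValueError:
        continue  # Skip anything that is not a date-stamped subdirectory.
    if created < threshold:
        dbutils.fs.rm(entry.path, True)  # Recursively delete the old subdirectory.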

A simpler alternative is to periodically drop the whole container and create a new one with the same name. This requires that you use a dedicated container for the temporary data produced by the Azure Synapse connector and that you can find a time window in which you can guarantee that no queries involving the connector are running.

Temporary object management

The Azure Synapse connector automates data transfer between an Azure Databricks cluster and an Azure Synapse instance. To read data from an Azure Synapse table or query, or to write data to an Azure Synapse table, the Azure Synapse connector creates temporary objects behind the scenes, including DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, EXTERNAL FILE FORMAT, and EXTERNAL TABLE. These objects live only for the duration of the corresponding Spark job and should automatically be dropped thereafter.

When a cluster is running a query using the Azure Synapse connector, if the Spark driver process crashes or is forcefully restarted, or if the cluster is forcefully terminated or restarted, temporary objects might not be dropped. To facilitate identification and manual deletion of these objects, the Azure Synapse connector prefixes the names of all intermediate temporary objects created in the Azure Synapse instance with a tag of the form tmp_<yyyy_MM_dd_HH_mm_ss_SSS>_<randomUUID>_.

We recommend that you periodically look for leaked objects using queries such as the following:

  • SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'
  • SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'

Streaming checkpoint table management

The Azure Synapse connector does not delete the streaming checkpoint table that is created when a new streaming query is started. This behavior is consistent with the checkpointLocation on DBFS. Therefore, we recommend that you periodically delete checkpoint tables at the same time as you remove checkpoint locations on DBFS, for queries that will not be run in the future or whose checkpoint location has already been removed.

By default, all checkpoint tables have the name <prefix>_<query_id>, where <prefix> is a configurable prefix with the default value databricks_streaming_checkpoint and query_id is the streaming query ID with _ characters removed. To find all checkpoint tables for stale or deleted streaming queries, run the query:

SELECT * FROM sys.tables WHERE name LIKE 'databricks_streaming_checkpoint%'

You can configure the prefix with the Spark SQL configuration option spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix.
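
For example, assuming you set it in the notebook session configuration (the prefix value below is a placeholder):

# Use a custom prefix for the streaming checkpoint tables created in Azure Synapse.
spark.conf.set(
  "spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix",
  "my_streaming_checkpoint")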

Frequently asked questions (FAQ)

I received an error while using the Azure Synapse connector. How can I tell whether the error is from Azure Synapse or from Azure Databricks?

To help you debug errors, any exception thrown by code that is specific to the Azure Synapse connector is wrapped in an exception extending the SqlDWException trait. Exceptions also make the following distinction:

  • SqlDWConnectorException represents an error thrown by the Azure Synapse connector
  • SqlDWSideException represents an error thrown by the connected Azure Synapse instance

What should I do if my query fails with the error “No access key found in the session conf or the global Hadoop conf”?

This error means that the Azure Synapse connector could not find the storage account access key in the notebook session configuration or the global Hadoop configuration for the storage account specified in tempDir. See Usage (Batch) for examples of how to configure storage account access properly. If a Spark table is created using the Azure Synapse connector, you must still provide the storage account access credentials in order to read or write to the Spark table.

Can I use a Shared Access Signature (SAS) to access the Blob storage container specified by tempDir?

Azure Synapse does not support using SAS to access Blob storage. Therefore, the Azure Synapse connector does not support SAS for accessing the Blob storage container specified by tempDir.

I created a Spark table using the Azure Synapse connector with the dbTable option, wrote some data to this Spark table, and then dropped the Spark table. Will the table created on the Azure Synapse side be dropped?

No. Azure Synapse is considered an external data source. The Azure Synapse table whose name is set through dbTable is not dropped when the Spark table is dropped.

When writing a DataFrame to Azure Synapse, why do I need to say .option("dbTable", tableName).save() instead of just .saveAsTable(tableName)?

That is because we want to make the following distinction clear: .option("dbTable", tableName) refers to the database (that is, Azure Synapse) table, whereas .saveAsTable(tableName) refers to the Spark table. In fact, you can even combine the two: df.write. ... .option("dbTable", tableNameDW).saveAsTable(tableNameSpark) creates a table in Azure Synapse called tableNameDW and an external table in Spark called tableNameSpark that is backed by the Azure Synapse table.

Warning

Beware of the following difference between .save() and .saveAsTable():

  • For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).save(), writeMode acts on the Azure Synapse table, as expected.
  • For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).saveAsTable(tableNameSpark), writeMode acts on the Spark table, whereas tableNameDW is silently overwritten if it already exists in Azure Synapse.

This behavior is no different from writing to any other data source. It is just a caveat of the Spark DataFrameWriter API.