StreamSets integration


This feature is in Public Preview.

StreamSets helps you manage and monitor your data flow throughout its lifecycle. StreamSets native integration with Azure Databricks and Delta Lake allows you to pull data from various sources and manage your pipelines easily.

Here are the steps for using StreamSets with Azure Databricks.

Step 1: Generate a Databricks personal access token

StreamSets authenticates with Azure Databricks using an Azure Databricks personal access token. To generate a personal access token, follow the instructions in Generate a personal access token.
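Once generated, the token is usually kept out of pipeline configuration files. As a minimal sketch (the helper function and environment variable name below are illustrative, not part of StreamSets or Databricks), you can read the token from the environment and build the Bearer header that Databricks REST APIs accept:

```python
import os

# Hypothetical helper: read a Databricks personal access token from an
# environment variable instead of hard-coding it in a pipeline config.
def databricks_auth_header(env_var: str = "DATABRICKS_TOKEN") -> dict:
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Databricks personal access token")
    # Databricks REST APIs accept the token as a Bearer credential.
    return {"Authorization": f"Bearer {token}"}

# Example with a placeholder token value:
os.environ["DATABRICKS_TOKEN"] = "dapiXXXXXXXXXXXX"
header = databricks_auth_header()
```

The same token is what you later paste into the StreamSets connection settings.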

Step 2: Set up a cluster to support integration needs

StreamSets will write data to an Azure Data Lake Storage path and the Azure Databricks integration cluster will read data from that location. Therefore the integration cluster requires secure access to the Azure Data Lake Storage path.

Secure access to an Azure Data Lake Storage path

To secure access to data in Azure Data Lake Storage (ADLS) you can use an Azure storage account access key (recommended) or an Azure service principal.

Use an Azure storage account access key

You can configure a storage account access key on the integration cluster as part of the Apache Spark configuration. Ensure that the storage account has access to the ADLS container and file system used for staging data and the ADLS container and file system where you want to write the Delta Lake tables. To configure the integration cluster to use the key, follow the steps in Access ADLS Gen2 with storage key.
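As an illustration, a storage account key entry in the cluster's Spark configuration typically takes the following shape, where the storage account name, secret scope, and key name are placeholders for your own values (referencing the key from a secret scope avoids pasting it in plain text):

```
fs.azure.account.key.<storage-account-name>.dfs.core.windows.net {{secrets/<scope-name>/<key-name>}}
```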

Use an Azure service principal

You can configure a service principal on the Azure Databricks integration cluster as part of the Apache Spark configuration. Ensure that the service principal has access to the ADLS container used for staging data and the ADLS container where you want to write the Delta tables.
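For reference, the Spark configuration for a service principal uses the standard Hadoop ABFS OAuth properties; the sketch below assumes placeholder values for the storage account, application (client) ID, directory (tenant) ID, and the secret scope holding the client secret:

```
fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net {{secrets/<scope-name>/<service-credential-key-name>}}
fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
```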

Specify the cluster configuration

  1. In the Cluster Mode drop-down, select Standard.

  2. In the Databricks Runtime Version drop-down, select Runtime: 6.3 or above.

  3. Turn on Auto Optimize by adding the following properties to your Spark configuration:

     spark.databricks.delta.optimizeWrite.enabled true
     spark.databricks.delta.autoCompact.enabled true

  4. Configure your cluster depending on your integration and scaling needs.
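The choices above can equally be expressed as a request payload for the Databricks Clusters API (POST /api/2.0/clusters/create). This is a sketch only: the cluster name, node type, and worker count below are illustrative placeholders, not required values.

```python
# Minimal cluster spec matching the steps above, shaped like a
# Clusters API create payload. Replace placeholders with your own values.
cluster_spec = {
    "cluster_name": "streamsets-integration",  # hypothetical name
    "spark_version": "6.3.x-scala2.11",        # Runtime 6.3 or above
    "node_type_id": "Standard_DS3_v2",         # placeholder node type
    "num_workers": 2,                          # scale to your needs
    "spark_conf": {
        # Step 3: turn on Auto Optimize for Delta Lake writes
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
}
```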

For cluster configuration details, see Configure clusters.

For the steps to obtain the JDBC URL and HTTP Path, see Server hostname, port, HTTP path, and JDBC URL.

Step 3: Obtain JDBC and ODBC connection details to connect to a cluster

To connect an Azure Databricks cluster to StreamSets you need the following JDBC/ODBC connection properties:

  • JDBC URL
  • HTTP Path
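To make the shape of these properties concrete, here is a sketch of assembling a Databricks JDBC URL in the legacy Simba Spark driver format from the connection details on the cluster's JDBC/ODBC tab. The hostname, HTTP path, and token below are placeholders, and your driver version may expect a slightly different URL scheme:

```python
# Assemble a legacy-format Databricks JDBC URL from its parts.
# AuthMech=3 with UID=token means "authenticate with a personal access token".
def databricks_jdbc_url(hostname: str, http_path: str, token: str, port: int = 443) -> str:
    return (
        f"jdbc:spark://{hostname}:{port}/default;"
        f"transportMode=http;ssl=1;"
        f"httpPath={http_path};"
        f"AuthMech=3;UID=token;PWD={token}"
    )

url = databricks_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net",               # placeholder host
    "sql/protocolv1/o/1234567890123456/0123-456789-abcde123",   # placeholder HTTP path
    "dapiXXXXXXXXXXXX",                                         # placeholder token
)
```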

Step 4: Get StreamSets for Azure Databricks

Register and start up StreamSets for Databricks on Azure.

Step 5: Learn how to use StreamSets to load data into Delta Lake

Start with a sample pipeline or check out StreamSets solutions to learn how to build a pipeline that ingests data into Delta Lake.