Infoworks integration

Important

This feature is in Public Preview.

Infoworks DataFoundry is an automated enterprise data operations and orchestration system that runs natively on Azure Databricks and leverages the full power of Azure Databricks to deliver an easy solution for data onboarding, an important first step in operationalizing your data lake. DataFoundry not only automates data ingestion, but also automates the key functionality that must accompany ingestion to establish a foundation for analytics. Data onboarding with DataFoundry automates:

  • Data ingestion: from all enterprise and external data sources
  • Data synchronization: CDC to keep data synchronized with the source
  • Data governance: cataloging, lineage, metadata management, audit, and history

Here are the steps for using Infoworks with Azure Databricks.

Step 1: Generate a Databricks personal access token

Infoworks authenticates with Azure Databricks using an Azure Databricks personal access token. To generate a personal access token, follow the instructions in Generate a personal access token.
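
If you prefer to script this step, a token can also be minted with the Databricks Token API (POST /api/2.0/token/create). The sketch below is illustrative, not part of the Infoworks setup procedure: the environment variable names are assumptions, and the call itself must be authenticated with an existing credential.

    import os
    import requests

    # Assumed environment variables (not part of the original doc):
    # DATABRICKS_HOST  - workspace URL, e.g. https://adb-1234567890123456.7.azuredatabricks.net
    # DATABRICKS_TOKEN - an existing token used to authenticate this API call
    host = os.environ["DATABRICKS_HOST"]
    auth = os.environ["DATABRICKS_TOKEN"]

    resp = requests.post(
        f"{host}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {auth}"},
        json={"lifetime_seconds": 7776000, "comment": "Infoworks integration"},
    )
    resp.raise_for_status()
    # token_value is returned exactly once; store it in a secret manager.
    print(resp.json()["token_value"])

Because the call must itself be authenticated, this is most useful for rotating tokens rather than bootstrapping the very first one.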

Step 2: Set up a cluster to support integration needs

Infoworks will write data to an Azure Data Lake Storage path, and the Azure Databricks integration cluster will read data from that location. Therefore, the integration cluster requires secure access to the Azure Data Lake Storage path.

Secure access to an Azure Data Lake Storage path

To secure access to data in Azure Data Lake Storage (ADLS), you can use an Azure storage account access key (recommended) or an Azure service principal.

Use an Azure storage account access key

You can configure a storage account access key on the integration cluster as part of the Apache Spark configuration. Ensure that the storage account has access to the ADLS container and file system used for staging data and the ADLS container and file system where you want to write the Delta Lake tables. To configure the integration cluster to use the key, follow the steps in Access ADLS Gen2 with storage key.
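
For orientation, a storage-key entry in the cluster's Spark configuration typically looks like the line below. The placeholders are illustrative, and the spark.hadoop prefix is one common way to pass Hadoop filesystem properties through cluster Spark config; follow the linked article for the authoritative form.

    spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net <storage-account-access-key>

Rather than pasting the raw key, you can reference a Databricks secret using the {{secrets/<scope>/<key-name>}} syntax.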

Use an Azure service principal

You can configure a service principal on the Azure Databricks integration cluster as part of the Apache Spark configuration. Ensure that the service principal has access to the ADLS container used for staging data and the ADLS container where you want to write the Delta tables.
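
A sketch of the corresponding Spark configuration, using the ABFS OAuth client-credentials properties (bracketed values such as <application-id> are placeholders you fill in; consult the ADLS Gen2 access documentation for the authoritative settings):

    spark.hadoop.fs.azure.account.auth.type OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    spark.hadoop.fs.azure.account.oauth2.client.id <application-id>
    spark.hadoop.fs.azure.account.oauth2.client.secret <service-credential>
    spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-id>/oauth2/token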

Specify the cluster configuration

  1. In the Cluster Mode drop-down, select Standard.

  2. In the Databricks Runtime Version drop-down, select a Databricks runtime version.

  3. Turn on Auto Optimize by adding the following properties to your Spark configuration:

    spark.databricks.delta.optimizeWrite.enabled true
    spark.databricks.delta.autoCompact.enabled true
    
  4. Configure your cluster depending on your integration and scaling needs.
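
Once the cluster is running, a quick way to confirm that the Auto Optimize properties took effect is to read them back from a notebook attached to the cluster, where Databricks predefines spark. A minimal sketch:

    # Run in a notebook attached to the integration cluster; `spark` is the
    # SparkSession that Databricks provides automatically.
    for key in (
        "spark.databricks.delta.optimizeWrite.enabled",
        "spark.databricks.delta.autoCompact.enabled",
    ):
        print(key, spark.conf.get(key))

Both properties should print true.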

For cluster configuration details, see Configure clusters.

Step 3: Obtain JDBC and ODBC connection details to connect to a cluster

To connect an Azure Databricks cluster to Infoworks, you need the following JDBC/ODBC connection properties:

  • JDBC URL
  • HTTP Path

For the steps to obtain the JDBC URL and HTTP Path, see Server hostname, port, HTTP path, and JDBC URL.
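
For orientation, the JDBC URL shown on a cluster's JDBC/ODBC tab generally has the following shape when authenticating with a personal access token. All bracketed values are placeholders, and the exact format can vary by driver version, so copy the actual URL from the UI rather than assembling it by hand:

    jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>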

Step 4: Get Infoworks for Azure Databricks

Go to Infoworks to learn more and get a demo.