Access Azure Cosmos DB Cassandra API data from Azure Databricks
APPLIES TO: Cassandra API
This article details how to work with Azure Cosmos DB Cassandra API from Spark on Azure Databricks.
Prerequisites
- Provision an Azure Cosmos DB Cassandra API account
- Review the code samples for working with Cassandra API
Cassandra API instance configuration for the Cassandra connector:
The connector for Cassandra API requires the Cassandra connection details to be initialized as part of the Spark context. When you launch a Databricks notebook, the Spark context is already initialized, and it isn't advisable to stop and reinitialize it. One solution is to add the Cassandra API instance configuration at the cluster level, in the cluster Spark configuration. This is a one-time activity per cluster. Add the following code to the Spark configuration as space-separated key-value pairs:
spark.cassandra.connection.host YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmos.azure.cn
spark.cassandra.connection.port 10350
spark.cassandra.connection.ssl.enabled true
spark.cassandra.auth.username YOUR_COSMOSDB_ACCOUNT_NAME
spark.cassandra.auth.password YOUR_COSMOSDB_KEY
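If you prefer not to modify the cluster configuration, the connector typically also honors these settings when they are set from a notebook at runtime. The following Scala cell is a minimal sketch using the same placeholder account name and key as above; note, however, that the cluster-level approach is the one this article recommends:

// Set the Cassandra API connection details for the current Spark session
spark.conf.set("spark.cassandra.connection.host", "YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmos.azure.cn")
spark.conf.set("spark.cassandra.connection.port", "10350")
spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
spark.conf.set("spark.cassandra.auth.username", "YOUR_COSMOSDB_ACCOUNT_NAME")
spark.conf.set("spark.cassandra.auth.password", "YOUR_COSMOSDB_KEY")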
Add the required dependencies
Cassandra Spark connector: - To integrate Azure Cosmos DB Cassandra API with Spark, the Cassandra connector should be attached to the Azure Databricks cluster. To attach it to the cluster:
- Review the Databricks runtime version, which is the Spark version. Then find the Maven coordinates that are compatible with the Cassandra Spark connector, and attach the library to the cluster. See the "Upload a Maven package or Spark package" article to attach the connector library to the cluster. For example, the Maven coordinate for Databricks Runtime version 4.3, Spark 2.3.1, and Scala 2.11 is
spark-cassandra-connector_2.11-2.3.1
Azure Cosmos DB Cassandra API-specific library: - A custom connection factory is required to configure the retry policy from the Cassandra Spark connector to Azure Cosmos DB Cassandra API. Add the
com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.0.0
Maven coordinates to attach the library to the cluster.
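The sample notebooks that accompany this article enable the library by pointing the connector at the custom connection factory it provides. The following Scala cell is a minimal sketch; the factory class name is the one used in the published samples for helper version 1.0.0, so verify it against the library version you attach:

// Route connections through the Cosmos DB-aware connection factory so that its
// retry policy for rate-limited requests is applied.
spark.conf.set("spark.cassandra.connection.factory", "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")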
Sample notebooks
A list of Azure Databricks sample notebooks is available in the GitHub repo for you to download. These samples include how to connect to Azure Cosmos DB Cassandra API from Spark and perform different CRUD operations on the data. You can also import all the notebooks into your Databricks cluster workspace and run them.
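As an illustration of the operations the notebooks cover, the following Scala cell writes one row and reads it back. It is a minimal sketch that assumes the cluster-level configuration above, plus a hypothetical keyspace books_ks containing a table books:

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Build a one-row DataFrame (hypothetical book schema)
val booksDF = Seq(("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887))
  .toDF("book_id", "book_author", "book_name", "book_pub_year")

// Append the row to the Cassandra API table
booksDF.write
  .mode(SaveMode.Append)
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .save()

// Read the table back and display the contents
val readDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()
readDF.show()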
Accessing Azure Cosmos DB Cassandra API from Spark Scala programs
Spark programs that run as automated processes on Azure Databricks are submitted to the cluster by using spark-submit and are scheduled to run through Azure Databricks jobs.
The following links help you get started building Spark Scala programs to interact with Azure Cosmos DB Cassandra API; a minimal program skeleton follows the list.
- How to connect to Azure Cosmos DB Cassandra API from a Spark Scala program
- How to run a Spark Scala program as an automated job on Azure Databricks
- Complete list of code samples for working with Cassandra API
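To make the shape of such a program concrete, here is a minimal, self-contained Scala skeleton that could be packaged as a jar and run as a Databricks job. The connection values are the same placeholders used earlier, and the keyspace and table names are hypothetical. Assuming Scala 2.11 and connector 2.3.1 as in the earlier example, the connector dependency in build.sbt would be:

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.3.1"

And the program itself:

import org.apache.spark.sql.SparkSession

object CosmosCassandraSample {
  def main(args: Array[String]): Unit = {
    // When the cluster Spark config already carries the Cassandra API
    // settings, these .config(...) calls can be omitted.
    val spark = SparkSession.builder()
      .appName("CosmosCassandraSample")
      .config("spark.cassandra.connection.host", "YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmos.azure.cn")
      .config("spark.cassandra.connection.port", "10350")
      .config("spark.cassandra.connection.ssl.enabled", "true")
      .config("spark.cassandra.auth.username", "YOUR_COSMOSDB_ACCOUNT_NAME")
      .config("spark.cassandra.auth.password", "YOUR_COSMOSDB_KEY")
      .getOrCreate()

    // Read a hypothetical table and print the first few rows
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "books_ks", "table" -> "books"))
      .load()
    df.show(10)

    spark.stop()
  }
}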
Next steps
Get started with creating a Cassandra API account, database, and a table by using a Java application.