从 Excel、Python 或 R 连接到 Azure DatabricksConnect to Azure Databricks from Excel, Python, or R

在本文中,你将学习如何使用 Microsoft Excel、Python 或 R 语言通过 Databricks ODBC 驱动程序来连接 Azure Databricks。In this article, you learn how to use the Databricks ODBC driver to connect Azure Databricks with Microsoft Excel, Python, or R language. 在建立连接后,可以从 Excel、Python 或 R 客户端访问 Azure Databricks 中的数据。Once you establish the connection, you can access the data in Azure Databricks from the Excel, Python, or R clients. 还可以使用这些客户端来进一步分析数据。You can also use the clients to further analyze the data.

先决条件Prerequisites

  • 必须具有 Azure Databricks 工作区、Spark 群集以及与群集关联的示例数据。You must have an Azure Databricks workspace, a Spark cluster, and sample data associated with your cluster. 如果尚不具备这些先决条件,请完成使用 Azure 门户在 Azure Databricks 上运行 Spark 作业中的快速入门。If you do not already have these prerequisites, complete the quickstart at Run a Spark job on Azure Databricks using the Azure portal.

  • Databricks 驱动程序下载页面下载 Databricks ODBC 驱动程序。Download the Databricks ODBC driver from Databricks driver download page. 根据要从中连接到 Azure Databricks 的应用程序安装 32 位或 64 位版本。Install the 32-bit or 64-bit version depending on the application from where you want to connect to Azure Databricks. 例如,若要从 Excel 进行连接,请安装该驱动程序的 32 位版本。For example, to connect from Excel, install the 32-bit version of the driver. 若要从 R 和 Python 进行连接,请安装该驱动程序的 64 位版本。To connect from R and Python, install the 64-bit version of the driver.

  • 在 Databricks 中设置个人访问令牌。Set up a personal access token in Databricks. 有关说明,请参阅令牌管理For instructions, see Token management.

设置 DSNSet up a DSN

数据源名称 (DSN) 包含有关特定数据源的信息。A data source name (DSN) contains the information about a specific data source. ODBC 驱动程序需要使用此 DSN 来连接到数据源。An ODBC driver needs this DSN to connect to a data source. 在本部分中,你将设置可以用于 Databricks ODBC 驱动程序的 DSN,以便从 Microsoft Excel、Python 或 R 等客户端连接到 Azure Databricks。In this section, you set up a DSN that can be used with the Databricks ODBC driver to connect to Azure Databricks from clients like Microsoft Excel, Python, or R.

  1. 从 Azure Databricks 工作区中,导航到 Databricks 群集。From the Azure Databricks workspace, navigate to the Databricks cluster.

    打开 Databricks 群集Open Databricks cluster

  2. 在“配置”选项卡下,单击“JDBC/ODBC”选项卡并复制“服务器主机名”和“HTTP 路径”的值。Under the Configuration tab, click the JDBC/ODBC tab and copy the values for Server Hostname and HTTP Path. 需要使用这些值来完成本文中的步骤。You need these values to complete the steps in this article.

    获取 Databricks 配置Get Databricks configuration

  3. 在计算机上,根据所用的应用程序启动 ODBC 数据源 应用程序(32 位或 64 位)。On your computer, start ODBC Data Sources application (32-bit or 64-bit) depending on the application. 若要从 Excel 进行连接,请使用 32 位版本。To connect from Excel, use the 32-bit version. 若要从 R 和 Python 进行连接,请使用 64 位版本。To connect from R and Python, use the 64-bit version.

    启动 ODBCLaunch ODBC

  4. 在“用户 DSN”选项卡下,单击“添加”。Under the User DSN tab, click Add. 在“创建新数据源”对话框中,选择“Simba Spark ODBC 驱动程序”,然后单击“完成”。In the Create New Data Source dialog box, select the Simba Spark ODBC Driver , and then click Finish.

    启动 ODBCLaunch ODBC

  5. 在“Simba Spark ODBC 驱动程序”对话框中,提供以下值:In the Simba Spark ODBC Driver dialog box, provide the following values:

    配置 DSNConfigure DSN

    下表提供了有关要在对话框中提供的值的信息。The following table provides information on the values to provide in the dialog box.

    字段Field Value
    数据源名称Data Source Name 为数据源提供一个名称。Provide a name for the data source.
    主机Host(s) 提供从 Databricks 工作区中为 服务器主机名 复制的值。Provide the value that you copied from the Databricks workspace for Server hostname.
    端口Port 输入 443Enter 443.
    身份验证 > 机制Authentication > Mechanism 选择“用户名和密码”。Select User name and password.
    用户名User name 输入 tokenEnter token.
    密码Password 输入从 Databricks 工作区中复制的令牌值。Enter the token value that you copied from the Databricks workspace.

    在“DSN 设置”对话框中执行以下附加步骤。Perform the following additional steps in the DSN setup dialog box.

    • 单击“HTTP 选项”。Click HTTP Options. 在打开的对话框中,粘贴从 Databricks 工作区中复制的“HTTP 路径”值。In the dialog box that opens up, paste the value for HTTP Path that you copied from Databricks workspace. 单击“确定”。Click OK.
    • 单击“SSL 选项”。Click SSL Options. 在打开的对话框中,选中“启用 SSL”复选框。In the dialog box that opens up, select the Enable SSL check box. 单击“确定”。Click OK.
    • 单击“测试”以测试到 Azure Databricks 的连接。Click Test to test the connection to Azure Databricks. 单击“确定”以保存配置。Click OK to save the configuration.
    • 在“ODBC 数据源管理器”对话框中,单击“确定”。In the ODBC Data Source Administrator dialog box, click OK.

现在已设置了 DSN。You now have your DSN set up. 在接下来的各部分中,你将使用此 DSN 从 Excel、Python 或 R 连接到 Azure Databricks。In the next sections, you use this DSN to connect to Azure Databricks from Excel, Python, or R.

从 Microsoft Excel 进行连接Connect from Microsoft Excel

在本部分中,你将使用之前创建的 DSN 将数据从 Azure Databricks 拉取到 Microsoft Excel 中。In this section, you pull data from Azure Databricks into Microsoft Excel using the DSN you created earlier. 在开始之前,请确保已在计算机上安装了 Microsoft Excel。Before you begin, make sure you have Microsoft Excel installed on your computer. 可以使用从 Microsoft Excel 试用版链接获取的 Excel 试用版。You can use a trial version of Excel from Microsoft Excel trial link.

  1. 在 Microsoft Excel 中打开一个空白工作簿。Open a blank workbook in Microsoft Excel. 在“数据”功能区中,单击“获取数据”。From the Data ribbon, click Get Data. 单击“从其他源”,然后单击“从 ODBC”。Click From Other Sources and then click From ODBC.

    从 Excel 启动 ODBCLaunch ODBC from Excel

  2. 在“从 ODBC”对话框中,选择之前创建的 DSN,然后单击“确定”。In the From ODBC dialog box, select the DSN that you created earlier and then click OK.

    选择 DSNSelect DSN

  3. 如果系统提示输入凭据,对于用户名,输入 tokenIf you are prompted for credentials, for user name enter token. 对于密码,提供从 Databricks 工作区中检索到的令牌值。For password, provide the token value that you retrieved from the Databricks workspace.

    为 Databricks 提供凭据Provide credentials for Databricks

  4. 从导航器窗口中,选择 Databricks 中要加载到 Excel 的表,然后单击“加载”。From the navigator window, select the table in Databricks that you want to load to Excel, and then click Load.

    将 dta 加载到 ExcelLoad dta into Excel

将数据加载到 Excel 工作簿后,可以对其执行分析操作。Once you have the data in your Excel workbook, you can perform analytical operations on it.

从 R 进行连接Connect from R

备注

本部分提供有关如何将桌面上运行的 R Studio 客户端与 Azure Databricks 集成的信息。This section provides information on how to integrate an R Studio client running on your desktop with Azure Databricks. 有关如何在 Azure Databricks 群集本身上使用 R Studio 的说明,请参阅 Azure Databricks 上的 R StudioFor instructions on how to use R Studio on the Azure Databricks cluster itself, see R Studio on Azure Databricks.

在本部分中,你将使用 R 语言 IDE 来引用 Azure Databricks 中可用的数据。In this section, you use an R language IDE to reference data available in Azure Databricks. 在开始之前,必须在计算机上安装以下各项。Before you begin, you must have the following installed on the computer.

  • R 语言的 IDE。An IDE for R language. 本文中使用了 RStudio for Desktop。This article uses RStudio for Desktop. 可以从 R Studio 下载安装该软件。You can install it from R Studio download.
  • 如果使用 RStudio for Desktop 作为 IDE,还需要从 https://aka.ms/rclient/ 安装 Microsoft R Client。If you use RStudio for Desktop as your IDE, also install Microsoft R Client from https://aka.ms/rclient/.

打开 RStudio 并执行以下步骤:Open RStudio and do the following steps:

  • 引用 RODBC 包。Reference the RODBC package. 这使得你可以使用之前创建的 DSN 连接到 Azure Databricks。This enables you to connect to Azure Databricks using the DSN you created earlier.
  • 使用 DSN 建立连接。Establish a connection using the DSN.
  • 对 Azure Databricks 中的数据运行 SQL 查询。Run a SQL query on the data in Azure Databricks. 在以下代码片段中, radio_sample_data 是 Azure Databricks 中已存在的表。In the following snippet, radio_sample_data is a table that already exists in Azure Databricks.
  • 对查询执行一些操作来验证输出。Perform some operations on the query to verify the output.

下面的代码片段执行以下任务:The following code snippet performs these tasks:

# reference the 'RODBC' package
require(RODBC)

# establish a connection using the DSN you created earlier
conn <- odbcConnect("<ENTER DSN NAME HERE>")

# run a SQL query using the connection you created
res <- sqlQuery(conn, "SELECT * FROM radio_sample_data")

# print out the column names in the query output
names(res) 
    
# print out the number of rows in the query output
nrow (res)

从 Python 进行连接Connect from Python

在本部分中,你将使用 Python IDE(例如 IDLE)来引用 Azure Databricks 中可用的数据。In this section, you use a Python IDE (such as IDLE) to reference data available in Azure Databricks. 在开始之前,请完成以下先决条件:Before you begin, complete the following prerequisites:

  • 此处安装 Python。Install Python from here. 从该链接安装 Python,还要安装 IDLE。Installing Python from this link also installs IDLE.

  • 在计算机上的命令提示符下,安装 pyodbc 包。From a command prompt on the computer, install the pyodbc package. 运行以下命令:Run the following command:

    pip install pyodbc
    

打开 IDLE 并执行以下步骤:Open IDLE and do the following steps:

  • 导入 pyodbc 包。Import the pyodbc package. 这使得你可以使用之前创建的 DSN 连接到 Azure Databricks。This enables you to connect to Azure Databricks using the DSN you created earlier.
  • 使用之前创建的 DSN 建立连接。Establish a connection using the DSN you created earlier.
  • 使用所创建的连接运行 SQL 查询。Run a SQL query using the connection you created. 在以下代码片段中, radio_sample_data 是 Azure Databricks 中已存在的表。In the following snippet, radio_sample_data is a table that already exists in Azure Databricks.
  • 对查询执行操作来验证输出。Perform operations on the query to verify the output.

下面的代码片段执行以下任务:The following code snippet performs these tasks:

# import the `pyodbc` package:
import pyodbc

# establish a connection using the DSN you created earlier
conn = pyodbc.connect("DSN=<ENTER DSN NAME HERE>", autocommit=True)

# run a SQL query using the connection you created
cursor = conn.cursor()
cursor.execute("SELECT * FROM radio_sample_data")

# print the rows retrieved by the query.
for row in cursor.fetchall():
    print(row)

后续步骤Next steps