How to run SQL queries from Python scripts

You may want to access your tables outside of Azure Databricks notebooks. Besides connecting BI tools via JDBC, you can also access tables from Python scripts: connect to a Spark cluster via JDBC using PyHive and then run a script. PyHive must be installed on the machine where you run the Python script.

Note

Python 2 is considered end-of-life. You should use Python 3 to run the script provided in this article. If you have both Python 2 and Python 3 on your system, make sure your version of pip is linked to Python 3 before you proceed.

You can check your version of pip by running pip -V at the command prompt. This command returns the version of pip and the version of Python it is using.
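For example, on a machine where pip is linked to Python 3, the output looks something like the following (the exact versions and paths will vary):

   pip 21.3.1 from /usr/lib/python3/dist-packages/pip (python 3.9)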

Install PyHive and Thrift

Use pip to install PyHive and Thrift.

   pip install pyhive thrift
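
If pip on your system is still linked to Python 2, you can usually invoke the Python 3 version explicitly instead (assuming your system provides a pip3 command):

   pip3 install pyhive thrift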

Run the SQL script

This sample Python script sends the SQL query show tables to your cluster and then displays the result of the query.

Before you run the script, do the following:

  1. Replace <token> with your Azure Databricks API token.
  2. Replace <databricks-instance> with the domain name of your Databricks deployment.
  3. Replace <workspace-id> with the workspace ID.
  4. Replace <cluster-id> with a cluster ID.

To get the API token, see Generate a token. To determine the other values, see How to get Workspace, Cluster, Notebook, and Job Details.
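With the placeholders filled in, the connection URL that the script builds looks like the following (the instance name, workspace ID, and cluster ID shown here are hypothetical):

   https://adb-1234567890123456.7.azuredatabricks.net/sql/protocolv1/o/1234567890123456/0123-456789-abc123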

#!/usr/bin/python

import sys
import base64

from pyhive import hive
from thrift.transport import THttpClient

TOKEN = "<token>"
WORKSPACE_URL = "<databricks-instance>"
WORKSPACE_ID = "<workspace-id>"
CLUSTER_ID = "<cluster-id>"

# Build the Thrift-over-HTTP endpoint for the target cluster.
conn = 'https://%s/sql/protocolv1/o/%s/%s' % (WORKSPACE_URL, WORKSPACE_ID, CLUSTER_ID)
print(conn)

transport = THttpClient.THttpClient(conn)

# Databricks accepts the API token as HTTP basic auth with the
# literal user name "token".
auth = "token:%s" % TOKEN
PY_MAJOR = sys.version_info[0]

# base64 handles str/bytes differently on Python 2 and Python 3.
if PY_MAJOR < 3:
    auth = base64.standard_b64encode(auth)
else:
    auth = base64.standard_b64encode(auth.encode()).decode()

transport.setCustomHeaders({"Authorization": "Basic %s" % auth})

# Open a Hive connection over the authenticated transport.
cursor = hive.connect(thrift_transport=transport).cursor()

cursor.execute('show tables')
for table in cursor.fetchall():
    print(table)
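
Each row returned by show tables is a tuple; on Databricks it typically contains the database name, the table name, and a flag indicating whether the table is temporary. Once the connection works, you can execute any SQL statement the cluster supports through the same cursor. A minimal sketch, assuming a table named diamonds exists in the default database (the table name is hypothetical):

cursor.execute('SELECT * FROM diamonds LIMIT 5')
for row in cursor.fetchall():
    print(row)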