Databases and tables

An Azure Databricks database is a collection of tables. An Azure Databricks table is a collection of structured data. You can cache, filter, and perform any operations supported by Apache Spark DataFrames on Azure Databricks tables. You can query tables with Spark APIs and Spark SQL.
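
For example, a minimal sketch in Python (assuming a Databricks notebook, where spark and display are predefined, and the diamonds table created later on this page):

df = spark.table("diamonds")                 # load the table as a DataFrame
df.cache()                                   # cache it for repeated access
display(df.filter(df["cut"] == "Premium"))   # filter it like any DataFrame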

There are two types of tables: global and local. A global table is available across all clusters. Azure Databricks registers global tables either to the Azure Databricks Hive metastore or to an external Hive metastore. For details about Hive support, see Apache Hive compatibility. A local table is not accessible from other clusters and is not registered in the Hive metastore; it is also known as a temporary view.

You can create a table using the Create Table UI or programmatically. A table can be populated from files in DBFS or data stored in any of the supported data sources.

Requirements

To view and create databases and tables, you must have a running cluster.

View databases and tables

Click the Data icon in the sidebar. Azure Databricks selects a running cluster to which you have access. The Databases folder displays the list of databases with the default database selected. The Tables folder displays the list of tables in the default database.

Create tables list

You can change the cluster from the Databases menu, the Create Table UI, or the View Table UI. For example, from the Databases menu:

  1. Click the down caret at the top of the Databases folder.

  2. Select a cluster.

    Select cluster

Create a database

To create a database in SQL:

CREATE DATABASE <database-name> ...

For more options, see Create Database.
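
A hedged example with common clauses (IF NOT EXISTS, COMMENT, and LOCATION are standard Spark SQL options; the name and path are placeholders), issued here from a Python notebook:

spark.sql("""
  CREATE DATABASE IF NOT EXISTS <database-name>
  COMMENT 'An example database'
  LOCATION '<storage-path>'
""")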

Create a table

You can create a table using the UI or programmatically.

Create a table using the UI

Note

Admin users can disable this feature. See Manage data upload.

When you create a table using the UI, you create a global table.

  1. Click the Data icon in the sidebar. The Databases and Tables folders display.

  2. In the Databases folder, select a database.

  3. Above the Tables folder, click Add Data.

    Add Table icon

  4. Choose a data source and follow the steps to configure the table.

    Important

    Table names can contain only lowercase alphanumeric characters and underscores, and must start with a lowercase letter or underscore.

    Configure table

    Upload File

    1. Drag files to the File dropzone, or click the dropzone to browse to and choose files. After upload, a path displays for each file. The path will be something like /FileStore/tables/<filename>-<random-number>.<file-type>, and you use this path in a notebook to read data.

      File dropzone

    2. Click Create Table with UI.

    3. In the Cluster drop-down, choose a cluster.

    DBFS

    1. Select a file.
    2. Click Create Table with UI.
    3. In the Cluster drop-down, choose a cluster.
  5. Click Preview Table to view the table.

  6. In the Table Name field, optionally override the default table name.

  7. In the Create in Database field, optionally override the selected default database.

  8. In the File Type field, optionally override the inferred file type.

  9. If the file type is CSV:

    1. In the Column Delimiter field, select whether to override the inferred delimiter.
    2. Indicate whether to use the first row as the column titles.
    3. Indicate whether to infer the schema.
  10. If the file type is JSON, indicate whether the file is multi-line.

  11. Click Create Table.

Create a table in a notebook

In the Create New Table UI, you can use quickstart notebooks provided by Azure Databricks to connect to any data source.

  • DBFS: Click Create Table in Notebook.
  • Other data sources: In the Connector drop-down, select a data source type. Then click Create Table in Notebook.

Create a table programmatically

This section describes how to create global and local tables programmatically.

Create a global table

To create a global table in SQL:

CREATE TABLE <table-name> ...

For more options, see Create Table.

To create a global table from a DataFrame in Python or Scala:

dataFrame.write.saveAsTable("<table-name>")

Create a local table

To create a local table from a DataFrame in Python or Scala:

dataFrame.createOrReplaceTempView("<table-name>")

Here is an example that creates a local table called diamonds from a file in Databricks File System (DBFS):

# Path to the source CSV file in DBFS
file_path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
spark.read.format("csv").option("header", "true")\
  .option("inferSchema", "true").load(file_path)\
  .createOrReplaceTempView("diamonds")

Access a table

You can view table details and read, update, and delete a table.

View table details

The table details view shows the table schema and sample data.

  1. Click the Data icon in the sidebar.

  2. In the Databases folder, click a database.

  3. In the Tables folder, click the table name.

  4. In the Cluster drop-down, optionally select another cluster to render the table preview.

    Table details

    Note

    To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the cluster already has a workload running on it, the table preview may take longer to load.

Query a table

These examples show you how to query and display a table called diamonds.

SQL

SELECT * FROM diamonds

Python

diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))

diamonds = spark.table("diamonds")
display(diamonds.select("*"))

R

diamonds <- sql(sqlContext, "select * from diamonds")
display(diamonds)

diamonds <- table(sqlContext, "diamonds")
display(diamonds)

Scala

val diamonds = spark.sql("select * from diamonds")
display(diamonds.select("*"))

val diamonds = spark.table("diamonds")
display(diamonds.select("*"))

Update a table

The table schema is immutable. However, you can update table data by changing the underlying files.

For example, for tables created from a storage directory, adding or removing files in that directory changes the contents of the table.

After updating the files underlying a table, refresh the table using the following command:

REFRESH TABLE <table-name>

This ensures that when you access the table, Spark SQL reads the correct files even if the underlying files change.
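
From Python, the same refresh can be issued through the Spark catalog API or spark.sql; a minimal sketch:

spark.catalog.refreshTable("<table-name>")      # equivalent to REFRESH TABLE
# or: spark.sql("REFRESH TABLE <table-name>")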

Delete a table

Delete a table using the UI

  1. Click the Data icon in the sidebar.
  2. Click the menu drop-down next to the table name and select Delete.

Delete a table programmatically

DROP TABLE <table-name>
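
The same drop can be issued from a Python notebook; a minimal sketch (IF EXISTS is optional and avoids an error if the table has already been dropped):

spark.sql("DROP TABLE IF EXISTS <table-name>")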

Managed and unmanaged tables

Every Spark SQL table has metadata information that stores the schema, as well as the data itself.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the table, DROP TABLE <example-table> deletes both the metadata and the data.

Some common ways of creating a managed table are:

SQL

CREATE TABLE <example-table>(id STRING, value STRING)

Python

dataframe.write.saveAsTable("<example-table>")

Another option is to let Spark SQL manage the metadata while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

You can create an unmanaged table with your data in data sources such as Cassandra and JDBC tables. See Data sources for more information about the data sources supported by Databricks. Some common ways of creating an unmanaged table are:

SQL

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")

Python

dataframe.write.option("path", "<your-storage-path>").saveAsTable("<example-table>")
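
Unmanaged tables are not limited to file paths. As a hedged sketch for a JDBC source (the URL, table name, and credentials are placeholders; url and dbtable are standard Spark JDBC options), run from a Python notebook:

spark.sql("""
  CREATE TABLE <example-table>
  USING org.apache.spark.sql.jdbc
  OPTIONS (url '<jdbc-url>', dbtable '<source-table>', user '<user>', password '<password>')
""")

To check whether an existing table is managed or unmanaged, DESCRIBE EXTENDED reports MANAGED or EXTERNAL in the Type row of its output (the exact layout varies by Spark version):

display(spark.sql("DESCRIBE EXTENDED <example-table>"))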

Replace table contents

There are two main approaches to replacing table contents: simple and recommended.

Simple way to replace table contents

The simplest way to replace table contents is to delete the table metadata and data and create another table.

Managed table

DROP TABLE IF EXISTS <example-table>     -- deletes the metadata and data
CREATE TABLE <example-table> AS SELECT ...

Unmanaged table

DROP TABLE IF EXISTS <example-table>         -- deletes the metadata
dbutils.fs.rm("<your-storage-path>", true)   // deletes the data (dbutils call, not SQL)
CREATE TABLE <example-table> USING org.apache.spark.sql.parquet OPTIONS (path "<your-storage-path>") AS SELECT ...

An alternative is to:

  1. Create the table using the SQL DDL:

    CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"
    
  2. Store new data in <storage-location>.

  3. Run REFRESH TABLE <table-name>. Steps 2 and 3 are sketched below.
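
A sketch of steps 2 and 3 from a Python notebook, assuming the table from step 1 and a DataFrame named dataframe holding the new data:

dataframe.write.mode("overwrite").parquet("<storage-location>")  # step 2: store new data at the table location
spark.sql("REFRESH TABLE <table-name>")                          # step 3: pick up the new files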

Recommended way to replace table contents

To avoid potential consistency issues, the best approach to replacing table contents is to overwrite the table.

Python
dataframe.write.mode("overwrite").saveAsTable("<example-table>")  # managed overwrite
dataframe.write.mode("overwrite").option("path", "<your-storage-path>").saveAsTable("<example-table>")  # unmanaged overwrite
SQL

Use the INSERT OVERWRITE keyword. This method applies to managed and unmanaged tables. For example, for an unmanaged table:

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")
INSERT OVERWRITE TABLE <example-table> SELECT ...
Scala
import org.apache.spark.sql.SaveMode

dataframe.write.mode(SaveMode.Overwrite).saveAsTable("<example-table>")    // managed overwrite
dataframe.write.mode(SaveMode.Overwrite).option("path", "<your-storage-path>").saveAsTable("<example-table>")  // unmanaged overwrite

Partitioned tables

Spark SQL is able to generate partitions dynamically at the file storage level to provide partition columns for tables.

Create a partitioned table

These examples partition data that you write. Spark SQL discovers the partitions and registers them in the Hive metastore.

import org.apache.spark.sql.SaveMode

// Create managed table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("<example-table>")

// Create unmanaged/external table as select
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").option("path", "<file-path>").saveAsTable("<example-table>")

However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM <example-table> does not return results. To register the partitions, run the following command to generate them: MSCK REPAIR TABLE <example-table>.

import org.apache.spark.sql.SaveMode

// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("<file-path>")

// Create unmanaged/external table, then register its partitions
spark.sql("CREATE TABLE <example-table>(id STRING, value STRING) USING parquet PARTITIONED BY (id) LOCATION '<file-path>'")
spark.sql("MSCK REPAIR TABLE <example-table>")

Partition pruning

When the table is scanned, Spark pushes down the filter predicates that involve the partitionBy keys. In that case, Spark avoids reading data that does not satisfy those predicates. For example, suppose you have a table <example-data> that is partitioned by <date>. A query such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10' reads only the data files containing tuples whose date value matches the one specified in the query.
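
A sketch of how to see pruning in action from Python (placeholders as above; the exact plan text varies by Spark version): explain() prints the physical plan, and the file scan's PartitionFilters entry shows the pushed-down predicate.

spark.sql("SELECT max(id) FROM <example-data> WHERE date = '2010-10-10'").explain()
# In the printed plan, look for a PartitionFilters entry containing (date = 2010-10-10)
# on the scan node -- only matching partition directories are read.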

Table access control

Table access control allows admins and users to give fine-grained access to other users. See Data object privileges for details.