CREATE TABLE USING

Defines a table using a data source.

Syntax

CREATE TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
    USING data_source
    [ OPTIONS ( key1=val1, key2=val2, ... ) ]
    [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
    [ CLUSTERED BY ( col_name3, col_name4, ... )
        [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
        INTO num_buckets BUCKETS ]
    [ LOCATION path ]
    [ COMMENT table_comment ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]

The clauses between the USING clause and the AS SELECT clause can appear in any order. For example, you can write COMMENT table_comment after TBLPROPERTIES.

Parameters

  • table_identifier

    A table name, optionally qualified with a database name.

    Syntax: [database_name.] table_name

  • USING data_source

    The file format to use for the table. data_source must be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister. When data_source is DELTA, see the additional options in Create Delta table.

    HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM. (See the sketch after this parameter list.)

  • PARTITIONED BY

    Partitions the table by the specified columns.

  • CLUSTERED BY

    Buckets the partitions created on the table into a fixed number of buckets based on the specified columns. This clause is not supported by Delta Lake.

    Note

    Bucketing is an optimization technique that uses buckets (and bucketing columns) to determine data partitioning and avoid data shuffle.

  • SORTED BY

    The order in which the data is stored in buckets. Default is ascending. This clause is not supported by Delta Lake.

  • LOCATION

    Path to the directory where table data is stored, which could be a path on distributed storage (see the sketch after this parameter list).

  • OPTIONS

    Table options used to optimize the behavior of the table or configure HIVE tables. This clause is not supported by Delta Lake.

  • COMMENT

    A string literal to describe the table.

  • TBLPROPERTIES

    A list of key-value pairs used to tag the table definition.

  • AS select_statement

    Populates the table using the data from select_statement.
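
Two sketches of the options above: a Hive SerDe table configured through the OPTIONS clause, and a table created over an existing location. The table names, option values, and path are illustrative, not defaults.

--Create a Hive SerDe table; the fileFormat and fieldDelim values are illustrative
CREATE TABLE hive_student (id INT, name STRING, age INT)
    USING HIVE
    OPTIONS (fileFormat 'textfile', fieldDelim ',');

--Create a table over an existing location; the path is illustrative
CREATE TABLE student_ext (id INT, name STRING, age INT)
    USING PARQUET
    LOCATION '/mnt/data/students';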

Create Delta table

In addition to the standard CREATE TABLE options, Delta tables support the options described in this section.

Note

Available in Databricks Runtime 7.0 and above.

CREATE [OR REPLACE] TABLE table_identifier
  [(col_name1 col_type1 [NOT NULL] [COMMENT col_comment1], ...)]
  USING DELTA
  [LOCATION <path-to-delta-files>]

Parameters

  • table_identifier

    • [database_name.] table_name: A table name, optionally qualified with a database name.
    • (Delta Lake) delta.`<path-to-table>`: Create a table at the specified path without creating an entry in the metastore (see the sketch after this list).
  • OR REPLACE

    If a table already exists, replace the table with the new configuration.

    Note

    Databricks strongly recommends using REPLACE instead of dropping and re-creating tables.

  • NOT NULL

    Indicates that a column value cannot be NULL. The default is to allow a NULL value. If specified, any change to the Delta table will check these NOT NULL constraints.

    For details, see NOT NULL constraint.

  • LOCATION

    If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:

    • If you specify only the table name and location, for example:

      CREATE TABLE events
        USING DELTA
        LOCATION '/mnt/delta/events'
      

      the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. This functionality can be used to “import” data into the metastore.

    • If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the specification exactly matches the configuration of the existing data.

    Warning

    If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that describes the discrepancy.
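
The sketch below combines the Delta-specific options above: OR REPLACE, a NOT NULL column, and the path-based delta.`<path-to-table>` form. The table names, columns, and path are illustrative.

-- Replace the managed Delta table if it already exists; id values must not be NULL.
-- The table name and columns are illustrative.
CREATE OR REPLACE TABLE events (
  id BIGINT NOT NULL,
  date DATE,
  eventType STRING)
USING DELTA;

-- Create a Delta table at a path without adding an entry to the metastore.
-- The path is illustrative.
CREATE TABLE delta.`/mnt/delta/people` (
  id INT,
  name STRING)
USING DELTA;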

Data source interaction

A data source table acts like a pointer to the underlying data source. For example, you can create a table foo in Azure Databricks that points to a table bar in MySQL using the JDBC data source. When you read and write table foo, you actually read and write table bar.
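
For instance, a minimal sketch of such a JDBC-backed table; the connection URL, user, and password below are placeholders:

--Create a table that points to a MySQL table named bar over JDBC
--The URL and credentials are placeholders
CREATE TABLE foo
    USING JDBC
    OPTIONS (
      url 'jdbc:mysql://<host>:3306/<database>',
      dbtable 'bar',
      user '<username>',
      password '<password>'
    );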

In general, CREATE TABLE creates a “pointer”, and you must make sure it points to something that exists. The exception is file sources such as Parquet and JSON. If you don’t specify a LOCATION, Azure Databricks creates a default table location.

For CREATE TABLE AS SELECT, Azure Databricks overwrites the underlying data source with the data of the input query, to make sure the created table contains exactly the same data as the input query.

Examples

--Use data source
CREATE TABLE student (id INT, name STRING, age INT) USING CSV;

--Use data from another table
CREATE TABLE student_copy USING CSV
    AS SELECT * FROM student;

--Omit the USING clause, which uses the default data source (parquet by default)
CREATE TABLE student (id INT, name STRING, age INT);

--Specify table comment and properties
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

--Create partitioned and bucketed table
CREATE TABLE student (id INT, name STRING, age INT)
    USING CSV
    PARTITIONED BY (age)
    CLUSTERED BY (id) INTO 4 BUCKETS;