创建表Create Table

Create Table UsingCreate Table Using

CREATE [OR REPLACE] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING data_source
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

使用数据源创建表。Create a table using a data source. 如果数据库中已存在同名的表,则会引发异常。If a table with the same name already exists in the database, an exception is thrown.

OR REPLACE

如果已存在同名的表,则会使用新的配置替换该表。If a table with the same name already exists, the table is replaced with the new configuration.

备注

此语法在 Databricks Runtime 7.0 及更高版本中可用。This syntax is available in Databricks Runtime 7.0 and above. 强烈建议使用 REPLACE,而不是删除并重新创建表。We strongly recommend using REPLACE instead of dropping and re-creating tables.

IF NOT EXISTS

如果数据库中已存在同名的表,则不会执行任何操作。If a table with the same name already exists in the database, nothing will happen.

USING data_source

将要用于表的文件格式。The file format to use for the table. data_source 必须是 TEXTCSVJSONJDBCPARQUETORCHIVEDELTALIBSVM 中的一个,或 org.apache.spark.sql.sources.DataSourceRegister 的自定义实现的完全限定的类名。data_source must be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or a fully-qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister.

支持使用 HIVE 创建 Hive SerDe 表。HIVE is supported to create a Hive SerDe table. 你可以使用 OPTIONS 子句指定 Hive 特定的 file_formatrow_format,这是不区分大小写的字符串映射。You can specify the Hive-specific file_format and row_format using the OPTIONS clause, which is a case-insensitive string map. 选项键为 FILEFORMATINPUTFORMATOUTPUTFORMATSERDEFIELDDELIMESCAPEDELIMMAPKEYDELIMLINEDELIMThe option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM.

OPTIONS

用于优化表的行为或配置 HIVE 表的表选项。Table options used to optimize the behavior of the table or configure HIVE tables.

备注

Delta Lake 不支持此子句。This clause is not supported by Delta Lake.

PARTITIONED BY (col_name1, col_name2, ...)

按指定的列对创建的表进行分区。Partition the created table by the specified columns. 将为每个分区创建一个目录。A directory is created for each partition.

CLUSTERED BY col_name3, col_name4, ...)

所创建的表中的每个分区将按指定列拆分为固定数目的 Bucket。Each partition in the created table will be split into a fixed number of buckets by the specified columns. 这通常与分区操作配合使用,以便读取和无序处理较少的数据。This is typically used with partitioning to read and shuffle less data.

LOCATION path

用于存储表数据的目录。The directory to store the table data. 此子句自动隐含 EXTERNALThis clause automatically implies EXTERNAL.

(Azure Databricks 上的 Delta Lake)当指定的 LOCATION 已包含 Delta Lake 中存储的数据时,Delta Lake 会执行以下操作:(Delta Lake on Azure Databricks) When you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:

  • 如果仅指定了表名称和位置,例如:If you specify only the table name and location, for example:

    CREATE TABLE events
      USING DELTA
      LOCATION '/mnt/delta/events'
    

    Hive 元存储中的表会自动继承现有数据的架构、分区和表属性。the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. 此功能可用于将数据“导入”到元存储中。This functionality can be used to “import” data into the metastore.

  • 如果你指定了任何配置(架构、分区或表属性),则 Delta Lake 会验证指定的内容是否与现有数据的配置完全匹配。If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the specification exactly matches the configuration of the existing data.

警告

如果指定的配置与数据的配置并非完全匹配,则 Delta Lake 会引发一个描述差异的异常。If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that describes the discrepancy.

AS select_statement

使用来自 SELECT 语句的输入数据填充该表。Populate the table with input data from the SELECT statement. 这不能包含列列表。This cannot contain a column list.

示例Examples

CREATE TABLE boxes (width INT, length INT, height INT) USING CSV

CREATE TABLE boxes
  (width INT, length INT, height INT)
  USING PARQUET
  OPTIONS ('compression'='snappy')

CREATE TABLE rectangles
  USING PARQUET
  PARTITIONED BY (width)
  CLUSTERED BY (length) INTO 8 buckets
  AS SELECT * FROM boxes

-- CREATE a HIVE SerDe table using the CREATE TABLE USING syntax.
CREATE TABLE my_table (name STRING, age INT, hair_color STRING)
  USING HIVE
  OPTIONS(
      INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat',
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat',
      SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe')
  PARTITIONED BY (hair_color)
  TBLPROPERTIES ('status'='staging', 'owner'='andrew')

使用 Hive 格式创建表Create Table with Hive format

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION path]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

row_format:
  : SERDE serde_cls [WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]
  | DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
      [COLLECTION ITEMS TERMINATED BY char]
      [MAP KEYS TERMINATED BY char]
      [LINES TERMINATED BY char]
      [NULL DEFINED AS char]

file_format:
  : TEXTFILE | SEQUENCEFILE | RCFILE | ORC | PARQUET | AVRO
  | INPUTFORMAT input_fmt OUTPUTFORMAT output_fmt

使用 Hive 格式创建一个表。Create a table using the Hive format. 如果数据库中已存在同名的表,则会引发异常。If a table with the same name already exists in the database, an exception will be thrown. 以后删除该表时,将从文件系统中删除此表中的数据。When the table is dropped later, its data will be deleted from the file system.

备注

仅当启用了 Hive 支持时,才支持此命令。This command is supported only when Hive support is enabled.

EXTERNAL

该表使用通过 LOCATION 指定的自定义目录。The table uses the custom directory specified with LOCATION. 对表的查询将访问以前存储在目录中的现有数据。Queries on the table access existing data previously stored in the directory. 删除 EXTERNAL 表时,不会从文件系统中删除其数据。When an EXTERNAL table is dropped, its data is not deleted from the file system. 如果指定了 LOCATION,则隐含此标志。This flag is implied if LOCATION is specified.

IF NOT EXISTS

如果数据库中已存在同名的表,则不会执行任何操作。If a table with the same name already exists in the database, nothing will happen.

PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)

请按指定的列对表进行分区。Partition the table by the specified columns. 这组列必须不同于非分区列的组。This set of columns must be distinct from the set of non-partitioned columns. 不能通过 AS select_statement 指定已分区的列。You cannot specify partitioned columns with AS select_statement.

ROW FORMAT

使用 SERDE 子句为此表指定自定义 SerDe。Use the SERDE clause to specify a custom SerDe for this table. 否则,请使用 DELIMITED 子句来使用本机 SerDe,并指定分隔符、转义字符和空字符等。Otherwise, use the DELIMITED clause to use the native SerDe and specify the delimiter, escape character, null character, and so on.

STORED AS file_format

指定此表的文件格式。Specify the file format for this table. 可用格式包括 TEXTFILESEQUENCEFILERCFILEORCPARQUETAVROAvailable formats include TEXTFILE, SEQUENCEFILE, RCFILE, ORC, PARQUET, and AVRO. 或者,你可以通过 INPUTFORMATOUTPUTFORMAT 指定你自己的输入和输出格式。Alternatively, you can specify your own input and output formats through INPUTFORMAT and OUTPUTFORMAT. 只能将格式 TEXTFILESEQUENCEFILERCFILE 用于 ROW FORMAT SERDE,并且只能将 TEXTFILE 用于 ROW FORMAT DELIMITEDOnly formats TEXTFILE, SEQUENCEFILE, and RCFILE can be used with ROW FORMAT SERDE and only TEXTFILE can be used with ROW FORMAT DELIMITED.

LOCATION path

用于存储表数据的目录。The directory to store the table data. 此子句自动隐含 EXTERNALThis clause automatically implies EXTERNAL.

AS select_statement

使用来自 select 语句的输入数据填充该表。Populate the table with input data from the select statement. 不能通过 PARTITIONED BY 指定此项。You cannot specify this with PARTITIONED BY.

数据类型 Data types

Spark SQL 支持以下数据类型:Spark SQL supports the following data types:

  • 数字类型Numeric types
    • ByteType:表示 1 个字节的带符号整数。ByteType: Represents 1-byte signed integer numbers. 数字范围是从 -128127The range of numbers is from -128 to 127.
    • ShortType:表示 2 个字节的带符号整数。ShortType: Represents 2-byte signed integer numbers. 数字范围是从 -3276832767The range of numbers is from -32768 to 32767.
    • IntegerType:表示 4 个字节的带符号整数。IntegerType: Represents 4-byte signed integer numbers. 数字范围是从 -21474836482147483647The range of numbers is from -2147483648 to 2147483647.
    • LongType:表示 8 个字节的带符号整数。LongType: Represents 8-byte signed integer numbers. 数字范围是从 -92233720368547758089223372036854775807The range of numbers is from -9223372036854775808 to 9223372036854775807.
    • FloatType:表示 4 个字节的单精度浮点数。FloatType: Represents 4-byte single-precision floating point numbers.
    • DoubleType:表示 8 个字节的双精度浮点数。DoubleType: Represents 8-byte double-precision floating point numbers.
    • DecimalType:表示任意精度的带符号十进制数字。DecimalType: Represents arbitrary-precision signed decimal numbers. java.math.BigDecimal 在内部提供支持。Backed internally by java.math.BigDecimal. BigDecimal 由一个任意精度的非标度整数值和一个 32 位整数标度构成。A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
  • 字符串类型:StringType:表示字符串值。String type: StringType: Represents character string values.
  • 二进制类型:BinaryType:表示字节序列值。Binary type: BinaryType: Represents byte sequence values.
  • 布尔类型:BooleanType:表示布尔值。Boolean type: BooleanType: Represents boolean values.
  • 日期/时间类型Datetime types
    • TimestampType:表示由字段 year、month、day、hour、minute 和 second 的值构成的值,使用会话本地时区。TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time zone. 时间戳值表示绝对时间点。The timestamp value represents an absolute point in time.
    • DateType:表示由字段 year、month 和 day 的值构成的值,不包含时区。DateType: Represents values comprising values of fields year, month and day, without a time-zone.
  • 复杂类型Complex types
    • ArrayType(elementType, containsNull):表示由 elementType 类型的元素序列构成的值。ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull 用于指示 ArrayType 值中的元素是否可以具有 null 值。containsNull is used to indicate if elements in a ArrayType value can have null values.
    • MapType(keyType, valueType, valueContainsNull):表示由一组键值对构成的值。MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. 键的数据类型由 keyType 描述,而值的数据类型由 valueType 描述。The data type of keys is described by keyType and the data type of values is described by valueType. 对于 MapType 值,不允许键具有 null 值。For a MapType value, keys are not allowed to have null values. valueContainsNull 用于指示 MapType 值的值是否可以具有 null 值。valueContainsNull is used to indicate if values of a MapType value can have null values.
    • StructType(fields):表示多个值,其结构通过一系列 StructField (fields) 来描述。StructType(fields): Represents values with the structure described by a sequence of StructField (fields).
    • StructField(name, dataType, nullable):表示 StructType 中的字段。StructField(name, dataType, nullable): Represents a field in a StructType. 字段的名称由 name 指示。The name of a field is indicated by name. 字段的数据类型由 dataType 指示。The data type of a field is indicated by dataType. nullable 用于指示这些字段的值是否可以具有 null 值。nullable is used to indicate if values of these fields can have null values.

下表显示了每个数据类型的类型名称和别名。The following table shows the type names and aliases for each data type.

数据类型Data type SQL 名称SQL name
BooleanType BOOLEAN
ByteType BYTE, TINYINTBYTE, TINYINT
ShortType SHORT, SMALLINTSHORT, SMALLINT
IntegerType INT, INTEGERINT, INTEGER
LongType LONG, BIGINTLONG, BIGINT
FloatType FLOAT, REALFLOAT, REAL
DoubleType DOUBLE
DateType DATE
TimestampType TIMESTAMP
StringType STRING
BinaryType BINARY
DecimalType DECIMAL, DEC, NUMERICDECIMAL, DEC, NUMERIC
CalendarIntervalType INTERVAL
ArrayType ARRAY<element_type>
StructType STRUCT<field1_name: field1_type, field2_name: field2_type, ...>
MapType MAP<key_type, value_type>

示例Examples

CREATE TABLE my_table (name STRING, age INT)

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table is partitioned'
  PARTITIONED BY (hair_color STRING COMMENT 'This is a column comment')
  TBLPROPERTIES ('status'='staging', 'owner'='andrew')

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table specifies a custom SerDe'
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

CREATE TABLE my_table (name STRING, age INT)
  COMMENT 'This table uses the CSV format'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE

CREATE TABLE your_table
  COMMENT 'This table is created with existing data'
  AS SELECT * FROM my_table

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
  COMMENT 'This table is created with existing data'
  LOCATION 'spark-warehouse/tables/my_existing_table'

Create Table LikeCreate Table Like

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name1 LIKE [db_name.]table_name2 [LOCATION path]

使用现有表或视图的定义/元数据创建托管表。Create a managed table using the definition/metadata of an existing table or view. 创建的表始终在默认仓库位置中使用其自己的目录。The created table always uses its own directory in the default warehouse location.

备注

Delta Lake 不支持 CREATE TABLE LIKEDelta Lake does not support CREATE TABLE LIKE. 请改用 CREATE TABLE ASInstead use CREATE TABLE AS. 请参阅 ASSee AS.