Apache Hive compatibility

Apache Spark SQL in Azure Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs.

SerDes and UDFs

Hive SerDes and UDFs are based on Hive 1.2.1.
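For example, a Hive UDF packaged in a JAR can be registered and called directly from Spark SQL using Hive's `CREATE FUNCTION` syntax. The function name, class name, JAR path, and table below are hypothetical placeholders, not part of any shipped library:

```sql
-- Register a Hive UDF from a JAR (class name and path are placeholder examples)
CREATE FUNCTION my_lower AS 'com.example.hive.udf.MyLower'
  USING JAR '/path/to/my_udfs.jar';

-- Call it like any built-in function
SELECT my_lower(name) FROM people;
```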

Metastore connectivity

See External Apache Hive metastore for information on how to connect Azure Databricks to an externally hosted Hive metastore.
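As a rough sketch, the connection is driven by Spark configuration on the cluster; the JDBC URL, driver, and credentials below are placeholder values that depend on your metastore database (see the linked article for the authoritative settings):

```ini
# Metastore database connection (placeholder values)
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<host>:3306/metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName <user>
spark.hadoop.javax.jdo.option.ConnectionPassword <password>

# Hive metastore version and client JARs
spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars builtin
```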

Supported Hive features

Spark SQL supports the vast majority of Hive features, such as:

  • Hive query statements, including:
    • SELECT
    • GROUP BY
    • ORDER BY
    • CLUSTER BY
    • SORT BY
  • All Hive expressions, including:
    • Relational expressions (=, ==, <>, <, >, >=, <=, etc.)
    • Arithmetic expressions (+, -, *, /, %, etc.)
    • Logical expressions (AND, &&, OR, ||, etc.)
    • Complex type constructors
    • Mathematical expressions (sign, ln, cos, etc.)
    • String expressions (instr, length, printf, etc.)
  • User-defined functions (UDF)
  • User-defined aggregate functions (UDAF)
  • User-defined serialization formats (SerDes)
  • Window functions
  • Joins
    • JOIN
    • {LEFT|RIGHT|FULL} OUTER JOIN
    • LEFT SEMI JOIN
    • CROSS JOIN
  • Unions
  • Sub-queries
    • SELECT col FROM (SELECT a + b AS col FROM t1) t2
  • Sampling
  • Explain
  • Partitioned tables, including dynamic partition insertion
  • Views
  • Vast majority of DDL statements, including:
    • CREATE TABLE
    • CREATE TABLE AS SELECT
    • ALTER TABLE
  • Most Hive data types, including:
    • TINYINT
    • SMALLINT
    • INT
    • BIGINT
    • BOOLEAN
    • FLOAT
    • DOUBLE
    • STRING
    • BINARY
    • TIMESTAMP
    • DATE
    • ARRAY<>
    • MAP<>
    • STRUCT<>
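Several of the features above can be combined in ordinary Spark SQL. The sketch below, using hypothetical table and column names, shows a dynamic partition insert into a partitioned table followed by a query that mixes GROUP BY, ORDER BY, and a window function:

```sql
-- Dynamic partition insertion into a partitioned table
-- (hive.exec.dynamic.partition.mode follows Hive's convention)
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE sales (item STRING, amount DOUBLE)
  PARTITIONED BY (sale_date STRING);

INSERT INTO sales PARTITION (sale_date)
  SELECT item, amount, sale_date FROM staged_sales;

-- Aggregation plus a window function over the aggregate
SELECT item,
       SUM(amount) AS total,
       RANK() OVER (ORDER BY SUM(amount) DESC) AS sales_rank
FROM sales
GROUP BY item
ORDER BY total DESC;
```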

Unsupported Hive functionality

The following sections contain a list of Hive features that Spark SQL doesn't support. Most of these features are rarely used in Hive deployments.

Major Hive features

  • Writing to bucketed tables created by Hive
  • ACID fine-grained updates

Esoteric Hive features

  • Union type
  • Unique join
  • Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment, and only supports populating the sizeInBytes field of the Hive metastore
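Because statistics are not collected as a side effect of scans, table-level statistics are gathered explicitly. A minimal sketch, assuming a table named `sales`:

```sql
-- Scan the table and populate table-level statistics (e.g. sizeInBytes)
-- in the metastore; column-level statistics are not collected here.
ANALYZE TABLE sales COMPUTE STATISTICS;

-- The NOSCAN variant updates size information without reading the data.
ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN;
```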

Hive input and output formats

  • File format for CLI: for results showing back to the CLI, Spark SQL supports only TextOutputFormat
  • Hadoop archive

Hive optimizations

A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model.

  • Block-level bitmap indexes and virtual columns (used to build indexes).
  • Automatically determining the number of reducers for joins and groupbys: in Spark SQL, you need to control the degree of post-shuffle parallelism using SET spark.sql.shuffle.partitions=[num_tasks];.
  • Skew data flag: Spark SQL does not follow the skew data flag in Hive.
  • STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
  • Merging multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge them into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
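For example, where Hive would pick a reducer count automatically, in Spark SQL the shuffle parallelism is set explicitly per session before running the join or aggregation. The table names below are hypothetical and the value 200 is only illustrative:

```sql
-- Explicitly set post-shuffle parallelism; Spark SQL does not infer it
SET spark.sql.shuffle.partitions = 200;

SELECT t1.key, COUNT(*) AS cnt
FROM t1 JOIN t2 ON t1.key = t2.key
GROUP BY t1.key;
```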