Apache Hive compatibility

Applies to: check marked yes Databricks Runtime

Apache Spark SQL in Azure Databricks is designed to be compatible with the Apache Hive, including metastore connectivity, SerDes, and UDFs.

SerDes and UDFs

Hive SerDes and UDFs are based on Hive 1.2.1.

Metastore connectivity

See External Apache Hive metastore (legacy) for information on how to connect Azure Databricks to an externally hosted Hive metastore.

Supported Hive features

Spark SQL supports the vast majority of Hive features, such as:

  • Hive query statements, including:
    • SELECT
    • GROUP BY
    • ORDER BY
    • CLUSTER BY
    • SORT BY
  • All Hive expressions, including:
    • Relational expressions (=, , ==, <>, <, >, >=, <=, etc)
    • Arithmetic expressions (+, -, *, /, %, etc)
    • Logical expressions (AND, &&, OR, ||, etc)
    • Complex type constructors
    • Mathematical expressions (sign, ln, cos, etc)
    • String expressions (instr, length, printf, etc)
  • User defined functions (UDF)
  • User defined aggregation functions (UDAF)
  • User defined serialization formats (SerDes)
  • Window functions
  • Joins
    • JOIN
    • {LEFT|RIGHT|FULL} OUTER JOIN
    • LEFT SEMI JOIN
    • CROSS JOIN
  • Unions
  • Sub-queries
    • SELECT col FROM ( SELECT a + b AS col from t1) t2
  • Sampling
  • Explain
  • Partitioned tables including dynamic partition insertion
  • View
  • Vast majority of DDL statements, including:
    • CREATE TABLE
    • CREATE TABLE AS SELECT
    • ALTER TABLE
  • Most Hive data types, including:
    • TINYINT
    • SMALLINT
    • INT
    • BIGINT
    • BOOLEAN
    • FLOAT
    • DOUBLE
    • STRING
    • BINARY
    • TIMESTAMP
    • DATE
    • ARRAY<>
    • MAP<>
    • STRUCT<>

Unsupported Hive functionality

The following sections contain a list of Hive features that Spark SQL doesn't support. Most of these features are rarely used in Hive deployments.

Major Hive features

  • Writing to bucketed table created by Hive
  • ACID fine-grained updates

Esoteric Hive features

  • Union type
  • Unique join
  • Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment and only supports populating the sizeInBytes field of the Hive metastore

Hive input and output formats

  • File format for CLI: For results showing back to the CLI, Spark SQL supports only TextOutputFormat
  • Hadoop archive

Hive optimizations

A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model.

  • Block level bitmap indexes and virtual columns (used to build indexes).
  • Automatically determine the number of reducers for joins and groupbys: In Spark SQL, you need to control the degree of parallelism post-shuffle using SET spark.sql.shuffle.partitions=[num_tasks];.
  • Skew data flag: Spark SQL does not follow the skew data flag in Hive.
  • STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
  • Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.