Apache Hive compatibility
Applies to: Databricks Runtime
Apache Spark SQL in Azure Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs.
SerDes and UDFs
Hive SerDes and UDFs are based on Hive 1.2.1.
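As a sketch of how a Hive UDF can be used from Spark SQL (the jar path and class name here are hypothetical placeholders):

```sql
-- Register a permanent function backed by a Hive UDF class.
-- 'com.example.hive.udf.MyLower' and the jar path are hypothetical.
CREATE FUNCTION my_lower AS 'com.example.hive.udf.MyLower'
USING JAR '/path/to/my_udfs.jar';

-- Call it like any built-in function.
SELECT my_lower(name) FROM people;
```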
Metastore connectivity
See External Apache Hive metastore (legacy) for information on how to connect Azure Databricks to an externally hosted Hive metastore.
Supported Hive features
Spark SQL supports the vast majority of Hive features, such as:
- Hive query statements, including:
  - SELECT
  - GROUP BY
  - ORDER BY
  - CLUSTER BY
  - SORT BY
- All Hive expressions, including:
  - Relational expressions (=, <=>, ==, <>, <, >, >=, <=, etc.)
  - Arithmetic expressions (+, -, *, /, %, etc.)
  - Logical expressions (AND, &&, OR, ||, etc.)
  - Complex type constructors
  - Mathematical expressions (sign, ln, cos, etc.)
  - String expressions (instr, length, printf, etc.)
- User defined functions (UDF)
- User defined aggregation functions (UDAF)
- User defined serialization formats (SerDes)
- Window functions
- Joins
  - JOIN
  - {LEFT|RIGHT|FULL} OUTER JOIN
  - LEFT SEMI JOIN
  - CROSS JOIN
- Unions
- Sub-queries
  - SELECT col FROM (SELECT a + b AS col FROM t1) t2
- Sampling
- Explain
- Partitioned tables, including dynamic partition insertion
- View
- The vast majority of DDL statements, including:
  - CREATE TABLE
  - CREATE TABLE AS SELECT
  - ALTER TABLE
- Most Hive data types, including:
  - TINYINT
  - SMALLINT
  - INT
  - BIGINT
  - BOOLEAN
  - FLOAT
  - DOUBLE
  - STRING
  - BINARY
  - TIMESTAMP
  - DATE
  - ARRAY<>
  - MAP<>
  - STRUCT<>
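A sketch tying several of the supported features above together (all table and column names are hypothetical):

```sql
-- CREATE TABLE AS SELECT combining a join and a complex type constructor.
CREATE TABLE order_summary AS
SELECT o.order_id,
       c.name,
       NAMED_STRUCT('qty', o.qty, 'price', o.price) AS detail  -- STRUCT<> constructor
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;

-- LEFT SEMI JOIN: keeps only rows from orders that have a matching
-- customer, and exposes only the left side's columns.
SELECT order_id
FROM orders o
LEFT SEMI JOIN customers c
  ON o.customer_id = c.customer_id;
```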
Unsupported Hive functionality
The following sections contain a list of Hive features that Spark SQL doesn't support. Most of these features are rarely used in Hive deployments.
Major Hive features
- Writing to a bucketed table created by Hive
- ACID fine-grained updates
Esoteric Hive features
- Union type
- Unique join
- Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics; it only supports populating the sizeInBytes field of the Hive metastore
Hive input and output formats
- File format for CLI: for results shown back to the CLI, Spark SQL supports only TextOutputFormat
- Hadoop archive
Hive optimizations
A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model.
- Block level bitmap indexes and virtual columns (used to build indexes).
- Automatically determining the number of reducers for joins and groupbys: in Spark SQL, you need to control the degree of post-shuffle parallelism using SET spark.sql.shuffle.partitions=[num_tasks];.
- Skew data flag: Spark SQL does not follow the skew data flag in Hive.
- STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
- Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge them into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
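For instance, to lower post-shuffle parallelism before a large aggregation (the value 50 and the table name are illustrative, not recommendations):

```sql
-- spark.sql.shuffle.partitions controls how many partitions Spark SQL
-- uses when shuffling data for joins and aggregations. Default is 200.
SET spark.sql.shuffle.partitions=50;

-- The GROUP BY below will now shuffle into 50 partitions.
SELECT user_id, COUNT(*) AS clicks
FROM click_events
GROUP BY user_id;
```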