Databricks Runtime 7.x migration guide (EoS)
Note
Support for this Databricks Runtime version has ended. For the end-of-support date, see End-of-support history. For all supported Databricks Runtime versions, see Databricks Runtime release notes versions and compatibility.
This guide helps you migrate your Azure Databricks workloads from Databricks Runtime 6.x, built on Apache Spark 2.4, to Databricks Runtime 7.3 LTS (EoS), built on Apache Spark 3.0.
This guide lists the Spark 3.0 behavior changes that might require you to update Azure Databricks workloads. Some of those changes include the complete removal of Python 2 support, the upgrade to Scala 2.12, full support for JDK 11, and the switch from the hybrid Julian and Gregorian calendar to the Proleptic Gregorian calendar for dates and timestamps.
This guide is a companion to the Databricks Runtime 7.3 LTS (EoS) migration guide.
New features and improvements available on Databricks Runtime 7.x
For a list of new features, improvements, and library upgrades included in Databricks Runtime 7.3 LTS, see the release notes for each Databricks Runtime version above the one you are migrating from. For the supported Databricks Runtime 7.x versions, see Databricks Runtime release notes versions and compatibility.
Post-release maintenance updates are listed in Maintenance updates for Databricks Runtime (archived).
Databricks Runtime 7.3 LTS system environment
- Operating System: Ubuntu 18.04.5 LTS
- Java (7.3 LTS): Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
- Scala: 2.12.10
- Python: 3.7.5
- R: 3.6.3 (2020-02-29)
- Delta Lake: 0.7.0
Major Apache Spark 3.0 behavior changes
The following behavior changes from Spark 2.4 to Spark 3.0 might require you to update Azure Databricks workloads when you migrate from Databricks Runtime 6.x to Databricks Runtime 7.x.
Note
This article provides a list of the important Spark behavior changes for you to consider when you migrate to Databricks Runtime 7.x.
Core
- In Spark 3.0, the deprecated accumulator v1 is removed.
- The event log file is written as UTF-8 encoding, and the Spark History Server replays event log files as UTF-8 encoding. Previously Spark wrote the event log file using the default charset of the driver JVM process, so the Spark History Server of Spark 2.x is needed to read old event log files in case of incompatible encoding.
- A new protocol for fetching shuffle blocks is used. It is recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration `spark.shuffle.useOldFetchProtocol` to `true`, as shown in the sketch after this list. Otherwise, Spark may run into errors with messages like `IllegalArgumentException: Unexpected message type: <number>`.
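If your workload still depends on an older external shuffle service, the fallback can be applied when the Spark application starts. A minimal Python sketch, assuming you control session creation (on Databricks, this setting is typically placed in the cluster's Spark configuration instead):

```python
from pyspark.sql import SparkSession

# Fall back to the Spark 2.x shuffle fetch protocol while the external
# shuffle service has not yet been upgraded. This is a core Spark setting,
# so it must be supplied before the application starts, not at runtime.
spark = (
    SparkSession.builder
    .config("spark.shuffle.useOldFetchProtocol", "true")
    .getOrCreate()
)
```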
PySpark
- In Spark 3.0, `Column.getItem` is fixed such that it does not call `Column.apply`. Consequently, if a `Column` is used as an argument to `getItem`, the indexing operator should be used. For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`, as shown in the sketch after this list.
- As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructed with named arguments for Python versions 3.6 and above; the order of fields matches the order in which they were entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to `true` for both executors and driver. This environment variable must be consistent on all executors and the driver; otherwise, it may cause failures or incorrect answers. For Python versions lower than 3.6, the field names are sorted alphabetically as the only option.
- Python 2 support is deprecated (SPARK-27884).
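A minimal sketch of both changes, using a hypothetical DataFrame with an `id` column and a map column named `data`:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, create_map, lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(3).withColumn("data", create_map(col("id"), lit("value")))

# Spark 2.4: map_col.getItem(col('id')) worked because getItem called Column.apply.
# Spark 3.0: use the indexing operator when the key is itself a Column.
df.select(df["data"][col("id")].alias("looked_up")).show()

# Spark 3.0 on Python 3.6+: Row fields keep the order in which they were entered.
row = Row(b=1, a=2)
print(row)  # Row(b=1, a=2) in Spark 3.0; sorted to Row(a=2, b=1) in Spark 2.4
```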
Structured Streaming
- In Spark 3.0, Structured Streaming forces the source schema into nullable when file-based data sources such as text, JSON, CSV, Parquet, and ORC are used via `spark.readStream(...)`. Previously, it respected the nullability in the source schema; however, this caused tricky-to-debug NullPointerException issues. To restore the previous behavior, set `spark.sql.streaming.fileSource.schema.forceNullable` to `false` (see the sketch after this list).
- Spark 3.0 fixes the correctness issue on stream-stream outer joins, which changes the schema of state. See SPARK-26154 for more details. If you start your query from a checkpoint constructed from Spark 2.x that uses a stream-stream outer join, Spark 3.0 fails the query. To recalculate outputs, discard the checkpoint and replay previous inputs.
- In Spark 3.0, the deprecated class `org.apache.spark.sql.streaming.ProcessingTime` has been removed. Use `org.apache.spark.sql.streaming.Trigger.ProcessingTime` instead. Likewise, `org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger` has been removed in favor of `Trigger.Continuous`, and `org.apache.spark.sql.execution.streaming.OneTimeTrigger` has been hidden in favor of `Trigger.Once`. See SPARK-28199.
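For example, to keep the Spark 2.4 behavior of honoring the nullability declared in a streaming file source schema, you can turn the new flag off. A sketch, with a placeholder path and schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Restore the Spark 2.4 behavior: keep the declared nullability instead of
# forcing every column of a file-based streaming source to be nullable.
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")

schema = StructType([StructField("value", StringType(), nullable=False)])
stream = spark.readStream.schema(schema).json("/path/to/input")  # placeholder path
```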
SQL, Datasets, and DataFrame
- In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed according to the ANSI SQL standard. Certain unreasonable type conversions, such as converting `string` to `int` and `double` to `boolean`, are disallowed. A runtime exception is thrown if the value is out of range for the data type of the column. In Spark version 2.4 and earlier, type conversions during table insertion are allowed as long as they are valid `Cast`s. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of byte type, the result is 1. The behavior is controlled by the option `spark.sql.storeAssignmentPolicy`, with a default value of "ANSI". Setting the option to "Legacy" restores the previous behavior.
- In Spark 3.0, when casting a string value to integral types (`tinyint`, `smallint`, `int`, and `bigint`), datetime types (`date`, `timestamp`, and `interval`), and the `boolean` type, the leading and trailing whitespace characters (<= ASCII 32) are trimmed before the value is converted. For example, `cast(' 1\t' as int)` returns `1`, `cast(' 1\t' as boolean)` returns `true`, and `cast('2019-10-10\t' as date)` returns the date value `2019-10-10`. In Spark version 2.4 and earlier, casting a string to integrals and booleans does not trim whitespace from either end, so the foregoing results are `null`, while for datetimes only trailing spaces (= ASCII 32) are removed. See https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html.
- In Spark 3.0, the deprecated methods `SQLContext.createExternalTable` and `SparkSession.createExternalTable` have been removed in favor of their replacement, `createTable`.
- In Spark 3.0, the configuration `spark.sql.crossJoin.enabled` becomes an internal configuration and is `true` by default, so by default Spark does not raise an exception on SQL with implicit cross joins.
- In Spark 3.0, the argument order of the trim function is reversed from `TRIM(trimStr, str)` to `TRIM(str, trimStr)` for compatibility with other databases.
- In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` are supported by accident. In Hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither Hive nor Presto support this syntax. Therefore these queries are treated as invalid since Spark 3.0.
- Since Spark 3.0, the Dataset and DataFrame API `unionAll` is no longer deprecated. It is an alias for `union`.
- In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, empty strings are disallowed and exceptions are thrown for data types except `StringType` and `BinaryType`.
- Since Spark 3.0, the `from_json` functions support two modes: `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in the processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`. Both the casting and `from_json` changes are illustrated in the sketch after this list.
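The whitespace-trimming and `from_json` changes can be verified directly in a notebook; a small sketch (the results in the comments reflect the behavior described above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, lit

spark = SparkSession.builder.getOrCreate()

# Leading/trailing whitespace (ASCII <= 32) is trimmed before casting strings.
spark.sql("SELECT CAST(' 1\t' AS INT) AS i, CAST(' 1\t' AS BOOLEAN) AS b").show()
# Spark 3.0: i = 1, b = true. Spark 2.4: both are null.

# from_json now defaults to PERMISSIVE mode: a malformed record produces a row
# with null fields rather than a null row.
spark.range(1).select(from_json(lit('{"a" 1}'), "a INT").alias("parsed")).show()
# Spark 3.0: parsed = {null}. Spark 2.4: parsed = null.
```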
DDL Statements
- In Spark 3.0, `CREATE TABLE` without a specific provider uses the value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and below, it was Hive. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.createHiveTableByDefault.enabled` to `true` (see the sketch after this list).
- In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed according to the ANSI SQL standard. Certain unreasonable type conversions, such as converting `string` to `int` and `double` to `boolean`, are disallowed. A runtime exception is thrown if the value is out of range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid `Cast`s. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of byte type, the result is 1. The behavior is controlled by the option `spark.sql.storeAssignmentPolicy`, with a default value of "ANSI". Setting the option to "Legacy" restores the previous behavior.
- In Spark 3.0, `SHOW CREATE TABLE` always returns Spark DDL, even when the given table is a Hive SerDe table. For generating Hive DDL, use the `SHOW CREATE TABLE AS SERDE` command instead.
- In Spark 3.0, columns of `CHAR` type are not allowed in non-Hive-SerDe tables, and `CREATE/ALTER TABLE` commands fail if `CHAR` type is detected. Use `STRING` type instead. In Spark version 2.4 and below, `CHAR` type is treated as `STRING` type and the length parameter is simply ignored.
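A sketch of the provider-related changes above, using placeholder table names and assuming a database you can write to:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without a USING clause, Spark 3.0 creates the table with the provider taken
# from spark.sql.sources.default instead of creating a Hive SerDe table.
spark.sql("CREATE TABLE t_default (id INT, name STRING)")
spark.sql("SHOW CREATE TABLE t_default").show(truncate=False)  # Spark DDL

# Restore the Spark 2.4 default of creating Hive SerDe tables.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault.enabled", "true")
spark.sql("CREATE TABLE t_hive (id INT, name STRING)")

# SHOW CREATE TABLE always returns Spark DDL; request Hive DDL explicitly.
spark.sql("SHOW CREATE TABLE t_hive AS SERDE").show(truncate=False)
```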
UDFs and Built-in Functions
- In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to `true` to keep using it. In Spark version 2.4 and below, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with a primitive-type argument, the returned UDF returns null if the input value is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and below if column `x` is null, and returns 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
- In Spark version 2.4 and below, you can create a map with duplicate keys via built-in functions like `CreateMap`, `StringToMap`, and so on. The behavior of maps with duplicate keys is undefined; for example, map lookup respects the duplicate key that appears first, `Dataset.collect` keeps only the duplicate key that appears last, and `MapKeys` returns duplicate keys. In Spark 3.0, Spark throws `RuntimeException` when duplicate keys are found. You can set `spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with a last-wins policy (see the sketch after this list). Users may still read map values with duplicate keys from data sources that do not enforce the restriction (for example, Parquet); the behavior is undefined.
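A sketch of the duplicate-key behavior (the exact exception surfaced in PySpark is wrapped by Py4J, so it is caught broadly here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.0 rejects duplicate map keys when the map is built.
try:
    spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()
except Exception as e:
    print(e)  # RuntimeException about a duplicate map key in Spark 3.0

# Opt in to last-wins deduplication instead of failing.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()  # {1 -> b}
```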
Data Sources
- In Spark version 2.4 and below, a partition column value is converted to null if it can't be cast to the corresponding user-provided schema. In 3.0, a partition column value is validated against the user-provided schema. An exception is thrown if the validation fails. You can disable such validation by setting `spark.sql.sources.validatePartitionColumns` to `false`.
- In Spark version 2.4 and below, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType`, `DoubleType`, `DateType`, and `TimestampType`, it fails on empty strings and throws exceptions. Spark 3.0 disallows empty strings and throws an exception for data types except `StringType` and `BinaryType`. The previous behavior of allowing an empty string can be restored by setting `spark.sql.legacy.json.allowEmptyString.enabled` to `true`.
- In Spark 3.0, if files or subdirectories disappear during recursive directory listing (that is, they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues), then the listing fails with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default false). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during `REFRESH TABLE`), not during query execution: the net change is that `spark.sql.files.ignoreMissingFiles` is now obeyed during table file listing and query planning, not only at query execution time.
- In Spark version 2.4 and below, the CSV data source converts a malformed CSV string to a row with all nulls in PERMISSIVE mode. In Spark 3.0, the returned row can contain non-null fields if some of the CSV column values were parsed and converted to the desired types successfully.
- In Spark 3.0, the Parquet logical type `TIMESTAMP_MICROS` is used by default when saving `TIMESTAMP` columns. In Spark version 2.4 and below, `TIMESTAMP` columns are saved as `INT96` in Parquet files. Note that some SQL systems such as Hive 1.x and Impala 2.x can only read INT96 timestamps. You can set `spark.sql.parquet.outputTimestampType` to `INT96` to restore the previous behavior and keep interoperability (see the sketch after this list).
- In Spark 3.0, when Avro files are written with a user-provided schema, the fields are matched by field names between the catalyst schema and the Avro schema instead of by position.
Query Engine
- In Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self join. A typical example: `val df1 = ...; val df2 = df1.filter(...)`; then `df1.join(df2, df1("a") > df2("a"))` returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false` (see the sketch after this list).
- In Spark 3.0, numbers written in scientific notation (for example, `1E2`) are parsed as `Double`. In Spark version 2.4 and below, they're parsed as `Decimal`. To restore the pre-Spark 3.0 behavior, you can set `spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
- In Spark 3.0, the configuration `spark.sql.crossJoin.enabled` becomes an internal configuration and is `true` by default. By default Spark does not raise exceptions on SQL with implicit cross joins.
- In Spark version 2.4 and below, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered different values when used in aggregate grouping keys, window partition keys, and join keys. In Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and below.
- In Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and below, the conversion uses the default time zone of the Java virtual machine.
- In Spark 3.0, Spark casts `String` to `Date/Timestamp` in binary comparisons with dates/timestamps. The previous behavior of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString.enabled` to `true`.
- In Spark version 2.4 and below, invalid time zone IDs are silently ignored and replaced by the GMT time zone, for example, in the `from_utc_timestamp` function. In Spark 3.0, such time zone IDs are rejected, and Spark throws `java.time.DateTimeException`.
- In Spark 3.0, the Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps, as well as in extracting subcomponents like years, days, and so on. Spark 3.0 uses Java 8 API classes from the `java.time` packages that are based on ISO chronology. In Spark version 2.4 and below, those operations are performed using the hybrid calendar (Julian + Gregorian). The changes impact the results for dates before October 15, 1582 (Gregorian) and affect the following Spark 3.0 APIs:
  - Parsing/formatting of timestamp/date strings. This affects CSV/JSON data sources and the `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, and `to_timestamp` functions when patterns specified by users are used for parsing and formatting. In Spark 3.0, we define our own pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via `java.time.format.DateTimeFormatter` under the hood. The new implementation performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the parser does not consume the whole input. Another example is that the `31/01/2015 00:00` input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` presupposes hours in the range 1-12. In Spark version 2.4 and below, `java.text.SimpleDateFormat` is used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat. The old behavior can be restored by setting `spark.sql.legacy.timeParserPolicy` to `LEGACY`.
  - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, `from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use the `java.time` API for calculating the week number of the year and the day number of the week, as well as for conversion from/to `TimestampType` values in the UTC time zone.
  - The JDBC options `lowerBound` and `upperBound` are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on the Proleptic Gregorian calendar and the time zone defined by the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and below, the conversion is based on the hybrid calendar (Julian + Gregorian) and on the default system time zone.
  - Formatting `TIMESTAMP` and `DATE` literals.
  - Creating typed `TIMESTAMP` and `DATE` literals from strings. In Spark 3.0, string conversion to typed `TIMESTAMP/DATE` literals is performed via casting to `TIMESTAMP/DATE` values. For example, `TIMESTAMP '2019-12-23 12:59:30'` is semantically equal to `CAST('2019-12-23 12:59:30' AS TIMESTAMP)`. When the input string does not contain information about the time zone, the time zone from the SQL config `spark.sql.session.timeZone` is used. In Spark version 2.4 and below, the conversion is based on the JVM system time zone. The different sources of the default time zone may change the behavior of typed `TIMESTAMP` and `DATE` literals.
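A sketch of the self-join and datetime-parsing changes, using small placeholder DataFrames:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, to_timestamp

spark = SparkSession.builder.getOrCreate()

# Ambiguous self-join references now fail instead of silently returning an
# empty result (spark.sql.analyzer.failAmbiguousSelfJoin is true by default).
df1 = spark.range(10).withColumnRenamed("id", "a")
df2 = df1.filter(df1["a"] > 3)
try:
    df1.join(df2, df1["a"] > df2["a"]).show()
except Exception as e:
    print(e)  # analysis error about an ambiguous self join in Spark 3.0

# Restore the SimpleDateFormat-based parser for user-specified patterns.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.range(1).select(
    to_timestamp(lit("31/01/2015 00:00"), "dd/MM/yyyy hh:mm").alias("ts")
).show()  # parses with the legacy parser; rejected by the strict Spark 3.0 parser
```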
Apache Hive
- In Spark 3.0, the built-in Hive version is upgraded from 1.2 to 2.3, which has the following impacts:
  - You may need to set `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` according to the version of the Hive metastore you want to connect to. For example: set `spark.sql.hive.metastore.version` to `1.2.1` and `spark.sql.hive.metastore.jars` to `maven` if your Hive metastore version is 1.2.1 (see the sketch after this list).
  - You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with the `hive-1.2` profile. See HIVE-15167 for more details.
  - The decimal string representation can differ between Hive 1.2 and Hive 2.3 when using the `TRANSFORM` operator in SQL for script transformation, which depends on Hive's behavior. In Hive 1.2, the string representation omits trailing zeros. In Hive 2.3, it is always padded to 18 digits with trailing zeros if necessary.
  - In Databricks Runtime 7.x, when reading a Hive SerDe table, by default Spark disallows reading files under a subdirectory that is not a table partition. To enable it, set the configuration `spark.databricks.io.hive.scanNonpartitionedDirectory.enabled` to `true`. This does not affect Spark native table readers and file readers.
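On Databricks, the metastore settings above are normally placed in the cluster's Spark configuration rather than in a notebook, but the shape of the configuration looks like this sketch (the metastore version and JAR location are placeholders for your environment):

```python
from pyspark.sql import SparkSession

# Point the Hive metastore client at a Hive 1.2.1 metastore. These options
# must be set before the session starts; on Databricks, put them in the
# cluster's Spark config.
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "1.2.1")
    .config("spark.sql.hive.metastore.jars", "maven")  # or a folder of Hive 1.2 JARs
    .enableHiveSupport()
    .getOrCreate()
)
```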
MLlib
- `OneHotEncoder`, which is deprecated in 2.3, is removed in 3.0, and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder` (see the sketch after this list).
- `org.apache.spark.ml.image.ImageSchema.readImages`, which is deprecated in 2.3, is removed in 3.0. Use `spark.read.format('image')` instead.
- `org.apache.spark.mllib.clustering.KMeans.train` with param Int `runs`, which is deprecated in 2.1, is removed in 3.0. Use the train method without runs instead.
- `org.apache.spark.mllib.classification.LogisticRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0. Use `org.apache.spark.ml.classification.LogisticRegression` or `spark.mllib.classification.LogisticRegressionWithLBFGS` instead.
- `org.apache.spark.mllib.feature.ChiSqSelectorModel.isSorted`, which is deprecated in 2.1, is removed in 3.0, and is not intended for subclasses to use.
- `org.apache.spark.mllib.regression.RidgeRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0. Use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam = 0.0`. Note the default `regParam` is 0.01 for `RidgeRegressionWithSGD`, but is 0.0 for `LinearRegression`.
- `org.apache.spark.mllib.regression.LassoWithSGD`, which is deprecated in 2.0, is removed in 3.0. Use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam = 1.0`. Note the default `regParam` is 0.01 for `LassoWithSGD`, but is 0.0 for `LinearRegression`.
- `org.apache.spark.mllib.regression.LinearRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0. Use `org.apache.spark.ml.regression.LinearRegression` or `LBFGS` instead.
- `org.apache.spark.mllib.clustering.KMeans.getRuns` and `setRuns`, which are deprecated in 2.1, are removed in 3.0, and have had no effect since Spark 2.0.0.
- `org.apache.spark.ml.LinearSVCModel.setWeightCol`, which is deprecated in 2.4, is removed in 3.0, and is not intended for users.
- In 3.0, `org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel` extends `MultilayerPerceptronParams` to expose the training params. As a result, `layers` in `MultilayerPerceptronClassificationModel` has been changed from `Array[Int]` to `IntArrayParam`. Use `MultilayerPerceptronClassificationModel.getLayers` instead of `MultilayerPerceptronClassificationModel.layers` to retrieve the size of layers.
- `org.apache.spark.ml.classification.GBTClassifier.numTrees`, which is deprecated in 2.4.5, is removed in 3.0. Use `getNumTrees` instead.
- `org.apache.spark.ml.clustering.KMeansModel.computeCost`, which is deprecated in 2.4, is removed in 3.0. Use `ClusteringEvaluator` instead.
- The member variable `precision` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
- The member variable `recall` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
- The member variable `fMeasure` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
- `org.apache.spark.ml.util.GeneralMLWriter.context`, which is deprecated in 2.0, is removed in 3.0. Use `session` instead.
- `org.apache.spark.ml.util.MLWriter.context`, which is deprecated in 2.0, is removed in 3.0. Use `session` instead.
- `org.apache.spark.ml.util.MLReader.context`, which is deprecated in 2.0, is removed in 3.0. Use `session` instead.
- `abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]` is changed to `abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]` in 3.0.
- In Spark 3.0, a multiclass logistic regression in PySpark now (correctly) returns `LogisticRegressionSummary`, not the subclass `BinaryLogisticRegressionSummary`. The additional methods exposed by `BinaryLogisticRegressionSummary` would not work in this case anyway. (SPARK-31681)
- In Spark 3.0, `pyspark.ml.param.shared.Has*` mixins no longer provide any `set*(self, value)` setter methods; use the respective `self.set(self.*, value)` instead. See SPARK-29093 for details.
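For example, code that used `OneHotEncoderEstimator` in Spark 2.x now uses the renamed `OneHotEncoder`, which is fit like any other estimator; a minimal PySpark sketch with a toy DataFrame:

```python
from pyspark.ml.feature import OneHotEncoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (1.0,)], ["category"])

# In Spark 3.0, OneHotEncoder is the estimator formerly named OneHotEncoderEstimator.
encoder = OneHotEncoder(inputCols=["category"], outputCols=["category_vec"])
model = encoder.fit(df)
model.transform(df).show()
```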
Other behavior changes
The upgrade to Scala 2.12 involves the following changes:

- Package cell serialization is handled differently. The following example illustrates the behavior change and how to handle it.

  Running `foo.bar.MyObjectInPackageCell.run()` as defined in the following package cell will trigger the error `java.lang.NoClassDefFoundError: Could not initialize class foo.bar.MyObjectInPackageCell$`:

  ```scala
  package foo.bar

  case class MyIntStruct(int: Int)

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.Column

  object MyObjectInPackageCell extends Serializable {

    // Because SparkSession cannot be created in Spark executors,
    // the following line triggers the error
    // Could not initialize class foo.bar.MyObjectInPackageCell$
    val spark = SparkSession.builder.getOrCreate()

    def foo: Int => Option[MyIntStruct] = (x: Int) => Some(MyIntStruct(100))

    val theUDF = udf(foo)

    val df = {
      val myUDFInstance = theUDF(col("id"))
      spark.range(0, 1, 1, 1).withColumn("u", myUDFInstance)
    }

    def run(): Unit = {
      df.collect().foreach(println)
    }
  }
  ```

  To work around this error, you can wrap `MyObjectInPackageCell` inside a serializable class.

- Certain cases using `DataStreamWriter.foreachBatch` will require a source code update. This change is due to the fact that Scala 2.12 has automatic conversion from lambda expressions to SAM types and can cause ambiguity.

  For example, the following Scala code can't compile:

  ```scala
  streams
    .writeStream
    .foreachBatch { (df, id) => myFunc(df, id) }
  ```

  To fix the compilation error, change `foreachBatch { (df, id) => myFunc(df, id) }` to `foreachBatch(myFunc _)` or use the Java API explicitly: `foreachBatch(new VoidFunction2 ...)`.
Because the Apache Hive version used for handling Hive user-defined functions and Hive SerDes is upgraded to 2.3, two changes are required:

- Hive's `SerDe` interface is replaced by an abstract class `AbstractSerDe`. For any custom Hive `SerDe` implementation, migrating to `AbstractSerDe` is required.
- Setting `spark.sql.hive.metastore.jars` to `builtin` means that the Hive 2.3 metastore client will be used to access metastores for Databricks Runtime 7.x. If you need to access Hive 1.2 based external metastores, set `spark.sql.hive.metastore.jars` to the folder that contains Hive 1.2 JARs.
Deprecations and removals
- Data skipping index was deprecated in Databricks Runtime 4.3 and removed in Databricks Runtime 7.x. We recommend that you use Delta tables instead, which offer improved data skipping capabilities.
- In Databricks Runtime 7.x, the underlying version of Apache Spark uses Scala 2.12. Since libraries compiled against Scala 2.11 can disable Databricks Runtime 7.x clusters in unexpected ways, clusters running Databricks Runtime 7.x do not install libraries configured to be installed on all clusters. The cluster Libraries tab shows a status of `Skipped` and a deprecation message that explains the changes in library handling. However, if you have a cluster that was created on an earlier version of Databricks Runtime before Azure Databricks platform version 3.20 was released to your workspace, and you now edit that cluster to use Databricks Runtime 7.x, any libraries that were configured to be installed on all clusters will be installed on that cluster. In this case, any incompatible JARs in the installed libraries can cause the cluster to be disabled. The workaround is either to clone the cluster or to create a new cluster.
Known issues
- Parsing day of year using pattern letter 'D' returns the wrong result if the year field is missing. This can happen in SQL functions like `to_timestamp`, which parse a datetime string to datetime values using a pattern string. (SPARK-31939)
- Join/Window/Aggregate inside subqueries may lead to wrong results if the keys have values -0.0 and 0.0. (SPARK-31958)
- A window query may fail with an ambiguous self-join error unexpectedly. (SPARK-31956)
- Streaming queries with the `dropDuplicates` operator may not be able to restart with the checkpoint written by Spark 2.x. (SPARK-31990)