Collation support for Delta Lake

You can specify collations on string fields and read Delta tables that use collations in Databricks Runtime 16.4 LTS and above.

Enabling collation for a table adds the collations-preview writer table feature. See Delta Lake feature compatibility and protocols.

Note

By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
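The practical difference is comparison semantics: UTF8_BINARY compares strings byte by byte and is therefore case-sensitive, while a collation such as UTF8_LCASE is case-insensitive. A minimal sketch:

```sql
-- Under the default UTF8_BINARY collation, comparison is case-sensitive.
SELECT 'hello' = 'HELLO';                     -- false

-- An explicit case-insensitive collation changes the comparison semantics.
SELECT 'hello' = 'HELLO' COLLATE UTF8_LCASE;  -- true
```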

Create a table with collation at the column level

You can create a new table with collation at the column level using the following command:

CREATE TABLE $tableName (
 nonCollatedColName STRING,
 collatedColName STRING COLLATE UNICODE,
 structColName STRUCT<nestedFieldName: STRING COLLATE UNICODE>,
 mapColName MAP<STRING, STRING COLLATE UNICODE>,
 arrayColName ARRAY<STRING COLLATE UNICODE>
) USING delta
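For illustration, the declared collations then govern comparisons against those columns. The table name `demo_collations` below is a hypothetical substitution for `$tableName`:

```sql
-- Assumes the CREATE TABLE statement above was run with demo_collations as the table name.
INSERT INTO demo_collations VALUES (
  'plain',
  'Collated',
  named_struct('nestedFieldName', 'nested'),
  map('key', 'value'),
  array('element')
);

-- The comparison below uses the UNICODE collation declared on collatedColName.
SELECT collatedColName
FROM demo_collations
WHERE collatedColName = 'Collated';
```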

Alter a table column to specify collation

You can change the collation of an existing column using the following command syntax:

ALTER TABLE tableName ALTER COLUMN columnName TYPE newType

To remove a non-default collation (if one was present):

ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_BINARY

To change the column collation to UTF8_LCASE:

ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_LCASE
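After such a change, comparisons and grouping on the column follow the new collation. A sketch, assuming a hypothetical table my_table with a string column name:

```sql
-- Hypothetical table and column names for illustration.
ALTER TABLE my_table ALTER COLUMN name TYPE STRING COLLATE UTF8_LCASE;

-- Equality on the column is now case-insensitive, so rows storing
-- 'Alice' or 'ALICE' both match this predicate.
SELECT count(*) FROM my_table WHERE name = 'alice';
```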

Altering the collation for a table does not automatically update statistics or data layout for previously written data. To improve file skipping over historical data under the new collation, Databricks recommends the following:

  • Run ANALYZE TABLE table_name COMPUTE DELTA STATISTICS to update file skipping statistics for existing data files.

  • For tables with liquid clustering enabled, run OPTIMIZE FULL table_name to update liquid clustering.

  • For tables that use ZORDER, do the following:

    • Disable incremental optimization in the Spark Session by overriding the default Spark configuration with the following command:

      SET spark.databricks.optimize.incremental=false
      
    • Run OPTIMIZE table_name ZORDER BY zorder_column to rewrite all existing data files.
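For a table that uses ZORDER, the recommended steps above can be run in sequence. The table and column names below are hypothetical:

```sql
-- Recompute file skipping statistics for existing data files under the new collation.
ANALYZE TABLE sales COMPUTE DELTA STATISTICS;

-- Disable incremental optimization so OPTIMIZE rewrites all existing files.
SET spark.databricks.optimize.incremental=false;

-- Rewrite all existing data files clustered by the collated column.
OPTIMIZE sales ZORDER BY (customer_name);
```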

Azure Databricks always respects the specified collation in query results, even if statistics or the data layout have not yet been updated for previously written data.

Disable collation for a table

You must explicitly disable collation for each string column in a table before dropping the collation feature.

Use the following syntax to set the collation for a column to UTF8_BINARY:

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE STRING
COLLATE UTF8_BINARY

To drop the table feature, run the following command:

ALTER TABLE table_name
DROP FEATURE collations-preview

See Drop a Delta Lake table feature and downgrade table protocol.

Schema evolution and collation

Collation interacts with schema evolution using the following rules:

  • If a source column already exists in the target table, the collation of the column in the target table remains unchanged.
  • If a source column has collation specified, the column added to the target table uses the specified collation.
  • If the target table does not have collation enabled when a column with collation is added, the collations-preview table feature is enabled.
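As a sketch of these rules, a merge with automatic schema evolution that introduces a collated source column adds that column with its collation and enables the table feature on the target. The names below are hypothetical, and the example assumes your runtime supports MERGE WITH SCHEMA EVOLUTION:

```sql
-- Hypothetical source view exposing a collated column the target lacks.
CREATE OR REPLACE TEMP VIEW updates AS
SELECT id, name COLLATE UNICODE AS name FROM staging_table;

MERGE WITH SCHEMA EVOLUTION INTO target_table t
USING updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- The name column is added to target_table with the UNICODE collation,
-- and the collations-preview table feature is enabled if it was not already.
```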

Limitations

The following limitations exist for tables with collation enabled:

  • Delta tables created externally with a collation not recognized by the Databricks Runtime throw an exception when queried.
  • There is no support for Delta Sharing.
  • Collated columns cannot be used with CHECK constraints.
  • Generated columns cannot use collation.
  • Collated columns cannot be used with bloom filter index columns.
  • There is no support for collation in OSS Delta Lake APIs for Scala or Python. You must use Spark SQL or DataFrame APIs to enable collation.
  • Dynamic partition overwrite is not supported on collated columns.
  • Collated columns cannot be referenced in Structured Streaming stateful queries.
  • External readers that do not respect the collations-preview table feature fall back to the default collation of UTF8_BINARY.
  • A MAP cannot have a key that is a collated string.
  • UniForm does not work with collations.