Parquet format in Azure Data Factory and Azure Synapse Analytics

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Follow this article when you want to parse the Parquet files or write the data into Parquet format.

Parquet format is supported for the following connectors:

For a list of supported features for all available connectors, visit the Connectors Overview article.

Using Self-hosted Integration Runtime

Important

For copy empowered by Self-hosted Integration Runtime e.g. between on-premises and cloud data stores, if you are not copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment), JDK 23 (Java Development Kit), or OpenJDK on your IR machine. Check the following paragraph with more details.

For copy running on Self-hosted IR with Parquet file serialization/deserialization, the service locates the Java runtime by firstly checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE, if not found, secondly checking system variable JAVA_HOME for OpenJDK.

To use JRE: The 64-bit IR requires 64-bit JRE. You can find it from here.
To use JDK: The 64-bit IR requires 64-bit JDK 23. You can find it from here. Be sure to update the JAVA_HOME system variable to the root folder of the JDK 23 installation i.e. C:\Program Files\Java\jdk-23, and add the path to both the C:\Program Files\Java\jdk-23\bin and C:\Program Files\Java\jdk-23\bin\server folders to the Path system variable.
To use OpenJDK: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies of OpenJDK into Self-hosted IR machine, and set system environment variable JAVA_HOME accordingly, and then restart Self-hosted IR for taking effect immediately. To download the Microsoft Build of OpenJDK, see Microsoft Build of OpenJDK�.

Tip

If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.

Set JVM heap size on Self-hosted IR

Example: set variable _JAVA_OPTIONS with value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory. By default, the service uses min 64 MB and max 1G.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the Parquet dataset.

Property	Description	Required
type	The type property of the dataset must be set to Parquet.	Yes
location	Location settings of the file(s). Each file-based connector has its own location type and supported properties under `location`. See details in connector article -> Dataset properties section.	Yes
compressionCodec	The compression codec to use when writing to Parquet files. When reading from Parquet files, Data Factories automatically determine the compression codec based on the file metadata. Supported types are "none", "gzip", "snappy" (default), and "lzo". Note currently Copy activity doesn't support LZO when read/write Parquet files.	No

Note

White space in column name is not supported for Parquet files.

Below is an example of Parquet dataset on Azure Blob Storage:

{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder",
            },
            "compressionCodec": "snappy"
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Parquet source and sink.

Parquet as source

The following properties are supported in the copy activity *source* section.

Property	Description	Required
type	The type property of the copy activity source must be set to ParquetSource.	Yes
storeSettings	A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under `storeSettings`. See details in connector article -> Copy activity properties section.	No

Parquet as sink

The following properties are supported in the copy activity *sink* section.

Property	Description	Required
type	The type property of the copy activity sink must be set to ParquetSink.	Yes
formatSettings	A group of properties. Refer to Parquet write settings table below.	No
storeSettings	A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under `storeSettings`. See details in connector article -> Copy activity properties section.	No

Supported Parquet write settings under formatSettings:

Property	Description	Required
type	The type of formatSettings must be set to ParquetWriteSettings.	Yes
maxRowsPerFile	When writing data into a folder, you can choose to write to multiple files and specify the max rows per file.	No
fileNamePrefix	Applicable when `maxRowsPerFile` is configured. Specify the file name prefix when writing data to multiple files, resulted in this pattern: `<fileNamePrefix>_00000.<fileExtension>`. If not specified, file name prefix will be auto generated. This property does not apply when source is file-based store or partition-option-enabled data store.	No

Mapping data flow properties

In mapping data flows, you can read and write to parquet format in the following data stores: Azure Blob Storage, Azure Data Lake Storage Gen2 and SFTP, and you can read parquet format in Amazon S3.

Source properties

The below table lists the properties supported by a parquet source. You can edit these properties in the Source options tab.

Name	Description	Required	Allowed values	Data flow script property
Format	Format must be `parquet`	yes	`parquet`	format
Wild card paths	All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset.	no	String[]	wildcardPaths
Partition root path	For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns	no	String	partitionRootPath
List of files	Whether your source is pointing to a text file that lists files to process	no	`true` or `false`	fileList
Column to store file name	Create a new column with the source file name and path	no	String	rowUrlColumn
After completion	Delete or move the files after processing. File path starts from the container root	no	Delete: `true` or `false` Move: `[<from>, <to>]`	purgeFiles moveFiles
Filter by last modified	Choose to filter files based upon when they were last altered	no	Timestamp	modifiedAfter modifiedBefore
Allow no files found	If true, an error is not thrown if no files are found	no	`true` or `false`	ignoreNoFilesFound

Source example

The below image is an example of a parquet source configuration in mapping data flows.

Parquet source

The associated data flow script is:

source(allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn: 'fileName',
    format: 'parquet') ~> ParquetSource

Sink properties

The below table lists the properties supported by a parquet sink. You can edit these properties in the Settings tab.

Name	Description	Required	Allowed values	Data flow script property
Format	Format must be `parquet`	yes	`parquet`	format
Clear the folder	If the destination folder is cleared prior to write	no	`true` or `false`	truncate
File name option	The naming format of the data written. By default, one file per partition in format `part-#####-tid-<guid>`	no	Pattern: String Per partition: String[] As data in column: String Output to single file: `['<fileName>']`	filePattern partitionFileNames rowUrlColumn partitionFileNames

Sink example

The below image is an example of a parquet sink configuration in mapping data flows.

Parquet sink

The associated data flow script is:

ParquetSource sink(
    format: 'parquet',
    filePattern:'output[n].parquet',
    truncate: true,
    allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> ParquetSink

Data type mapping for Parquet

When reading data from the source connector in Parquet format, the following mappings are used from Parquet data types to interim data types used by the service internally.

Parquet type	Interim service data type
BOOLEAN	Boolean
INT_8	SByte
INT_16	Int16
INT_32	Int32
INT_64	Int64
INT96	DateTime
UINT_8	Byte
UINT_16	UInt16
UINT_32	UInt32
UINT_64	UInt64
DECIMAL	Decimal
FLOAT	Single
DOUBLE	Double
DATE	Date
TIME_MILLIS	TimeSpan
TIME_MICROS	Int64
TIMESTAMP_MILLIS	DateTime
TIMESTAMP_MICROS	Int64
STRING	String
UTF8	String
ENUM	Byte array
UUID	Byte array
JSON	Byte array
BSON	Byte array
BINARY	Byte array
FIXED_LEN_BYTE_ARRAY	Byte array

When writing data to the sink connector in Parquet format, the following mappings are used from interim data types used by the service internally to Parquet data types.

Interim service data type	Parquet type
Boolean	BOOLEAN
SByte	INT_8
Int16	INT_32
Int32	INT_32
Int64	INT_64
Byte	INT_32
UInt16	INT_32
UInt32	INT_64
UInt64	DECIMAL
Decimal	DECIMAL
Single	FLOAT
Double	DOUBLE
Date	DATE
DateTime	INT96
DateTimeOffset	INT96
TimeSpan	INT96
String	UTF8
GUID	UTF8
Byte array	BINARY

To learn about how the copy activity maps the source schema and data type to the sink, see Schema and data type mappings.

Parquet complex data types (e.g. MAP, LIST, STRUCT) are currently supported only in Data Flows, not in Copy Activity. To use complex types in data flows, do not import the file schema in the dataset, leaving schema blank in the dataset. Then, in the Source transformation, import the projection.

Last updated on 2026-07-16