Parquet format in Azure Data Factory

Applies to: Azure Data Factory, Azure Synapse Analytics

Follow this article when you want to parse Parquet files or write data in Parquet format.

Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the Parquet dataset.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to Parquet. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See the Dataset properties section of the relevant connector article for details. | Yes |
| compressionCodec | The compression codec to use when writing Parquet files. When reading Parquet files, Data Factory determines the compression codec automatically from the file metadata. Supported types are "none", "gzip", "snappy" (default), and "lzo". Note that the Copy activity currently doesn't support LZO when reading or writing Parquet files. | No |

Note

White space in column names is not supported for Parquet files.

Below is an example of a Parquet dataset on Azure Blob Storage:

{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compressionCodec": "snappy"
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Parquet source and sink.

Parquet as source

The following properties are supported in the copy activity source section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to ParquetSource. | Yes |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See the Copy activity properties section of the relevant connector article for details. | No |
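As a rough illustration of how these source properties fit together, the fragment below sketches a copy activity that reads Parquet from Azure Blob Storage. It is a sketch under assumptions, not a definitive definition: the activity name is a hypothetical placeholder, the storeSettings type and its recursive setting come from the Azure Blob Storage connector rather than from this article, and the inputs, outputs, and sink details are omitted.

```json
{
    "name": "<hypothetical activity name>",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "ParquetSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": true
            }
        },
        "sink": {
            "type": "<sink type>"
        }
    }
}
```

For another file-based store, the storeSettings block would use the read settings type documented in that connector's article.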

Parquet as sink

The following properties are supported in the copy activity sink section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity sink must be set to ParquetSink. | Yes |
| formatSettings | A group of properties. Refer to the Parquet write settings table below. | No |
| storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. See the Copy activity properties section of the relevant connector article for details. | No |

Supported Parquet write settings under formatSettings:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of formatSettings must be set to ParquetWriteSettings. | Yes |
| maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the maximum number of rows per file. | No |
| fileNamePrefix | Applicable when maxRowsPerFile is configured. Specifies the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension>. If not specified, the file name prefix is generated automatically. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No |
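The sink and write settings above might be combined as in the following fragment, which sketches a copy activity writing Parquet to Azure Blob Storage and splitting the output into multiple files. The activity name is a hypothetical placeholder, the maxRowsPerFile and fileNamePrefix values are arbitrary examples, and the storeSettings type is taken from the Azure Blob Storage connector rather than from this article. Inputs, outputs, and source details are omitted.

```json
{
    "name": "<hypothetical activity name>",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "<source type>"
        },
        "sink": {
            "type": "ParquetSink",
            "storeSettings": {
                "type": "AzureBlobStorageWriteSettings"
            },
            "formatSettings": {
                "type": "ParquetWriteSettings",
                "maxRowsPerFile": 1000000,
                "fileNamePrefix": "part"
            }
        }
    }
}
```

With these example values, each output file would hold at most 1,000,000 rows and file names would follow the pattern part_00000.<fileExtension>, provided the source is neither a file-based store nor a partition-option-enabled data store.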

Mapping data flow properties

In mapping data flows, you can read and write parquet format in the following data stores: Azure Blob Storage and Azure Data Lake Storage Gen2.

Source properties

The table below lists the properties supported by a parquet source. You can edit these properties in the Source options tab.

| Name | Description | Required | Allowed values | Data flow script property |
| --- | --- | --- | --- | --- |
| Format | Format must be parquet | yes | parquet | format |
| Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths |
| Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns | no | String | partitionRootPath |
| List of files | Whether your source is pointing to a text file that lists files to process | no | true or false | fileList |
| Column to store file name | Create a new column with the source file name and path | no | String | rowUrlColumn |
| After completion | Delete or move the files after processing. File path starts from the container root | no | Delete: true or false; Move: [<from>, <to>] | purgeFiles; moveFiles |
| Filter by last modified | Choose to filter files based upon when they were last altered | no | Timestamp | modifiedAfter; modifiedBefore |
| Allow no files found | If true, an error is not thrown if no files are found | no | true or false | ignoreNoFilesFound |

Source example

The image below is an example of a parquet source configuration in mapping data flows.

(Image: Parquet source)

The associated data flow script is:

source(allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn: 'fileName',
    format: 'parquet') ~> ParquetSource

Sink properties

The table below lists the properties supported by a parquet sink. You can edit these properties in the Settings tab.

| Name | Description | Required | Allowed values | Data flow script property |
| --- | --- | --- | --- | --- |
| Format | Format must be parquet | yes | parquet | format |
| Clear the folder | Whether the destination folder is cleared prior to write | no | true or false | truncate |
| File name option | The naming format of the data written. By default, one file per partition in format part-#####-tid-<guid> | no | Pattern: String; Per partition: String[]; As data in column: String; Output to single file: ['<fileName>'] | filePattern; partitionFileNames; rowUrlColumn; partitionFileNames |

Sink example

The image below is an example of a parquet sink configuration in mapping data flows.

(Image: Parquet sink)

The associated data flow script is:

ParquetSource sink(
    format: 'parquet',
    filePattern:'output[n].parquet',
    truncate: true,
    allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> ParquetSink

Data type support

Parquet complex data types (e.g. MAP, LIST, STRUCT) are currently supported only in Data Flows, not in the Copy activity. To use complex types in data flows, do not import the file schema in the dataset; leave the schema blank in the dataset. Then, in the Source transformation, import the projection.

Using Self-hosted Integration Runtime

Important

For copies run by the Self-hosted Integration Runtime, for example between on-premises and cloud data stores, if you are not copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK, and the Microsoft Visual C++ 2010 Redistributable Package, on your IR machine. See the following paragraph for more details.

For copies running on a Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime by first checking the registry key (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE and, if that is not found, then checking the system variable JAVA_HOME for OpenJDK.

  • To use JRE: The 64-bit IR requires a 64-bit JRE.
  • To use OpenJDK: It is supported since IR version 3.13. Package jvm.dll with all other required OpenJDK assemblies onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
  • To install the Visual C++ 2010 Redistributable Package: The Visual C++ 2010 Redistributable Package is not installed with Self-hosted IR installations, so it must be installed separately.

Tip

If you copy data to or from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the minimum and maximum heap size for the JVM, and then rerun the pipeline.

(Image: Set JVM heap size on Self-hosted IR)

Example: set the variable _JAVA_OPTIONS to the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM starts with Xms amount of memory and can use at most Xmx amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.

Next steps