How to handle corrupted Parquet files with different schemas

Problem

Let's say you have a large number of essentially independent Parquet files with a variety of different schemas. You want to read only the files that match a specific schema and skip the files that don't.

One solution could be to read the files in sequence, identify the schema, and union the DataFrames together. However, this approach is impractical when there are hundreds of thousands of files.
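For illustration, a minimal sketch of this sequential approach in PySpark might look like the following (file_paths and the reference file path are hypothetical placeholders):

from functools import reduce
from pyspark.sql import DataFrame

# Use one known-good file to establish the reference schema (path is illustrative).
target_schema = spark.read.parquet("/data/known_good_file.parquet").schema

# Read each file individually and keep only those whose schema matches.
matching = []
for path in file_paths:  # file_paths: hypothetical list of Parquet file paths
    df = spark.read.parquet(path)
    if df.schema == target_schema:
        matching.append(df)

# Union all matching DataFrames into a single result.
result = reduce(DataFrame.unionByName, matching)

With hundreds of thousands of files, this loop issues one read per file, which is why it does not scale.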

Solution

Set the Apache Spark property spark.sql.files.ignoreCorruptFiles to true, then read the files with the desired schema. Files that don't match the specified schema are ignored. The resulting dataset contains only data from the files that match the specified schema.

Set the Spark property using spark.conf.set:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
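Then read the files with the desired schema. A minimal sketch, assuming PySpark and an illustrative two-column schema and path:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# The schema you want to keep; the fields here are only an example.
desired_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

# With ignoreCorruptFiles enabled, files that cannot be read with this schema are skipped.
df = spark.read.schema(desired_schema).parquet("/mnt/data/parquet_files/")  # path is illustrative
df.show()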

Alternatively, you can set this property in your Spark configuration.
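For example, if you create the session yourself, the property can be supplied when building the SparkSession (a sketch; the app name is illustrative), or added as spark.sql.files.ignoreCorruptFiles true in your cluster's Spark configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("read-matching-parquet")  # illustrative app name
        .config("spark.sql.files.ignoreCorruptFiles", "true")
        .getOrCreate()
)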