Binary file

Databricks Runtime supports the binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns and possibly partition columns:

path (StringType): The path of the file.
modificationTime (TimestampType): The modification time of the file. In some Hadoop FileSystem implementations, this parameter might be unavailable and the value would be set to a default value.
length (LongType): The length of the file in bytes.
content (BinaryType): The contents of the file.

To read binary files, specify the data source format as binaryFile.
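
For example, a minimal read of a directory shows the schema described above (the printed output in the comments below is illustrative):

df = spark.read.format("binaryFile").load("<path-to-dir>")
df.printSchema()
# root
#  |-- path: string (nullable = false)
#  |-- modificationTime: timestamp (nullable = false)
#  |-- length: long (nullable = false)
#  |-- content: binary (nullable = true)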

Options

To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can use the pathGlobFilter option. The following code reads all JPG files from the input directory with partition discovery:

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load("<path-to-dir>")

If you want to ignore partition discovery and recursively search files under the input directory, use the recursiveFileLookup option. This option searches through nested directories even if their names do not follow a partition naming scheme like date=2019-07-01. The following code reads all JPG files recursively from the input directory and ignores partition discovery:

df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load("<path-to-dir>")

Similar APIs exist for Scala, Java, and R.

Note

To improve read performance when you load data back, Azure Databricks recommends turning off compression when you save data loaded from binary files:

# Delta tables store data in Parquet files, so this codec setting applies:
# store the binary content as-is instead of re-compressing it.
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.format("delta").save("<path-to-table>")
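
As a brief sketch of the other half of this round trip: the saved table can be loaded back directly, and because the codec setting is session-wide, you may want to restore the Spark default (snappy) for later writes:

# Load the uncompressed table back; reads skip the decompression step.
df = spark.read.format("delta").load("<path-to-table>")

# Restore the default Parquet codec for subsequent writes in this session.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")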