Save Apache Spark DataFrames to TFRecord files and load with TensorFlow

The TFRecord file format is a simple record-oriented binary format for ML training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as part of an input pipeline.
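To make "record-oriented binary format" concrete, the following is a minimal pure-Python sketch of the record framing as TensorFlow documents it: an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the payload, and a 4-byte masked CRC32C of the payload. The function names (`crc32c`, `masked_crc`, `write_record`, `read_records`) are hypothetical helpers written for illustration. In real pipelines you would use TensorFlow's own reader and writer instead.

```python
import struct
from typing import BinaryIO, Iterator

_MASK_DELTA = 0xA282EAD8  # constant used by the TFRecord CRC masking step


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli). Slow, but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def write_record(stream: BinaryIO, payload: bytes) -> None:
    """Append one framed record: length, length CRC, payload, payload CRC."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield payloads one at a time, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) != 8:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(header):
            raise ValueError("corrupted record length")
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if len(payload) != length or payload_crc != masked_crc(payload):
            raise ValueError("corrupted record payload")
        yield payload
```

Because each record carries its own length, a reader can stream through a file one record at a time without loading it all into memory, which is what makes the format a natural fit for input pipelines.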

Note

This is not a comprehensive guide to importing data with TensorFlow. See the TensorFlow API Guide.

Save Apache Spark DataFrames to TFRecord files

You can use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files.

spark-tensorflow-connector is a library within the TensorFlow ecosystem that converts between Spark DataFrames and TFRecords (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use Spark DataFrame APIs to read TFRecord files into DataFrames and to write DataFrames out as TFRecord files.
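As a rough illustration of the DataFrame API usage described above, here is a minimal sketch in Python. It assumes the connector is installed on the cluster and registers the `tfrecords` data source with a `recordType` option, as described in the connector's README. The function names, their parameters, and the default write mode are choices made for this example, not part of the library.

```python
from typing import Any


def save_dataframe_as_tfrecords(
    df: Any, path: str, record_type: str = "Example", mode: str = "overwrite"
) -> None:
    """Write a Spark DataFrame to `path` as TFRecord files.

    `df` is expected to be a pyspark.sql.DataFrame. Assumes
    spark-tensorflow-connector is available on the cluster.
    """
    (
        df.write
        .format("tfrecords")                # data source provided by the connector
        .option("recordType", record_type)  # e.g. "Example" or "SequenceExample"
        .mode(mode)                         # standard Spark save mode
        .save(path)
    )


def load_tfrecords_as_dataframe(
    spark: Any, path: str, record_type: str = "Example"
) -> Any:
    """Read TFRecord files at `path` back into a Spark DataFrame.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    return (
        spark.read
        .format("tfrecords")
        .option("recordType", record_type)
        .load(path)
    )
```

As with other Spark data sources, the write produces a directory of part files rather than a single file, so downstream readers should expect to list that directory.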

Note

The spark-tensorflow-connector library is included in Databricks Runtime for Machine Learning, a runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing the library using the following instructions, you can simply create a cluster using Databricks Runtime for Machine Learning. To use spark-tensorflow-connector on Databricks Runtime, you need to install the library from Maven. See Maven or Spark package for details.

Load data from TFRecord files with TensorFlow

You can load the TFRecord files using the tf.data.TFRecordDataset class. See Reading a TFRecord file from TensorFlow for details.
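The loading step can be sketched as follows. This is a minimal example under stated assumptions: the records are serialized `tf.train.Example` messages, and the feature names `label` and `score` are hypothetical stand-ins for whatever columns your DataFrame actually had. The helper names are also invented for this sketch. It assumes TensorFlow 2.4 or later for `tf.data.AUTOTUNE`.

```python
from typing import Any, Dict, Sequence


def example_feature_spec() -> Dict[str, Any]:
    """Hypothetical schema: one int64 `label` and one float `score` per record."""
    import tensorflow as tf

    return {
        "label": tf.io.FixedLenFeature([], tf.int64),
        "score": tf.io.FixedLenFeature([], tf.float32),
    }


def load_tfrecord_dataset(
    file_paths: Sequence[str], feature_spec: Dict[str, Any], batch_size: int = 32
):
    """Stream parsed, batched examples from one or more TFRecord files."""
    # Imported lazily so this module can be imported where TensorFlow is absent.
    import tensorflow as tf

    # Each element of the raw dataset is one serialized record (a scalar string).
    raw = tf.data.TFRecordDataset(list(file_paths))

    def parse(serialized):
        # Decode one serialized tf.train.Example into a dict of tensors.
        return tf.io.parse_single_example(serialized, feature_spec)

    return (
        raw.map(parse, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
```

The feature spec has to match the names and types of the columns that were written. Since Spark writes a directory of part files, you would typically collect those file paths first and pass the full list as `file_paths`.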

The following example notebook demonstrates how to save data from Apache Spark DataFrames to TFRecord files and load TFRecord files for ML training.

Prepare image data for distributed DL

Get notebook