Prepare data for distributed training

This article describes two methods for preparing data for distributed training: Petastorm and TFRecord.

Petastorm is an open source data access library that directly loads data stored in Apache Parquet format. This is convenient for Azure Databricks and Apache Spark users because Parquet is the recommended data format. The following article describes and illustrates this use case:

TFRecord

You can also use the TFRecord format as the data source for distributed deep learning. TFRecord is a simple record-oriented binary format that many TensorFlow applications use for training data.
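To make "record-oriented binary format" concrete, the following is a minimal pure-Python sketch of the record framing as it is commonly documented for TFRecord: a little-endian 64-bit length, a masked CRC-32C of the length, the payload bytes, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc`, `frame_record`, `read_records`) are hypothetical and written only for illustration; in practice you would use TensorFlow's own writer and reader rather than this code.

```python
import struct

# Reflected form of the Castagnoli polynomial used by CRC-32C.
_CRC32C_POLY = 0x82F63B78


def _build_table():
    """Precompute the 256-entry lookup table for byte-wise CRC-32C."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data):
    """Plain CRC-32C of a bytes object (hypothetical helper)."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Rotate and offset the CRC; assumed to match the TFRecord masking rule."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap one payload as: length, CRC(length), payload, CRC(payload)."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc(length))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )


def read_records(buffer):
    """Yield payloads from a bytes buffer of framed records, checking CRCs."""
    offset = 0
    while offset < len(buffer):
        length_bytes = buffer[offset:offset + 8]
        (length,) = struct.unpack("<Q", length_bytes)
        (length_crc,) = struct.unpack("<I", buffer[offset + 8:offset + 12])
        if length_crc != masked_crc(length_bytes):
            raise ValueError("corrupt length field")
        payload = buffer[offset + 12:offset + 12 + length]
        (payload_crc,) = struct.unpack(
            "<I", buffer[offset + 12 + length:offset + 16 + length]
        )
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupt payload")
        yield payload
        offset += 16 + length


# Two records framed back to back, then read again.
framed = frame_record(b"first record") + frame_record(b"second")
decoded = list(read_records(framed))
```

Each record carries 16 bytes of framing overhead, and the payload itself is opaque bytes; TensorFlow applications typically store serialized `tf.train.Example` messages there.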

tf.data.TFRecordDataset is the TensorFlow dataset composed of records from TFRecord files. For more details about how to consume TFRecord data, see the TensorFlow guide Consuming TFRecord data.
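As a rough illustration of that workflow, the sketch below writes a few `tf.train.Example` records to a temporary TFRecord file and reads them back through `tf.data.TFRecordDataset`. The feature names (`id`, `label`), the file name, and the batch size are assumptions made for this example and do not come from the article; it is a minimal sketch, not the recommended pipeline.

```python
import os
import tempfile

import tensorflow as tf


def make_example(record_id, label):
    """Build a tf.train.Example with two illustrative features."""
    return tf.train.Example(features=tf.train.Features(feature={
        "id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[record_id])),
        "label": tf.train.Feature(
            float_list=tf.train.FloatList(value=[label])),
    }))


# Write four serialized examples to a temporary file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(4):
        writer.write(make_example(i, i * 0.5).SerializeToString())

# The feature spec tells the parser how to decode each serialized record.
feature_spec = {
    "id": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.float32),
}


def parse(serialized):
    """Decode one serialized tf.train.Example into a dict of tensors."""
    return tf.io.parse_single_example(serialized, feature_spec)


# Read the file back as a dataset, parse each record, and batch in pairs.
dataset = tf.data.TFRecordDataset([path]).map(parse).batch(2)
records = [
    {name: tensor.numpy().tolist() for name, tensor in batch.items()}
    for batch in dataset
]
```

For distributed training, the same dataset would usually be built from a list of several TFRecord files so that each worker can read a different shard.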

The following articles describe and illustrate the recommended ways to save your data to TFRecord files and load TFRecord files: