Load data using Petastorm
Petastorm is an open source data access library. It enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, see the Petastorm GitHub page and the Petastorm API documentation.
Load data from Spark DataFrames using Petastorm
The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a `tf.data.Dataset` or a `torch.utils.data.DataLoader`. See the Spark Dataset Converter API section in the Petastorm API documentation.
The recommended workflow is:
- Use Apache Spark to load and optionally preprocess data.
- Use the Petastorm `spark_dataset_converter` method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
- Feed data into a DL framework for training or inference, as shown in the sketch after this list.
Configure cache directory
The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS FUSE path starting with `file:///dbfs/`, for example, `file:///dbfs/tmp/foo/`, which refers to the same location as `dbfs:/tmp/foo/`. You can configure the cache directory in two ways:
- In the cluster Spark config, add the line: `petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...`
- In your notebook, call `spark.conf.set()`:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')
```
You can either explicitly delete the cache after using it by calling `converter.delete()`, or manage the cache implicitly by configuring lifecycle rules in your object storage.
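For example, a short sketch of explicit cache cleanup (the DataFrame `df` is hypothetical):

```python
from petastorm.spark import make_spark_converter

converter = make_spark_converter(df)  # materializes df to the cache directory
# ... train or evaluate using the converter ...
converter.delete()  # removes the cached Parquet files for this converter
```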
Databricks supports DL training in three scenarios:
- Single-node training
- Distributed hyperparameter tuning
- Distributed training
For end-to-end examples, see the following notebooks:
- Simplify data conversion from Spark to TensorFlow
- Simplify data conversion from Spark to PyTorch
Load Parquet files directly using Petastorm
This method is less preferred than the Petastorm Spark converter API.
The recommended workflow is:
- Use Apache Spark to load and optionally preprocess data.
- Save data in Parquet format into a DBFS path that has a companion FUSE mount.
- Load data in Petastorm format via the FUSE mount point.
- Use data in a DL framework for training or inference, as in the sketch after this list.
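A minimal sketch of this workflow for TensorFlow, assuming a hypothetical DBFS path (`dbfs:/tmp/parquet_data`, reachable through its FUSE mount as `file:///dbfs/tmp/parquet_data`) and a hypothetical source table:

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Steps 1-2: load data with Spark and save it as Parquet to DBFS (hypothetical path).
df = spark.read.table("training_data")
df.write.mode("overwrite").parquet("dbfs:/tmp/parquet_data")

# Step 3: read the Parquet files through the FUSE mount with Petastorm.
with make_batch_reader("file:///dbfs/tmp/parquet_data") as reader:
    # Step 4: wrap the reader as a tf.data.Dataset for training or inference.
    dataset = make_petastorm_dataset(reader)
    for batch in dataset.take(1):
        print(batch)
```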
For an end-to-end example, see the example notebook.