Load data for machine learning and deep learning
This section covers loading data specifically for machine learning (ML) and deep learning (DL) applications. For general information about loading data, see Ingest data into a Databricks lakehouse.
Store files for data loading and model checkpointing
Machine learning applications may need to use shared storage for data loading and model checkpointing. This is particularly important for distributed deep learning.
Azure Databricks provides the Databricks File System (DBFS) for accessing data on a cluster using both Spark and local file APIs.
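As a minimal sketch of checkpointing through local file APIs, the helpers below write and reload a training state with ordinary Python file calls. It assumes DBFS is reachable on the cluster under a local path such as /dbfs/...; the helper names, the file naming scheme, and the example directory are hypothetical and not part of any Databricks API.

```python
import os
import pickle


def save_checkpoint(state, checkpoint_dir, step):
    """Write `state` to <checkpoint_dir>/ckpt-<step>.pkl and return the path."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"ckpt-{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_latest_checkpoint(checkpoint_dir):
    """Return (step, state) for the highest step found, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        if name.startswith("ckpt-") and name.endswith(".pkl"):
            # Parse the step number out of "ckpt-<step>.pkl".
            steps.append(int(name[len("ckpt-"):-len(".pkl")]))
    if not steps:
        return None
    latest = max(steps)
    with open(os.path.join(checkpoint_dir, f"ckpt-{latest}.pkl"), "rb") as f:
        return latest, pickle.load(f)


# Hypothetical usage on a cluster (the directory is an assumption):
# save_checkpoint({"epoch": 3, "weights": weights}, "/dbfs/tmp/my_run/checkpoints", step=3)
```

Because the helpers only use standard file calls, the same code runs against any directory, which makes it possible to try the logic locally before pointing it at shared storage.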
Load tabular data
You can load tabular machine learning data from tables or files (for example, see Read CSV files). You can convert Apache Spark DataFrames into pandas DataFrames using the PySpark method toPandas(), and then optionally convert to NumPy format using the pandas method to_numpy().
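The conversion chain described above can be sketched as a small helper. This is an illustration under assumptions, not a prescribed pattern: the function name spark_to_numpy, the table name, and the column names are hypothetical. Note that toPandas() collects all selected rows onto the driver, so the sketch selects only the needed columns first.

```python
def spark_to_numpy(spark_df, columns):
    """Convert selected columns of a Spark DataFrame to a NumPy array.

    `spark_df` is expected to be a pyspark.sql.DataFrame.
    """
    # Narrow to the needed columns before collecting to the driver.
    pandas_df = spark_df.select(*columns).toPandas()
    # pandas DataFrame -> NumPy ndarray.
    return pandas_df.to_numpy()


# Hypothetical usage in a notebook where `spark` is already defined
# (the table and column names are assumptions):
# features = spark_to_numpy(spark.read.table("training_data"), ["x1", "x2"])
```

This approach suits datasets that fit in driver memory; larger datasets are better handled with the distributed methods covered later in this section.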
Prepare data to fine-tune large language models
You can prepare your data for fine-tuning open-source large language models with Hugging Face Transformers and Hugging Face Datasets.
Prepare data for fine tuning Hugging Face models
Prepare data for distributed deep learning training
This section covers two methods for preparing data for distributed training: Petastorm and TFRecords.