What is the medallion lakehouse architecture?
The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. This architecture guarantees atomicity, consistency, isolation, and durability as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics. The terms bronze (raw), silver (validated), and gold (enriched) describe the quality of the data in each of these layers.
The medallion architecture does not replace dimensional modeling techniques. Schemas and tables within each layer can take on a variety of forms and degrees of normalization depending on the frequency and nature of data updates and the downstream use cases for the data.
Organizations can leverage the Databricks lakehouse to create and maintain validated datasets accessible throughout the company. Adopting an organizational mindset focused on curating data-as-products is a key step in successfully building a data lakehouse.
Ingest raw data to the bronze layer
The bronze layer contains unvalidated data. Data ingested in the bronze layer typically:
- Maintains the raw state of the data source.
- Is appended incrementally and grows over time.
- Can be any combination of streaming and batch transactions.
Retaining the full, unprocessed history of each dataset in an efficient storage format provides the ability to recreate any state of a given data system.
Additional metadata (such as source file names or the time the data was processed) may be added on ingest to improve discoverability, describe the state of the source dataset, and optimize performance in downstream applications.
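The bronze behavior described above (append only, raw values preserved, metadata added on ingest) can be sketched in a few lines. This is a minimal illustration in plain Python, not Databricks code: a list of dictionaries stands in for a bronze table, and the names `ingest_to_bronze`, `_source_file`, and `_processed_at` are hypothetical choices made for this example.

```python
from datetime import datetime, timezone


def ingest_to_bronze(bronze, raw_records, source_file, processed_at=None):
    """Append raw records to the bronze list, adding ingest metadata.

    Source values are copied as-is: no validation, casting, or deduplication.
    """
    processed_at = processed_at or datetime.now(timezone.utc)
    for record in raw_records:
        row = dict(record)  # copy so the caller's record is not mutated
        row["_source_file"] = source_file  # hypothetical metadata column
        row["_processed_at"] = processed_at.isoformat()  # hypothetical metadata column
        bronze.append(row)
    return bronze


# A plain list stands in for the bronze table (assumption for this sketch).
bronze_table = []

# First batch: includes a duplicate and a malformed amount, both kept as-is.
ingest_to_bronze(
    bronze_table,
    [
        {"order_id": "1", "amount": "19.99"},
        {"order_id": "1", "amount": "19.99"},
        {"order_id": "2", "amount": "oops"},
    ],
    source_file="orders_2024_01_01.json",
)

# Second batch: appended incrementally, so the table only grows.
ingest_to_bronze(
    bronze_table,
    [{"order_id": "3", "amount": "5.00"}],
    source_file="orders_2024_01_02.json",
)
```

Because nothing is filtered or corrected at this stage, the duplicate and the malformed record both remain in the table, which is what makes it possible to reprocess the full history later.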
Validate and deduplicate data in the silver layer
Recall that while the bronze layer contains the entire data history in a nearly raw state, the silver layer represents a validated, enriched version of our data that can be trusted for downstream analytics.
While Databricks believes strongly in the lakehouse vision driven by bronze, silver, and gold tables, simply implementing a silver layer efficiently will immediately unlock many of the potential benefits of the lakehouse.
For any data pipeline, the silver layer may contain more than one table.
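As a rough sketch of the validation and deduplication that the silver layer performs, the following plain Python example casts and checks each bronze row, sets aside rows that fail, and keeps one row per key. The sample rows, the `build_silver` function, the validation rules, and the "last record wins" policy are all assumptions made for illustration. A real pipeline would typically express the same logic against lakehouse tables.

```python
def build_silver(bronze_rows):
    """Return (silver_rows, rejected_rows) from raw bronze rows.

    Assumed rules for this sketch: `order_id` must be present, `amount`
    must parse as a non-negative number, and the last valid record seen
    for each `order_id` wins.
    """
    silver_by_id = {}
    rejected = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            rejected.append(row)  # unparseable or missing amount
            continue
        if not order_id or amount < 0:
            rejected.append(row)  # fails the assumed business rules
            continue
        # Keying by order_id deduplicates; later rows overwrite earlier ones.
        silver_by_id[order_id] = {"order_id": order_id, "amount": amount}
    return list(silver_by_id.values()), rejected


# Hypothetical bronze rows: one duplicate and one malformed amount.
sample_bronze_rows = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "oops"},
    {"order_id": "3", "amount": "5.00"},
]

silver_rows, rejected_rows = build_silver(sample_bronze_rows)
```

Keeping the rejected rows in a separate collection, rather than dropping them silently, is one possible design choice. It leaves a trail for investigating data quality problems at the source.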
Power analytics with the gold layer
Data in the gold layer is often highly refined and aggregated, and it powers analytics, machine learning, and production applications. While all tables in the lakehouse should serve an important purpose, gold tables represent data that has been transformed into knowledge, rather than just information.
Analysts largely rely on gold tables for their core responsibilities, and data shared with a customer would rarely be stored outside this level.
Updates to these tables are completed as part of regularly scheduled production workloads, which helps control costs and allows service level agreements (SLAs) for data freshness to be established.
While the lakehouse doesn't have the same deadlock issues that you may encounter in an enterprise data warehouse, gold tables are often stored in a separate storage container to help avoid cloud limits on data requests.
In general, because aggregations, joins, and filtering are handled before data is written to the gold layer, users should see low-latency query performance on data in gold tables.
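To illustrate why precomputing aggregations helps query latency, the sketch below builds a small gold-style summary from validated rows. It is plain Python under stated assumptions: the sample rows, the `order_date` field, and the `build_gold_daily_revenue` function are hypothetical, and a dictionary stands in for a gold table.

```python
from collections import defaultdict


def build_gold_daily_revenue(silver_rows):
    """Aggregate validated orders into one summary row per day."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in silver_rows:
        totals[row["order_date"]] += row["amount"]
        counts[row["order_date"]] += 1
    # Sort by date so the summary has a stable, predictable order.
    return {
        day: {"order_count": counts[day], "revenue": round(totals[day], 2)}
        for day in sorted(totals)
    }


# Hypothetical validated rows, as they might look after silver processing.
sample_silver_rows = [
    {"order_id": "1", "order_date": "2024-01-01", "amount": 20.0},
    {"order_id": "2", "order_date": "2024-01-01", "amount": 5.5},
    {"order_id": "3", "order_date": "2024-01-02", "amount": 10.0},
]

gold_daily_revenue = build_gold_daily_revenue(sample_silver_rows)

# A consumer reads the precomputed value instead of scanning every order.
revenue_jan_1 = gold_daily_revenue["2024-01-01"]["revenue"]
```

The aggregation cost is paid once, when the summary is written. Each later read is a simple lookup, which mirrors the reasoning given above for gold tables.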