The scope of the lakehouse platform
A modern data and AI platform framework
To discuss the scope of the Databricks Data Intelligence Platform, it is helpful to first define a basic framework for the modern data and AI platform:
Overview of the lakehouse scope
The Databricks Data Intelligence Platform covers the complete modern data platform framework. It is built on the lakehouse architecture and powered by a data intelligence engine that understands the unique qualities of your data. It is an open and unified foundation for ETL, ML/AI, and DWH/BI workloads, and has Unity Catalog as the central data and AI governance solution.
Personas of the platform framework
The framework covers the primary data team members (personas) working with the applications in the framework:
- Data engineers provide data scientists and business analysts with accurate and reproducible data for timely decision-making and real-time insights. They implement highly consistent and reliable ETL processes to increase user confidence and trust in data. They ensure that data is well integrated with the various pillars of the business and typically follow software engineering best practices.
- Data scientists blend analytical expertise and business understanding to transform data into strategic insights and predictive models. They are adept at translating business challenges into data-driven solutions, be that through retrospective analytical insights or forward-looking predictive modeling. Leveraging data modeling and machine learning techniques, they design, develop, and deploy models that unveil patterns, trends, and forecasts from data. They act as a bridge, converting complex data narratives into comprehensible stories, ensuring business stakeholders not only understand but can also act upon the data-driven recommendations, in turn driving a data-centric approach to problem-solving within an organization.
- ML engineers (machine learning engineers) lead the practical application of data science in products and solutions by building, deploying, and maintaining machine learning models. Their primary focus pivots towards the engineering aspect of model development and deployment. ML Engineers ensure the robustness, reliability, and scalability of machine learning systems in live environments, addressing challenges related to data quality, infrastructure, and performance. By integrating AI and ML models into operational business processes and user-facing products, they facilitate the utilization of data science in solving business challenges, ensuring models don't just stay in research but drive tangible business value.
- Business analysts empower stakeholders and business teams with actionable data. They often interpret data and create reports or other documentation for leadership using standard BI tools. They are typically the go-to point of contact for non-technical business and operations colleagues for quick analysis questions.
- Business partners are important stakeholders in an increasingly networked business world. They are defined as a company or individuals with whom a business has a formal relationship to achieve a common goal, and can include vendors, suppliers, distributors, and other third-party partners. Data sharing is an important aspect of business partnerships, as it enables the transfer and exchange of data to enhance collaboration and data-driven decision-making.
Domains of the platform framework
The platform consists of multiple domains:
- Storage: In the cloud, data is mainly stored in scalable, efficient, and resilient object storage on cloud providers.
- Governance: Capabilities around data governance, such as access control, auditing, metadata management, lineage tracking, and monitoring for all data and AI assets.
- AI engine: The AI engine provides generative AI capabilities for the whole platform.
- Ingest & transform: The capabilities for ETL workloads.
- Advanced analytics, ML, and AI: All capabilities around machine learning, AI, Generative AI, and also streaming analytics.
- Data warehouse: The domain supporting DWH and BI use cases.
- Orchestration: Central workflow management of data processing, machine learning, and analytics pipelines.
- ETL & DS tools: The front-end tools that data engineers, data scientists, and ML engineers primarily use for work.
- BI tools: The front-end tools that BI analysts primarily use for work.
- Collaboration: Capabilities for data sharing between two or more parties.
The scope of the Databricks Platform
The Databricks Data Intelligence Platform and its components can be mapped to the framework in the following way:
Data workloads on Azure Databricks
Most importantly, the Databricks Data Intelligence Platform covers all relevant workloads for the data domain in one platform, with Apache Spark/Photon as the engine:
Ingest & transform
For data ingestion, Auto Loader incrementally and automatically processes files landing in cloud storage in scheduled or continuous jobs - without the need to manage state information. Once ingested, raw data needs to be transformed so it's ready for BI and ML/AI. Databricks provides powerful ETL capabilities for data engineers, data scientists, and analysts.
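Assuming a Databricks notebook (where `spark` is predefined) and hypothetical landing, schema, and checkpoint paths, an Auto Loader ingestion job might be sketched as:

```python
# Minimal Auto Loader sketch (runs on a Databricks cluster; paths and the
# target table name are hypothetical placeholders).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                      # format of the landing files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/raw")  # inferred schema and state tracking
    .load("/mnt/landing/events/")                             # files arriving in cloud storage
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/raw")  # exactly-once progress tracking
    .trigger(availableNow=True)                            # process available files, then stop
    .toTable("raw_events")
)
```

With `availableNow=True` the same stream can run as a scheduled job; dropping the trigger turns it into a continuous ingestion pipeline, without any change to the state handling.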
Delta Live Tables (DLT) allows ETL jobs to be written in a declarative way, simplifying the entire implementation process. Data quality can be improved by defining data expectations.
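A declarative DLT table with a data quality expectation might look like the following sketch (the table, column, and expectation names are hypothetical):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned events, declared as a Delta Live Tables table.")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # drop rows that violate the expectation
def clean_events():
    # DLT infers dependencies and materializes the result; no imperative
    # write logic is needed.
    return spark.read.table("raw_events").where(col("ts").isNotNull())
```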
Advanced analytics, ML, and AI
The platform includes Databricks Mosaic AI, a set of fully integrated machine learning and AI tools for classical machine learning and deep learning. It covers the entire workflow from preparing data to building machine learning and deep learning models.
Spark Structured Streaming and DLT enable real-time analytics.
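As a sketch, a Structured Streaming aggregation over a hypothetical `raw_events` Delta table could look like this (column names and paths are placeholders):

```python
from pyspark.sql.functions import window, count

# Count events per minute and type in near real time (Databricks runtime assumed).
counts = (
    spark.readStream.table("raw_events")
    .withWatermark("ts", "10 minutes")                 # bound state for late-arriving data
    .groupBy(window("ts", "1 minute"), "event_type")
    .agg(count("*").alias("events"))
)

(
    counts.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/agg")
    .toTable("events_per_minute")
)
```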
Data warehouse
The Databricks Data Intelligence Platform also has a complete data warehouse solution with Databricks SQL, centrally governed by Unity Catalog with fine-grained access control.
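The warehouse can also be queried from outside the workspace, for example with the Databricks SQL Connector for Python; the hostname, HTTP path, token, and `sales` table below are placeholders:

```python
from databricks import sql  # databricks-sql-connector package

# Hypothetical connection values; use your workspace hostname,
# the warehouse's HTTP path, and a personal access token.
with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
        for row in cur.fetchall():
            print(row)
```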
Outline of Azure Databricks feature areas
This is a mapping of the Databricks Data Intelligence Platform features to the other layers of the framework, from bottom to top:
Cloud storage
All data for the lakehouse is stored in the cloud provider's object storage. Databricks supports three cloud providers: AWS, Azure, and GCP. Files in various structured and semi-structured formats (for example, Parquet, CSV, JSON, and Avro) as well as unstructured formats (such as images and documents) are ingested and transformed using either batch or streaming processes.
Delta Lake is the recommended data format for the lakehouse (file transactions, reliability, consistency, updates, and so on) and is completely open source to avoid lock-in. In addition, Delta Universal Format (UniForm) allows you to read Delta tables with Iceberg reader clients.
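As a sketch (the catalog, schema, and table names are hypothetical), UniForm can be enabled through table properties when a Delta table is created:

```python
# Create a Delta table that Iceberg reader clients can also consume.
spark.sql("""
  CREATE TABLE main.default.orders (id BIGINT, amount DOUBLE)
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```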
No proprietary data formats are used in the Databricks Data Intelligence Platform.
Data governance
On top of the storage layer, Unity Catalog offers a wide range of data governance capabilities, including metadata management in the metastore, access control, auditing, data discovery, and data lineage.
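Access control in Unity Catalog is expressed in SQL; a sketch with hypothetical catalog, schema, table, and group names:

```python
# Grant a group read access to one table; privileges cascade through
# the catalog > schema > table hierarchy.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
```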
External SQL sources can be integrated into the lakehouse and Unity Catalog through lakehouse federation.
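A federation setup is declared as a connection plus a foreign catalog; the host, credentials, and names below are placeholders:

```python
# Register an external PostgreSQL database so its tables can be queried
# and governed through Unity Catalog (hypothetical connection details).
spark.sql("""
  CREATE CONNECTION pg_conn TYPE postgresql
  OPTIONS (host '<db-host>', port '5432', user '<user>', password '<password>')
""")
spark.sql("""
  CREATE FOREIGN CATALOG pg_catalog
  USING CONNECTION pg_conn
  OPTIONS (database 'appdb')
""")
```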
Orchestration
Databricks Jobs enable you to run diverse workloads for the full data and AI lifecycle on any cloud. They allow you to orchestrate jobs as well as Delta Live Tables pipelines for SQL, Spark, notebooks, dbt, ML models, and more.
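Jobs can also be managed programmatically, for example with the Databricks SDK for Python; the job name, notebook path, and cluster ID below are hypothetical:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up workspace authentication from the environment

# Create a single-task job that runs a notebook, then trigger it once.
job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
            existing_cluster_id="<cluster-id>",  # hypothetical cluster ID
        )
    ],
)
w.jobs.run_now(job_id=job.job_id)
```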
ETL & DS tools
At the consumption layer, data engineers and ML engineers typically work with the platform using IDEs. Data scientists often prefer notebooks and use the ML & AI runtimes, and the machine learning workflow system MLflow to track experiments and manage the model lifecycle.
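A minimal MLflow tracking sketch (the experiment path, parameter, and metric are hypothetical):

```python
import mlflow

mlflow.set_experiment("/Shared/demo-experiment")  # hypothetical experiment path

# Record parameters and metrics for one training run; on Databricks the
# tracking server is built in, so no extra configuration is needed.
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    # mlflow.sklearn.log_model(model, "model")  # also log the trained model
```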
BI tools
Business analysts typically use their preferred BI tool to access the Databricks data warehouse. Databricks SQL can be queried by various analysis and BI tools; see BI and visualization.
In addition, the platform offers query and analysis tools out of the box:
- Dashboards to drag-and-drop data visualizations and share insights.
- SQL editor for SQL analysts to analyze data.
Collaboration
Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations regardless of the computing platforms they use.
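On the recipient side, a shared table can be read with the open source delta-sharing Python connector; the profile file and the share, schema, and table names below are hypothetical:

```python
import delta_sharing  # pip install delta-sharing

# "config.share" is a credential profile file issued by the data provider.
profile = "config.share"
table_url = profile + "#retail.sales.orders"  # <share>.<schema>.<table>

# Load the shared Delta table into a pandas DataFrame, no Databricks account needed.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```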