Operational excellence for the data lakehouse

The architectural principles of the operational excellence pillar cover all operational processes that keep the lakehouse running. The pillar addresses how to operate, manage, and monitor the lakehouse efficiently so that it delivers business value.

Operational excellence lakehouse architecture diagram for Databricks.

Principles of operational excellence

  1. Optimize build and release processes

    Use software engineering best practices across your entire lakehouse environment. Build and release using continuous integration and continuous delivery pipelines for both DevOps and MLOps.

  2. Automate deployments and workloads

    Automating deployments and workloads for the lakehouse helps standardize these processes, eliminate human error, improve productivity, and provide greater repeatability. This includes using "configuration as code" to avoid configuration drift, and "infrastructure as code" to automate the provisioning of all required lakehouse and cloud services.

    For ML specifically, processes should drive automation: not every step of a process can or should be automated. People still determine the business questions, and some models will always need human oversight before deployment. The development process is therefore primary, and each module in the process should be automated as needed, which allows automation and customization to be built out incrementally.

  3. Set up monitoring, alerting, and logging

    Workloads in the lakehouse typically integrate Databricks platform services and external cloud services, for example as data sources or targets. A workload runs successfully only if every service in the execution chain is functioning properly. When one is not, monitoring, alerting, and logging are essential for detecting and tracking problems and for understanding system behavior.

  4. Manage capacity and quotas

    For any service that is launched in a cloud, take limits into account, for example access rate limits, number of instances, number of users, and memory requirements. Understand these limits before designing a solution.
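
As an illustration of principle 4, the following minimal sketch shows one way to check a planned design against service limits before building it. The limit names, quota values, and the 80% warning threshold are hypothetical placeholders, not actual Databricks or cloud provider limits; real values must come from the documentation of each service involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """A service limit plus the usage the planned workload is expected to add."""
    name: str
    quota: float          # hard limit imposed by the service (must be > 0)
    current_usage: float  # what existing workloads already consume
    planned_usage: float  # what the new design would add on top


def headroom_report(limits, warn_ratio=0.8):
    """Classify each limit as 'ok', 'warn', or 'exceeded' for the planned design."""
    report = {}
    for limit in limits:
        projected = limit.current_usage + limit.planned_usage
        ratio = projected / limit.quota
        if ratio > 1.0:
            status = "exceeded"   # the design cannot run within this limit
        elif ratio >= warn_ratio:
            status = "warn"       # it fits, but leaves little headroom
        else:
            status = "ok"
        report[limit.name] = (status, round(ratio, 2))
    return report


# Hypothetical limits of the kinds named above: rate, instances, users.
limits = [
    Limit("api_requests_per_second", quota=30, current_usage=10, planned_usage=5),
    Limit("compute_instances", quota=100, current_usage=70, planned_usage=15),
    Limit("workspace_users", quota=500, current_usage=480, planned_usage=40),
]

for name, (status, ratio) in headroom_report(limits).items():
    print(f"{name}: {status} ({ratio:.0%} of quota)")
```

Running a check like this during design review surfaces limits that need a quota increase request or a different architecture before any workload is deployed.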

Next: Best practices for operational excellence

See Best practices for operational excellence.