Best practices for interoperability and usability
This article covers best practices for interoperability and usability, organized by architectural principles listed in the following sections.
1. Define standards for integration
Use standard and reusable integration patterns for external integration
Integration standards are important because they provide guidelines for how data should be represented, exchanged, and processed across different systems and applications. These standards help ensure that data is compatible, high quality, and interoperable across various sources and destinations.
The Databricks Data Intelligence Platform comes with a comprehensive REST API that allows you to programmatically manage nearly all aspects of the platform. The REST API server runs in the control plane and provides a unified endpoint for managing the Azure Databricks platform.
The REST API provides the lowest level of integration and can always be used. However, the preferred way to integrate with Azure Databricks is to use higher-level abstractions such as the Databricks SDKs or the Databricks CLI. The CLI is shell-based and allows easy integration of the Databricks platform into CI/CD and MLOps workflows.
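At the lowest level, any REST integration is an authenticated HTTP call against the workspace URL. The sketch below builds such a request with the Python standard library; the host, token, and endpoint are placeholders, and in practice the Databricks SDKs and CLI wrap these calls for you.

```python
import urllib.request

# Hypothetical workspace URL and personal access token -- substitute your own.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

# Build an authenticated request against the Jobs API endpoint.
req = urllib.request.Request(
    f"{HOST}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
# urllib.request.urlopen(req) would execute the call against a live workspace.
```

The same bearer-token pattern applies to every endpoint in the REST API; the SDKs add typed clients, retries, and unified authentication on top of it.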
Use optimized connectors to ingest data sources into the lakehouse
Azure Databricks offers a variety of ways to help you ingest data into Delta Lake.
Databricks provides optimized connectors for stream messaging services such as Apache Kafka for near real-time ingestion of data.
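As a sketch, near real-time ingestion from Kafka with the built-in connector might look like the following. The broker address, topic, checkpoint path, and table name are placeholders, and `spark` is the session provided by the Databricks runtime:

```python
# Sketch: incremental ingestion from Kafka into a Delta table.
# Assumes a running Spark session on Databricks; broker, topic, and
# paths below are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1.example.com:9092")
      .option("subscribe", "events")  # topic to ingest
      .load())

(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .writeStream
   .option("checkpointLocation", "/tmp/checkpoints/events")
   .toTable("raw_events"))
```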
Databricks provides built-in integrations to many cloud-native data systems and extensible JDBC support to connect to other data systems.
One option for integrating data sources without ETL is Lakehouse Federation. Lakehouse Federation is the query federation platform for Databricks. The term query federation describes a collection of features that allow users and systems to run queries against multiple data sources without having to migrate all the data into a unified system. Databricks uses Unity Catalog to manage query federation. Unity Catalog's data governance and data lineage tools ensure that data access is managed and audited for all federated queries run by users in your Databricks workspaces.
Note
Any query in the Databricks platform that uses a Lakehouse Federation source is sent to that source. Make sure the source system can handle the load. Also, be aware that if the source system is deployed in a different cloud region or cloud, there is an egress cost for each query.
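As a sketch, federating a hypothetical PostgreSQL database involves creating a connection and a foreign catalog in Unity Catalog. The host, secret scope, and object names below are placeholders:

```python
# Sketch: registering an external PostgreSQL database for query federation.
# Connection name, host, and credential references are placeholders.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS pg_sales TYPE postgresql
  OPTIONS (
    host 'pg.example.com',
    port '5432',
    user 'reader',
    password secret('my_scope', 'pg_password')
  )
""")

spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS sales_pg
  USING CONNECTION pg_sales
  OPTIONS (database 'sales')
""")

# Federated queries now run directly against the source system:
spark.sql("SELECT count(*) FROM sales_pg.public.orders").show()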
Reduce complexity of data engineering pipelines
Investing in reducing the complexity of data engineering pipelines enables scalability, agility and flexibility to be able to expand and innovate faster. Simplified pipelines make it easier to manage and adapt all of the operational needs of a data engineering pipeline: task orchestration, cluster management, monitoring, data quality, and error handling.
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations you want to perform on your data, and Delta Live Tables handles task orchestration, cluster management, monitoring, data quality, and error handling. See What is Delta Live Tables?.
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It can reliably read data files from cloud storage. An important aspect of both Delta Live Tables and Auto Loader is their declarative nature: without them, you would have to build complex pipelines that integrate different cloud services, such as a notification service and a queuing service, to reliably read cloud files based on events and to reliably combine batch and streaming sources.
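A minimal sketch of a Delta Live Tables pipeline in Python that uses Auto Loader (`cloudFiles`) illustrates this declarative style. The storage path and table names are placeholders, and the code runs only as part of a DLT pipeline, not as a standalone script:

```python
import dlt
from pyspark.sql.functions import col

# Sketch: Auto Loader ingests raw JSON incrementally; Delta Live Tables
# handles orchestration, retries, monitoring, and data quality.
# The storage path below is a placeholder.
@dlt.table(comment="Raw events ingested incrementally from cloud storage")
def raw_events():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("abfss://landing@storageacct.dfs.core.windows.net/events/"))

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # declarative data quality rule
def clean_events():
    return dlt.read_stream("raw_events").where(col("event_type").isNotNull())
```

Note that nothing in this code provisions clusters, schedules tasks, or handles failures; those operational concerns are delegated to the framework.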
Auto Loader and Delta Live Tables reduce system dependencies and complexity and greatly improve the interoperability with the cloud storage and between different paradigms such as batch and streaming. As a side effect, the simplicity of the pipelines increases the usability of the platform.
Use infrastructure as code (IaC) for deployments and maintenance
HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. See Operational Excellence: Use Infrastructure as code for deployments and maintenance.
2. Utilize open interfaces and open data formats
Use open data formats
Using an open data format means there are no restrictions on its use. This is important because it removes barriers to accessing and using the data for analysis and driving business insights. Open formats such as Delta Lake, which builds on open source technologies including Apache Spark, also add features that boost performance, such as support for ACID transactions and unified streaming and batch data processing. Furthermore, open source is community-driven: the community is constantly improving existing features and adding new ones, making it easier for users to get the most out of their projects.
The primary data format used in the Data Intelligence Platform is Delta Lake, a fully open data format that offers many benefits, from reliability features to performance enhancements, see Use a data format that supports ACID transactions and Best practices for performance efficiency.
Because of its open nature, Delta Lake comes with a large ecosystem. Dozens of third-party tools and applications support Delta Lake.
To further enhance interoperability, the Delta Universal Format (UniForm) allows you to read Delta tables with Iceberg reader clients. UniForm automatically generates Iceberg metadata asynchronously, without rewriting the data, so that Iceberg clients can read Delta tables as if they were Iceberg tables. A single copy of the data files serves both formats.
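Enabling UniForm is a table-level setting. A sketch, assuming a Unity Catalog table (the catalog, schema, and table names are placeholders):

```python
# Sketch: creating a Delta table whose metadata is also published in
# Iceberg format (UniForm), so Iceberg clients can read it directly.
# The three-level table name is a placeholder.
spark.sql("""
  CREATE TABLE main.default.events_uniform (id BIGINT, ts TIMESTAMP)
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```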
Enable secure data and AI sharing for all data assets
Sharing data and AI assets can lead to better collaboration and decision making. However, when sharing data, it's important to maintain control, protect your data, and ensure compliance with relevant data sharing laws and regulations.
Delta Sharing is an open protocol developed by Databricks for securely sharing data with other organizations, regardless of the computing platforms they use. If you want to share data with users outside of your Databricks workspace, regardless of whether they use Databricks, you can use open Delta Sharing to securely share your data. If you want to share data with users who have a Databricks workspace that is enabled for Unity Catalog, you can use Databricks-to-Databricks Delta Sharing.
In both cases, you can share tables, views, volumes, models, and notebooks.
Use the open Delta Sharing protocol for sharing data with partners
Delta Sharing provides an open solution for securely sharing live data from your lakehouse to any computing platform. Recipients do not need to be on the Databricks platform, on the same cloud, or on any cloud at all. Delta Sharing natively integrates with Unity Catalog, enabling organizations to centrally manage and audit shared data and AI assets across the enterprise and confidently share data and AI assets that meet security and compliance requirements.
Data providers can share live data and AI models from where they are stored in the data platform without replicating or moving it to another system. This approach reduces the operational costs of data and AI sharing because data providers don't have to replicate data multiple times across clouds, geographies, or data platforms to each of their data consumers.
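On the recipient side, a shared table can be read with the open source `delta-sharing` Python connector and a credential file obtained from the data provider. A sketch, with placeholder profile path and share coordinates:

```python
import delta_sharing

# Placeholder path to the credential (profile) file received from the provider.
profile = "/path/to/config.share"

# Tables are addressed as <profile>#<share>.<schema>.<table>.
table_url = profile + "#my_share.my_schema.my_table"

# Load the shared table into a pandas DataFrame -- no Databricks account needed.
df = delta_sharing.load_as_pandas(table_url)
```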
Use Databricks-to-Databricks Delta Sharing between Databricks users
If you want to share data with users who don't have access to your Unity Catalog metastore, you can use Databricks-to-Databricks Delta Sharing, as long as the recipients have access to a Databricks workspace that is enabled for Unity Catalog. Databricks-to-Databricks sharing allows you to share data with users in other Databricks accounts, across cloud regions, and across cloud providers. It's a great way to securely share data across different Unity Catalog metastores in your own Databricks account.
Use open standards for your ML lifecycle management
Like using an open source data format, using open standards for your AI workflows has similar benefits in terms of flexibility, agility, cost, and security.
MLflow is an open source platform for managing the ML and AI lifecycle. Databricks offers a fully managed and hosted version of MLflow, integrated with enterprise security features, high availability, and other Databricks workspace features such as experiment and run management and notebook revision tracking.
The primary components are experiment tracking (to automatically log and track ML and deep learning models), MLflow Models (a standard format for packaging machine learning models), a model registry integrated with Unity Catalog, and scalable, enterprise-grade model serving.
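A minimal sketch of the open MLflow tracking API follows; the parameter and metric names are arbitrary examples, and on Databricks runs are logged to the managed tracking server automatically:

```python
import mlflow

# Log a run with the open source MLflow tracking API.
# On Databricks the tracking server is managed for you; elsewhere,
# point mlflow.set_tracking_uri() at your own server first.
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)  # example hyperparameter
    mlflow.log_metric("rmse", 0.27)        # example evaluation metric
```

Because the same API works against any MLflow tracking server, experiments are portable across environments, which is the point of using an open standard.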
3. Simplify new use case implementation
Provide a self-service experience across the platform
There are several benefits of a platform where users have autonomy to use the tools and capabilities depending on their needs. Investing in creating a self-service platform makes it easy to scale to serve more users and drives greater efficiency by minimizing the need for human involvement to provision users, resolve issues, and process access requests.
The Databricks Data Intelligence Platform has all the capabilities needed to provide a self-service experience. While there may be a mandatory approval step, the best practice is to fully automate the setup when a business unit requests access to the lakehouse: automatically provision the unit's new environment, synchronize users and use SSO for authentication, provide access control to shared data and separate object stores for the unit's own data, and so on. Together with a central data catalog of semantically consistent, business-ready data sets, new business units can quickly and securely access lakehouse capabilities and the data they need.
Use serverless compute
With serverless compute on the Azure Databricks platform, the compute layer runs in your Azure Databricks account rather than in your own cloud subscription. Cloud administrators no longer need to manage complex cloud environments that involve adjusting quotas, creating and maintaining network resources, and connecting to billing sources. Users benefit from near-zero cluster startup latency and improved query concurrency.
Use predefined compute templates
Predefined templates help control how compute resources can be created and used by users: limit cluster creation to prescribed settings or to a maximum number of clusters, simplify the user interface, or control costs by limiting the maximum cost per cluster.
The Data Intelligence Platform accomplishes this in two ways:
- Provide shared clusters as immediately available environments for users. On these clusters, use autoscaling down to a minimal number of nodes to avoid high idle costs.
- For a standardized environment, use compute policies to restrict cluster size or features or to define t-shirt-sized clusters (S, M, L).
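As an illustration, a cluster policy is a JSON document of rules. The sketch below, expressed as a Python dictionary, fixes autotermination, caps autoscaling, and restricts node types; the attribute paths follow the cluster policy definition language, and the concrete values are examples only.

```python
import json

# Sketch of a cluster policy definition. Each key is an attribute path in the
# cluster policy definition language; each value is a rule ("fixed", "range",
# "allowlist", ...). The concrete node types and limits are examples only.
policy = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
}

# Serialized form, as it would be submitted when creating the policy.
policy_json = json.dumps(policy, indent=2)
```

Clusters created under such a policy cannot exceed the given worker count or deviate from the allowed node types, which bounds the maximum cost per cluster.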
4. Ensure data consistency and usability
Offer reusable data-as-products that the business can trust
Organizations seeking to become AI- and data-driven often need to provide their internal teams with high-quality, trustworthy data. One approach to prioritizing quality and usability is to apply product thinking to your published data assets by creating well-defined "data products". Building such data products ensures that organizations establish standards and a trusted foundation of business truth for their data and AI goals. Data products ultimately deliver value when users and applications have the right data, at the right time, with the right quality, in the right format. While this value has traditionally been realized in the form of more efficient operations through lower costs, faster processes, and reduced risk, modern data products can also pave the way for new value-added offerings and data sharing opportunities within an organization's industry or partner ecosystem.
See the blog post Building High-Quality and Trusted Data Products with Databricks.
Publish data products semantically consistent across the enterprise
A data lake typically contains data from multiple source systems. These systems may have different names for the same concept (e.g., customer vs. account) or use the same identifier to refer to different concepts. So that business users can easily combine these data sets in a meaningful way, the data must be made homogeneous across all sources to be semantically consistent. In addition, for some data to be valuable for analysis, internal business rules, such as revenue recognition, must be applied correctly. To ensure that all users are using the correctly interpreted data, datasets with these rules must be made available and published to Unity Catalog. Access to the source data must be restricted to teams that understand the correct usage.
Provide a central catalog for discovery and lineage
A central catalog for discovery and lineage helps data consumers access data from multiple sources across the enterprise, thus reducing operational overhead for the central governance team.
In Unity Catalog, administrators and data stewards manage users and their access to data centrally across all workspaces in an Azure Databricks account. Users in different workspaces can share the same data and, depending on the user privileges centrally granted in Unity Catalog, can access data together.
For data discovery, Unity Catalog supports users with capabilities such as:
- Catalog Explorer is the primary user interface for many Unity Catalog features. You can use Catalog Explorer to view schema details, preview sample data, and view table details and properties. Administrators can view and change owners, and administrators and data object owners can grant and revoke permissions.
- Databricks Search enables users to easily and seamlessly find data assets such as tables, columns, views, dashboards, and models. Users are shown only results that are relevant to their search requests and that they have access to.
- Data lineage across all queries run on an Azure Databricks cluster or SQL warehouse. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, jobs, and dashboards related to the query. Lineage can be visualized in Catalog Explorer in near real time and retrieved with the Azure Databricks REST API.
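Lineage retrieval over REST can be sketched with the Python standard library. The host, token, and table name below are placeholders, and the exact endpoint path should be checked against the current REST API reference:

```python
import urllib.parse
import urllib.request

# Hypothetical workspace URL, token, and table name -- substitute your own.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

# Query parameters select the table and ask for related entities
# (notebooks, jobs, dashboards) alongside upstream/downstream tables.
params = urllib.parse.urlencode({
    "table_name": "main.sales.orders",
    "include_entity_lineage": "true",
})
req = urllib.request.Request(
    f"{HOST}/api/2.0/lineage-tracking/table-lineage?{params}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
# urllib.request.urlopen(req) would return the lineage graph as JSON.
```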
To allow enterprises to provide their users a holistic view of all data across all data platforms, Unity Catalog provides integration with enterprise data catalogs (sometimes referred to as the "catalog of catalogs").