Phase 9: Design observability strategy

In this phase, you design observability and monitoring strategies to ensure operational excellence and proactive issue resolution.

Azure Databricks provides built-in observability capabilities to monitor platform operations, workload performance, data quality, and model serving. Design your observability strategy to balance operational insights with monitoring costs and complexity.

Design job and pipeline monitoring strategy

Monitor job and pipeline execution to ensure data pipelines run successfully and identify failures quickly. Design your monitoring strategy based on workload criticality, SLAs, and operational requirements.

Job monitoring patterns

  • Real-time alerting: Configure email notifications or webhooks for critical job failures.
  • Trend analysis: Use the Workflow and Pipelines Monitoring page to track job run history and identify patterns.
  • Anomaly detection: Set up SQL alerts to monitor job duration anomalies or repeated failures.
  • SLA monitoring: Define SLAs for critical jobs and alert when jobs exceed expected runtimes.
  • Dependency tracking: Monitor job dependencies and upstream failures that impact downstream workloads.
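
The SLA-monitoring pattern above can be sketched as a simple check over recent run durations. This is an illustrative sketch, not a Databricks API: the job names and SLA thresholds are hypothetical, and in practice the run data would come from job run history or system tables.

```python
from dataclasses import dataclass

@dataclass
class JobRun:
    job_name: str
    duration_seconds: float
    succeeded: bool

# Hypothetical SLA table: maximum acceptable runtime per critical job.
SLA_SECONDS = {"nightly_etl": 3600, "hourly_ingest": 900}

def sla_breaches(runs):
    """Return runs that exceeded their job's SLA runtime."""
    return [
        r for r in runs
        if r.job_name in SLA_SECONDS and r.duration_seconds > SLA_SECONDS[r.job_name]
    ]

runs = [
    JobRun("nightly_etl", 4200, True),   # over the 1-hour SLA
    JobRun("hourly_ingest", 600, True),  # within SLA
]
print([r.job_name for r in sla_breaches(runs)])  # ['nightly_etl']
```

A check like this can feed a webhook or an incident-management system when it returns a non-empty list.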

Pipeline monitoring considerations

  • Lakeflow Spark Declarative Pipelines observability: Monitor pipeline run status, data quality expectations, and lineage.
  • Incremental processing: Track checkpoint information and incremental processing metrics.
  • Data freshness: Monitor pipeline latency and ensure data arrives within SLA windows.
  • Error handling: Design retry strategies and dead-letter queues for failed records.
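
The retry and dead-letter pattern from the last bullet can be sketched in a few lines. This is a generic illustration, not a pipeline API: the handler and record values are hypothetical, and a real pipeline would persist the dead-letter records to a table rather than a list.

```python
def process_with_dlq(records, handler, max_retries=3):
    """Apply handler to each record, retrying on failure; route records
    that still fail after max_retries to a dead-letter list for review."""
    processed, dead_letter = [], []
    for record in records:
        for attempt in range(max_retries):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    dead_letter.append({"record": record, "error": str(exc)})
    return processed, dead_letter

ok, dlq = process_with_dlq([1, 0, 2], lambda x: 10 / x)
# ok holds the successful results; dlq holds the record that kept failing
```

Routing poisoned records aside instead of failing the whole run keeps data freshness within SLA while preserving the failures for reprocessing.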

Best practices for job monitoring

  • Configure email notifications or webhooks for critical job failures.
  • Monitor jobs using system tables (system.workflow.job_runs, system.workflow.task_runs).
  • Set up SQL alerts to monitor job duration anomalies or repeated failures.
  • Define SLAs for critical jobs based on business requirements.
  • Implement runbook automation for common failure scenarios.
  • Review job performance trends regularly and optimize slow-running jobs.
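
The duration-anomaly alerting mentioned above can be approximated with a z-score over historical runtimes. This sketch assumes the history is fetched separately (for example, from the job-run system tables); the threshold of 3 standard deviations is a hypothetical starting point to tune.

```python
import statistics

def is_duration_anomaly(history, latest, z_threshold=3.0):
    """Flag a run whose duration deviates from the historical mean
    by more than z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

history = [600, 610, 590, 605, 595]   # recent durations in seconds
print(is_duration_anomaly(history, 1500))  # True: far outside normal range
print(is_duration_anomaly(history, 612))   # False: within normal variance
```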

For detailed job monitoring configuration, see Monitoring and observability for Lakeflow Jobs.

For Lakeflow Spark Declarative Pipelines observability, see Monitor pipelines.

Design Spark performance monitoring strategy

Monitor Spark job performance to identify bottlenecks such as skew, spill, long-running tasks, and memory or I/O issues. Design your Spark monitoring approach based on compute type and performance requirements.

Query profile for serverless and SQL warehouses

For serverless compute and SQL warehouses, use query profile to analyze and optimize query performance. Query profile provides detailed execution plans, stage-level metrics, and optimization recommendations.

Query profile capabilities

  • Visualize query execution plans with stage-level metrics.
  • Identify expensive operations (for example, sorts, joins, aggregations).
  • Analyze data skew and partition imbalance.
  • Review optimization recommendations from the query optimizer.
  • Compare query performance across runs.
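
The skew analysis in the list above boils down to comparing partition sizes against the mean. As a hedged sketch (the row counts are hypothetical; query profile surfaces the real per-stage numbers):

```python
def skew_ratio(partition_row_counts):
    """Ratio of the largest partition to the mean partition size;
    values well above 1 indicate the skew that stage-level metrics
    in query profile would surface as uneven task durations."""
    mean = sum(partition_row_counts) / len(partition_row_counts)
    return max(partition_row_counts) / mean

counts = [1_000, 1_200, 900, 25_000]  # one hot partition
print(round(skew_ratio(counts), 1))
```

A ratio near 1 means balanced partitions; a large ratio points at the stage to repartition or salt.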

Best practices for query profile

  • Review query profile for slow queries to identify optimization opportunities.
  • Focus on stages with high execution time or data skew.
  • Implement suggested optimizations such as partition pruning and broadcast joins.
  • Monitor query performance after optimization to measure improvement.

Spark UI for classic compute

For classic compute clusters, use the Spark UI to identify performance bottlenecks and resource constraints. The Spark UI provides detailed metrics on executors, stages, tasks, and storage.

Spark UI capabilities

  • Monitor stage execution time and task distribution.
  • Identify data skew by analyzing task duration variance.
  • Track memory usage and spill metrics.
  • Review executor metrics (for example, CPU, memory, disk I/O).
  • Analyze shuffle read/write patterns.
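
Identifying skew from task duration variance, as the second bullet describes, can be expressed as a coefficient of variation over the task durations shown in the Spark UI. The durations and 0.5 threshold here are hypothetical illustrations.

```python
import statistics

def task_skew(durations_s, cv_threshold=0.5):
    """Coefficient of variation across task durations in a stage;
    a high value suggests data is distributed unevenly across tasks."""
    cv = statistics.stdev(durations_s) / statistics.mean(durations_s)
    return cv, cv > cv_threshold

balanced = [10, 11, 9, 10]   # tasks finish at similar times
skewed = [10, 11, 9, 120]    # one straggler task dominates the stage
print(task_skew(balanced)[1], task_skew(skewed)[1])  # False True
```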

Best practices for Spark UI

  • Enable cluster log delivery to cloud storage for long-term log retention.
  • Monitor cluster metrics (for example, CPU, memory, disk I/O) to identify resource constraints.
  • Review Spark event logs to troubleshoot slow jobs and optimize configurations.
  • Focus on stages with high shuffle or spill to reduce memory pressure.
  • Optimize partition sizes to reduce task skew.
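
The partition-size advice above can be made concrete with a quick calculation. A common rule of thumb (an assumption here, to be tuned per workload) is to target roughly 128 MiB per partition:

```python
import math

def target_partitions(total_bytes, target_partition_mb=128):
    """Number of partitions so each holds roughly target_partition_mb,
    a common rule of thumb for reducing task skew and shuffle pressure."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))

print(target_partitions(10 * 1024**3))  # 10 GiB -> 80 partitions of ~128 MiB
```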

For query profile documentation, see Query profile.

For Spark UI troubleshooting guidance, see Apache Spark overview.

Design model monitoring strategy

Monitor deployed ML models to track performance, health, and request metrics. Design your model monitoring strategy based on model criticality, SLA requirements, and compliance needs.

Model Serving observability capabilities

  • Endpoint health: Monitor endpoint availability and health status.
  • Invocation metrics: Track request counts, latency, and throughput.
  • Inference tables: Log predictions and analyze model behavior over time.
  • Model version tracking: Monitor model version usage and deployment history.
  • Error monitoring: Track error rates and failure patterns.

Model monitoring patterns

  • Real-time alerting: Set up alerts for anomalies such as high latency or error rates.
  • SLA monitoring: Define latency and availability SLAs for production models.
  • Inference analysis: Use inference tables to analyze prediction distributions and detect drift.
  • A/B testing: Monitor performance across model versions to validate improvements.
  • Rollback procedures: Define automated rollback triggers based on performance thresholds.
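
The rollback-trigger pattern above can be sketched as a threshold check over recent endpoint metrics. The percentile helper, latency values, and thresholds are hypothetical; in practice the metrics would come from endpoint invocation monitoring or inference tables.

```python
def percentile(values, p):
    """Nearest-rank percentile (simple approximation)."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

def should_roll_back(latencies_ms, errors, requests,
                     p95_limit_ms=500, error_rate_limit=0.02):
    """Trigger a rollback when p95 latency or error rate breaches
    the thresholds documented for the production model."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests
    return p95 > p95_limit_ms or error_rate > error_rate_limit

latencies = [120, 150, 130, 900, 140, 135, 125, 130, 145, 150]
print(should_roll_back(latencies, errors=1, requests=100))  # True: p95 breach
```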

Best practices for model monitoring

  • Monitor endpoint health and invocation metrics to identify performance issues.
  • Track latency and request throughput to ensure SLA compliance.
  • Use inference tables to log predictions and analyze model behavior.
  • Set up alerts for anomalies such as high latency or error rates.
  • Monitor model version usage to track deployments and rollbacks.
  • Document model performance baselines and acceptable thresholds.
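
Detecting prediction drift from inference logs, as recommended above, is often done with the population stability index (PSI) between a baseline distribution and the current one. This is a minimal sketch; the bin counts are hypothetical, and the 0.2 alert threshold is a widely used convention rather than a Databricks default.

```python
import math

def population_stability_index(expected_counts, actual_counts):
    """PSI between baseline and current prediction distributions over
    the same bins; values above ~0.2 are commonly treated as drift."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, 1e-6)  # guard against empty bins
        a_pct = max(a / a_total, 1e-6)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

baseline = [500, 300, 200]  # binned predictions at deployment time
current = [480, 310, 210]   # binned predictions from recent inference logs
print(population_stability_index(baseline, current) < 0.2)  # True: stable
```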

Design third-party monitoring integration strategy

Integrate Azure Databricks with external monitoring solutions for centralized observability across your entire infrastructure. Design your integration strategy based on existing monitoring tools, operational requirements, and team expertise.

Third-party integration patterns

  • Centralized monitoring: Forward Azure Databricks metrics and logs to centralized monitoring platforms.
  • Multi-cloud observability: Use cloud-agnostic tools to monitor Azure Databricks across multiple clouds.
  • Custom dashboards: Build unified dashboards combining Azure Databricks and external system metrics.
  • Alerting integration: Route Azure Databricks alerts through existing incident management systems.
  • Compliance reporting: Aggregate logs for compliance and audit requirements.

Integration options

  • Datadog: Monitor cluster metrics, job runs, and application logs with Datadog integration.
  • Prometheus: Export cluster metrics to Prometheus for time-series monitoring and alerting.
  • Azure Monitor: Forward logs to Azure Log Analytics for Azure Databricks deployments.
  • Microsoft Sentinel: Integrate audit logs with Microsoft Sentinel (formerly Azure Sentinel) for security monitoring.

Best practices for third-party integrations

  • Use standard integrations (for example, Datadog, Prometheus) where available.
  • Forward logs to centralized logging platforms for long-term retention.
  • Correlate Azure Databricks metrics with infrastructure metrics for root cause analysis.
  • Implement consistent tagging across Azure Databricks and external systems.
  • Test alert routing and escalation procedures regularly.
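
The consistent-tagging practice above is easy to enforce with a small validation check. The required tag keys here are a hypothetical policy; substitute your organization's tagging standard.

```python
REQUIRED_TAGS = {"team", "environment", "cost_center"}  # hypothetical policy

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tags,
    so metrics and alerts can be correlated across Azure Databricks
    and external monitoring systems."""
    return sorted(REQUIRED_TAGS - resource_tags.keys())

print(missing_tags({"team": "data-eng", "environment": "prod"}))  # ['cost_center']
```

Running a check like this against cluster, job, and external-resource tags during CI or policy review keeps cross-system correlation reliable.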

Observability recommendations

Recommended

  • Enable system tables for all metastores to capture comprehensive usage data.
  • Create dashboards based on system tables for cost, performance, and security monitoring.
  • Configure job and pipeline monitoring with alerts for critical failures.
  • Enable Spark monitoring (query profile, Spark UI) for performance troubleshooting.
  • Enable Lakehouse Monitoring for critical production tables (gold layer).
  • Monitor model serving endpoints for latency, throughput, and error rates.
  • Integrate with third-party monitoring solutions for centralized observability.
  • Define SLAs and alert thresholds for critical workloads.
  • Document runbooks for common operational scenarios.

Evaluate based on requirements

  • Balance monitoring granularity with operational overhead and costs.
  • Consider third-party integrations only if centralized monitoring is required.
  • Evaluate real-time alerting versus batch monitoring based on SLA requirements.
  • Consider data quality monitoring costs (for example, storage, compute) for large tables.
  • Reduce alert fatigue by starting with conservative thresholds and refining them over time.

Phase 9 outcomes

After completing Phase 9, you should have:

  • Job and pipeline monitoring configured with alerts for critical failures.
  • Spark performance monitoring approach designed (query profile, Spark UI).
  • Model monitoring strategy designed for ML endpoints.
  • Third-party monitoring integration approach defined (if applicable).
  • SLAs and alert thresholds documented for critical workloads.
  • Operational runbooks created for common monitoring scenarios.

Next phase: Phase 10: Design high availability and disaster recovery

Implementation guidance: For step-by-step instructions to implement your observability strategy, see Monitoring and observability for Lakeflow Jobs.