Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Azure Databricks provides access to the TPC-DS benchmark dataset, a widely used benchmark for testing the performance of systems built for data warehousing and analytics. The dataset is available in two sizes by default in every Unity Catalog-enabled workspace. These datasets are ideal for testing Azure Databricks performance on a standardized benchmark that simulates realistic retail and e-commerce business scenarios. To learn more about this dataset, see the TPC-DS benchmark documentation.
What's included
The TPC-DS datasets are available in the samples
catalog with the following schemas:
tpcds_sf1
— Small-scale dataset (approximately 1 GB)tpcds_sf1000
— Large-scale dataset (approximately 1 TB)
Both datasets share the following qualities:
- Are read-only and queryable by all users in the workspace
- Are compatible with SQL warehouses and all-purpose clusters
- Follow the TPC-DS specification for standardized benchmarking
Prerequisites
You must have access to a SQL warehouse or an all-purpose cluster.
Preview the data
To explore the data in the Catalog Explorer UI:
Click
Catalog in the sidebar.
Enter tpcds in the search bar. Both schemas are in the
samples
catalog. Click the name of the schema that you want to view.The Overview tab lists all tables in the schema. Click a table name to open an overview of the columns and data types in that table.
Use the top navigation to view the table's Sample Data or Details.
Query the data
The following queries use the smaller scale dataset, tpcds_sf1
. To use the larger scale dataset, replace the schema name with tpcds_sf1000
. Click SQL Editor in the sidebar to open the SQL editor. Then, use the following queries to start exploring the data.
Preview tables
SHOW TABLES IN samples.tpcds_sf1;
Explore a table
DESCRIBE TABLE samples.tpcds_sf1.customer;
SELECT * FROM samples.tpcds_sf1.customer LIMIT 10;
Sample join and aggregation
SELECT
i_category,
d_year,
SUM(ss_net_paid) AS total_revenue
FROM samples.tpcds_sf1.store_sales ss
JOIN samples.tpcds_sf1.item i ON ss.ss_item_sk = i.i_item_sk
JOIN samples.tpcds_sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2001
GROUP BY i_category, d_year
ORDER BY total_revenue DESC
LIMIT 10;
Best practices
- Use the query history and query profile to understand performance characteristics and identify optimization opportunities.
- Start with the smaller
tpcds_sf1
dataset for initial testing, then scale up totpcds_sf1000
for comprehensive performance evaluation. - Compare query performance across different SQL warehouse sizes to determine optimal configurations for your workloads.
- Use these standardized datasets to establish performance baselines and track improvements over time.