Query external databases using the remote_query function

Important

This feature is in Public Preview.

The remote_query table-valued function (TVF) enables you to run SQL queries directly against external databases and data warehouses from within Azure Databricks, using the native SQL syntax of the remote system. This function provides a flexible alternative to query federation, allowing you to run queries written in the remote database's dialect without needing to translate them to Databricks SQL.

remote_query compared to query federation

The following comparison summarizes the key differences between the remote_query function and query federation:

Query syntax

  • remote_query function: Write queries using the native SQL dialect of the remote database (for example, Oracle PL/SQL, BigQuery SQL).
  • Query federation: Write queries using Databricks SQL syntax. Databricks translates and pushes down compatible operations to the remote database.

Use case

  • remote_query function:
      • You have existing SQL queries written in the remote database's dialect that you want to run without modification.
      • You need to use database-specific functions or syntax that might not be available in Databricks SQL.
      • You want ad hoc access to remote data without creating foreign catalogs.
  • Query federation:
      • You want to query external data using Databricks SQL syntax.
      • You need a long-term data access pattern with governed access through Unity Catalog foreign catalogs.
      • You want to combine data from multiple sources in a single query using consistent syntax.
      • You want to use the Genie assistant for writing queries.

Access control

  • remote_query function: Users need the USE CONNECTION privilege on the connection. Permission can be delegated through views.
  • Query federation: Users need table-level privileges on foreign catalog objects, which enables fine-grained access control.

Before you begin

Workspace requirements:

  • Workspace enabled for Unity Catalog.

Compute requirements:

  • Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See Networking recommendations for Lakehouse Federation.
  • Azure Databricks clusters must use Databricks Runtime 17.3 or above.
  • SQL warehouses must be Pro and use version 2025.35 or above.

Permissions required:

  • To create a connection, you must be a metastore admin or a user with the CREATE CONNECTION privilege on the Unity Catalog metastore.
  • To use the remote_query function, you must have the USE CONNECTION privilege on the connection or the SELECT privilege on a view that wraps the function. Single-user clusters also require the MANAGE permission on the connection.

Create a connection

To use the remote_query function, you first need to create a Unity Catalog connection to your external database. If you already have a connection created for query federation, you can reuse it.

The remote_query function supports the following connection types, each covered in Connector-specific options:

  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • Snowflake
  • SQL Server
  • Teradata

For information about managing existing connections, see Manage connections for Lakehouse Federation.
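
If you don't have a connection yet, you can create one in SQL. The following sketch creates a PostgreSQL connection; the connection name, host, and credentials are placeholders that you replace with values for your environment:

```sql
-- Hypothetical PostgreSQL connection; replace the name, host,
-- and credentials with values for your environment.
CREATE CONNECTION my_postgres_connection TYPE postgresql
OPTIONS (
  host 'pg.example.com',
  port '5432',
  user 'my_user',
  password 'my_password'
);
```

For anything beyond testing, prefer referencing credentials from a secret scope rather than embedding them inline.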

Grant connection access

To use the remote_query function, you must have the USE CONNECTION privilege on the connection (or the MANAGE privilege on single-user clusters).

GRANT USE CONNECTION ON CONNECTION <connection-name> TO <user-or-group>;

Use the remote_query function

The remote_query function runs a query on the remote database and returns the results as a table that you can use in Databricks SQL queries.

Syntax

SELECT * FROM remote_query(
  '<connection-name>',
  <option-key> => '<option-value>'
  [, <option-key> => '<option-value>' ...]
)

Required parameters

  • connection-name: The name of the Unity Catalog connection to use.

All other required parameters vary by connection type. See Connector-specific options for details.

Connector-specific options

The available options vary by connection type. The following tables describe the options for each connector.

MySQL, PostgreSQL, SQL Server, Redshift, and Teradata

Parameter Required Description
database Yes The database name on the remote system.
query Yes (or dbtable) A SQL query string to run on the remote database. Cannot be used with dbtable.
dbtable Yes (or query) The table name to query. Cannot be used with query.
fetchsize No The number of rows to fetch per round trip. Larger values can improve performance but use more memory. Default: 0 (use driver default).
partitionColumn No A column with uniformly distributed values to use for parallel data fetching. Must be used with lowerBound, upperBound, and numPartitions. Cannot be used with the query option.
lowerBound No The minimum value of the partition column. Must be used with partitionColumn, upperBound, and numPartitions.
upperBound No The maximum value of the partition column. Must be used with partitionColumn, lowerBound, and numPartitions.
numPartitions No The number of parallel connections to use for fetching data. Avoid high values (for example, in the hundreds), because each partition opens its own connection to the remote database. Must be used with partitionColumn, lowerBound, and upperBound.

Note

When you use partition parameters, all four parameters (partitionColumn, lowerBound, upperBound, numPartitions) must be specified together, and you must use the dbtable option instead of query.

Oracle

Parameter Required Description
serviceName Yes The Oracle service name (used instead of database).
query Yes (or dbtable) A SQL query string to run on the remote database. Cannot be used with dbtable.
dbtable Yes (or query) The table name to query. Cannot be used with query.
fetchsize No The number of rows to fetch per round trip. Larger values can improve performance but use more memory. Default: 0 (use driver default).
partitionColumn No A column with uniformly distributed values to use for parallel data fetching. Must be used with lowerBound, upperBound, and numPartitions. Cannot be used with the query option.
lowerBound No The minimum value of the partition column. Must be used with partitionColumn, upperBound, and numPartitions.
upperBound No The maximum value of the partition column. Must be used with partitionColumn, lowerBound, and numPartitions.
numPartitions No The number of parallel connections to use for fetching data. Avoid high values (for example, in the hundreds), because each partition opens its own connection to the remote database. Must be used with partitionColumn, lowerBound, and upperBound.

Note

When you use partition parameters, all four parameters (partitionColumn, lowerBound, upperBound, numPartitions) must be specified together, and you must use the dbtable option instead of query.

Snowflake

Parameter Required Description
database Yes The database name in Snowflake.
query Yes (or dbtable) A SQL query string to run on the remote database. Cannot be used with dbtable.
dbtable Yes (or query) The table name to query (single-part name or multi-part name). Cannot be used with query.
schema No The schema name in Snowflake. Default: public.
query_timeout No The query timeout in seconds. Default: 0 (no timeout).
partition_size_in_mb No The expected partition size in megabytes for parallel data fetching. Default: 100 MB.
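
As a sketch of how these options combine (the connection, database, schema, and table names below are hypothetical), the following call reads a table from a non-default schema with a query timeout and a larger partition size:

```sql
SELECT * FROM remote_query(
  'my_snowflake_connection',       -- hypothetical connection name
  database => 'ANALYTICS_DB',
  schema => 'REPORTING',           -- overrides the default "public" schema
  dbtable => 'DAILY_SALES',
  query_timeout => '300',          -- time out the query after 5 minutes
  partition_size_in_mb => '200'    -- fetch in roughly 200 MB partitions
);
```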

Additional pushdown control options

You can combine the remote_query function with additional Databricks SQL operations, most of which can be pushed down to the remote database. The following options control which operations are pushed down. They apply to all connection types, and the option names are case-insensitive.

Parameter Default Description
pushdown.limit.enabled true Enable or disable pushing down LIMIT clauses to the remote database.
pushdown.offset.enabled true Enable or disable pushing down OFFSET clauses to the remote database.
pushdown.filters.enabled true Enable or disable pushing down WHERE filters to the remote database.
pushdown.aggregates.enabled true Enable or disable pushing down aggregate functions (COUNT, SUM, AVG, MAX, MIN) to the remote database.
pushdown.sortLimit.enabled true Enable or disable pushing down top-N queries (combination of ORDER BY and LIMIT) to the remote database.

By default, all pushdowns are enabled. You can disable specific pushdowns if needed for troubleshooting or to work around compatibility issues with specific remote databases.

Delegate access through views

You can delegate access to remote data without granting users direct USE CONNECTION privileges by wrapping the remote_query function in a view. This approach has the following benefits:

  • Simplified access control: Grant SELECT privilege on the view instead of managing USE CONNECTION privileges.
  • Data security: Control which columns and rows users can access by defining the view query.
  • Track lineage: Track data access through view lineage rather than direct connection usage.

To delegate access through a view:

  1. Create a view that calls the remote_query function:

    CREATE VIEW sales_data_view AS
    SELECT * FROM remote_query(
      'my_connection',
      database => 'sales_db',
      query => 'SELECT region, product, revenue FROM sales'
    );
    
  2. Grant SELECT privilege on the view to users or groups:

    GRANT SELECT ON VIEW sales_data_view TO <user-or-group>;
    
  3. Users can now query the view without needing the USE CONNECTION privilege:

    SELECT * FROM sales_data_view WHERE region = 'US';
    

Important

The view owner must have USE CONNECTION privilege on the connection. When users query the view, the connection access check is performed using the view owner's privileges, not the querying user's privileges.
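
If queries against such a view fail with a connection permission error, checking and, if necessary, reassigning the view owner usually resolves it. A sketch, using the hypothetical sales_data_view from the steps above:

```sql
-- Show view metadata, including the current owner.
DESCRIBE TABLE EXTENDED sales_data_view;

-- Reassign ownership to a principal that holds USE CONNECTION
-- on the underlying connection.
ALTER VIEW sales_data_view SET OWNER TO `data-engineering-admins`;
```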

Examples

Basic query execution

Execute a query on a PostgreSQL database:

SELECT * FROM remote_query(
  'my_postgres_connection',
  database => 'sales_db',
  query => 'SELECT * FROM orders WHERE order_date > CURRENT_DATE - INTERVAL \'30 days\''
);

Query a specific table

Query a MySQL table directly:

SELECT * FROM remote_query(
  'my_mysql_connection',
  database => 'inventory',
  dbtable => 'my_schema.products'
);

Oracle with service name

Query an Oracle database:

SELECT * FROM remote_query(
  'my_oracle_connection',
  serviceName => 'ORCL',
  query => 'SELECT * FROM customers WHERE ROWNUM <= 1000'
);

Snowflake query

Query Snowflake:

SELECT * FROM remote_query(
  'my_snowflake_connection',
  database => 'ANALYTICS_DB',
  query => 'SELECT * FROM SALES WHERE SALE_DATE >= DATEADD(day, -30, CURRENT_DATE())'
);

Performance tuning with partitioning

Fetch data in parallel from a SQL Server table:

SELECT * FROM remote_query(
  'my_sqlserver_connection',
  database => 'sales',
  dbtable => 'transactions',
  partitionColumn => 'transaction_id',
  lowerBound => '0',
  upperBound => '1000000',
  numPartitions => '10',
  fetchsize => '1000'
);

Combine with Databricks SQL operations

Apply additional filters and transformations:

SELECT customer_id, SUM(amount) as total_amount
FROM remote_query(
  'my_postgres_connection',
  database => 'orders_db',
  query => 'SELECT customer_id, amount, order_date FROM orders'
)
WHERE order_date >= '2025-01-01'
GROUP BY customer_id
HAVING total_amount > 1000
ORDER BY total_amount DESC
LIMIT 100;

Create a view for delegated access

Create a view that wraps the remote_query function. Users with SELECT privilege on the view can query the data without needing USE CONNECTION privilege on the underlying connection:

CREATE VIEW sales_summary AS
SELECT * FROM remote_query(
  'my_mysql_connection',
  database => 'sales',
  query => 'SELECT region, product, SUM(revenue) as total_revenue FROM sales_data GROUP BY region, product'
);

GRANT SELECT ON VIEW sales_summary TO <user-or-group>;

Control pushdown behavior

When you use the remote_query function, Databricks can push down additional operations to the remote database beyond the query you specify. This feature is useful when you query a view that uses the remote_query function.

The following operations can be pushed down:

  • Filters: WHERE clauses applied to the result of the remote query
  • Projections: Column selection (SELECT specific columns)
  • Limit: LIMIT clauses to restrict the number of rows returned
  • Offset: OFFSET clauses to skip rows
  • Aggregates: Aggregation functions like COUNT, SUM, AVG, MAX, MIN
  • Top-N: Combination of ORDER BY and LIMIT for top/bottom N queries

Pushdown support varies by data source. See the documentation for your specific connection type for details.

Disable specific pushdowns for troubleshooting or compatibility:

SELECT * FROM remote_query(
  'my_postgres_connection',
  database => 'analytics',
  query => 'SELECT * FROM complex_view',
  `pushdown.aggregates.enabled` => 'false',
  `pushdown.filters.enabled` => 'false'
);

Limitations

  • Read-only operations: The remote_query function only supports SELECT queries. Data modification operations (INSERT, UPDATE, DELETE, MERGE), DDL operations (CREATE, DROP, ALTER), and stored procedures are not supported.

  • Query validation: The query you provide is executed directly on the remote database. Databricks validates that the query is read-only by performing schema inspection, but syntax and semantic validation are performed by the remote database.

Troubleshooting

Permission errors

If you receive a permission error, verify that:

  1. You have the USE CONNECTION privilege on the connection or the SELECT privilege on a view that wraps the function.
  2. The credentials in the connection have appropriate permissions on the remote database.

Example error:

PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'my_connection'

Resolution:

GRANT USE CONNECTION ON CONNECTION my_connection TO <user-or-group>;

Unsupported parameters

If you receive an error about unsupported parameters, check that you are using the correct parameters for your connection type. The error message lists the allowed parameters.

Example error:

REMOTE_QUERY_FUNCTION_UNSUPPORTED_CONNECTOR_PARAMETERS: The following parameters are not supported for connection type 'postgresql': 'materializationDataset'. Allowed parameters for this connection type are: 'database', 'query', 'dbtable', 'fetchsize', 'partitionColumn', 'lowerBound', 'upperBound', 'numPartitions'.

Resolution: Remove the unsupported parameter and use the correct parameters for your connection type.

DML operations not supported

The remote_query function only supports read-only SELECT queries.

Example error:

DML_OPERATIONS_NOT_SUPPORTED_FOR_REMOTE_QUERY_FUNCTION: Data modification operations are not supported in remote_query function.

Resolution: Remove any INSERT, UPDATE, DELETE, or DDL statements from your query. Only use SELECT statements.

Additional resources