Subqueries in Azure Cosmos DB for NoSQL
APPLIES TO: NoSQL
A subquery is a query nested within another query within Azure Cosmos DB for NoSQL. A subquery is also called an inner query or inner SELECT
. The statement that contains a subquery is typically called an outer query.
Types of subqueries
There are two main types of subqueries:
- Correlated: A subquery that references values from the outer query. The subquery is evaluated once for each row that the outer query processes.
- Non-correlated: A subquery that's independent of the outer query. It can be run on its own without relying on the outer query.
Note
Azure Cosmos DB supports only correlated subqueries.
Subqueries can be further classified based on the number of rows and columns that they return. There are three types:
- Table: Returns multiple rows and multiple columns.
- Multi-value: Returns multiple rows and a single column.
- Scalar: Returns a single row and a single column.
Queries in Azure Cosmos DB for NoSQL always return a single column (either a simple value or a complex item). Therefore, only multi-value and scalar subqueries are applicable. You can use a multi-value subquery only in the FROM
clause as a relational expression. You can use a scalar subquery as a scalar expression in the SELECT
or WHERE
clause, or as a relational expression in the FROM
clause.
Multi-value subqueries
Multi-value subqueries return a set of items and are always used within the FROM
clause. They're used for:
- Optimizing
JOIN
(self-join) expressions. - Evaluating expensive expressions once and referencing multiple times.
Optimize self-join expressions
Multi-value subqueries can optimize JOIN
expressions by pushing predicates after each select-many expression rather than after all cross-joins in the WHERE
clause.
Consider the following query:
SELECT VALUE
COUNT(1)
FROM
products p
JOIN
t in p.tags
JOIN
q in p.onHandQuantities
JOIN
s in p.warehouseStock
WHERE
t.name IN ("winter", "fall") AND
(q.quantity BETWEEN 0 AND 10) AND
NOT s.backstock
For this query, the index matches any item that has a tag with a name
of either "winter" or "fall", at least one quantity
between zero and ten, and at least one warehouse where the backstock
is false
. The JOIN
expression here performs the cross-product of all items of tags
, onHandQuantities
, and warehouseStock
arrays for each matching item before any filter is applied.
The WHERE
clause then applies the filter predicate on each <c, t, n, s>
tuple. For instance, if a matching item had ten items in each of the three arrays, it expands to 1 x 10 x 10 x 10
(that is, 1,000) tuples. Using subqueries here can help in filtering out joined array items before joining with the next expression.
This query is equivalent to the preceding one but uses subqueries:
SELECT VALUE
COUNT(1)
FROM
products p
JOIN
(SELECT VALUE t FROM t IN p.tags WHERE t.name IN ("winter", "fall"))
JOIN
(SELECT VALUE q FROM q IN p.onHandQuantities WHERE q.quantity BETWEEN 0 AND 10)
JOIN
(SELECT VALUE s FROM s IN p.warehouseStock WHERE NOT s.backstock)
Assume that only one item in the tags array matches the filter, and there are five items for both quantity and stock arrays. The JOIN
expressions then expand to 1 x 1 x 5 x 5
(25) items, as opposed to 1,000 items in the first query.
Evaluate once and reference many times
Subqueries can help optimize queries with expensive expressions such as user-defined functions (UDFs), complex strings, or arithmetic expressions. You can use a subquery along with a JOIN
expression to evaluate the expression once but reference it many times.
Let's assume that you have the following UDF (getTotalWithTax
) defined.
function getTotalWithTax(subTotal){
return subTotal * 1.25;
}
The following query runs the UDF getTotalWithTax
multiple times:
SELECT VALUE {
subtotal: p.price,
total: udf.getTotalWithTax(p.price)
}
FROM
products p
WHERE
udf.getTotalWithTax(p.price) < 22.25
Here's an equivalent query that runs the UDF only once:
SELECT VALUE {
subtotal: p.price,
total: totalPrice
}
FROM
products p
JOIN
(SELECT VALUE udf.getTotalWithTax(p.price)) totalPrice
WHERE
totalPrice < 22.25
Tip
Keep in mind the cross-product behavior of JOIN
expressions. If the UDF expression can evaluate to undefined
, you should ensure that the JOIN
expression always produces a single row by returning an object from the subquery rather than the value directly.
Mimic join with external reference data
You might often need to reference static data that rarely changes, such as units of measurement. It's ideal to not duplicate static data for each item in a query. Avoiding this duplication saves on storage and improve write performance by keeping the individual item size smaller. You can use a subquery to mimic inner-join semantics with a collection of static reference data.
For instance, consider this set of measurements:
Name | Multiplier | Base unit | |
---|---|---|---|
ng |
Nanogram | 1.00E-09 |
Gram |
µg |
Microgram | 1.00E-06 |
Gram |
mg |
Milligram | 1.00E-03 |
Gram |
g |
Gram | 1.00E+00 |
Gram |
kg |
Kilogram | 1.00E+03 |
Gram |
Mg |
Megagram | 1.00E+06 |
Gram |
Gg |
Gigagram | 1.00E+09 |
Gram |
The following query mimics joining with this data so that you add the name of the unit to the output:
SELECT
s.id,
(s.weight.quantity * m.multiplier) AS calculatedWeight,
m.unit AS unitOfWeight
FROM
shipments s
JOIN m IN (
SELECT VALUE [
{unit: 'ng', name: 'nanogram', multiplier: 0.000000001, baseUnit: 'gram'},
{unit: 'µg', name: 'microgram', multiplier: 0.000001, baseUnit: 'gram'},
{unit: 'mg', name: 'milligram', multiplier: 0.001, baseUnit: 'gram'},
{unit: 'g', name: 'gram', multiplier: 1, baseUnit: 'gram'},
{unit: 'kg', name: 'kilogram', multiplier: 1000, baseUnit: 'gram'},
{unit: 'Mg', name: 'megagram', multiplier: 1000000, baseUnit: 'gram'},
{unit: 'Gg', name: 'gigagram', multiplier: 1000000000, baseUnit: 'gram'}
]
)
WHERE
s.weight.units = m.unit
Scalar subqueries
A scalar subquery expression is a subquery that evaluates to a single value. The value of the scalar subquery expression is the value of the projection (SELECT
clause) of the subquery. You can use a scalar subquery expression in many places where a scalar expression is valid. For instance, you can use a scalar subquery in any expression in both the SELECT
and WHERE
clauses.
Using a scalar subquery doesn't always help optimize your query. For example, passing a scalar subquery as an argument to either a system or user-defined functions provides no benefit in reducing resource unit (RU) consumption or latency.
Scalar subqueries can be further classified as:
- Simple-expression scalar subqueries
- Aggregate scalar subqueries
Simple-expression scalar subqueries
A simple-expression scalar subquery is a correlated subquery that has a SELECT
clause that doesn't contain any aggregate expressions. These subqueries provide no optimization benefits because the compiler converts them into one larger simple expression. There's no correlated context between the inner and outer queries.
As a first example, consider this trivial query.
SELECT
1 AS a,
2 AS b
You can rewrite this query, by using a simple-expression scalar subquery.
SELECT
(SELECT VALUE 1) AS a,
(SELECT VALUE 2) AS b
Both queries produce the same output.
[
{
"a": 1,
"b": 2
}
]
This next example query concatenates the unique identifier with a prefix as a simple-expression scalar subquery.
SELECT
(SELECT VALUE Concat('ID-', p.id)) AS internalId
FROM
products p
This example uses a simple-expression scalar subquery to only return the relevant fields for each item. The query outputs something for each item, but it only includes the projected field if it meets the filter within the subquery.
SELECT
p.id,
(SELECT p.name WHERE CONTAINS(p.name, "glove")).name
FROM
products p
[
{
"id": "03230",
"name": "Winter glove"
},
{
"id": "03238"
},
{
"id": "03229"
}
]
Aggregate scalar subqueries
An aggregate scalar subquery is a subquery that has an aggregate function in its projection or filter that evaluates to a single value.
As a first example, consider an item with the following fields.
{
"name": "Snow coat",
"inventory": [
{
"location": "Redmond, WA",
"quantity": 50
},
{
"location": "Seattle, WA",
"quantity": 30
},
{
"location": "Washington, DC",
"quantity": 25
}
]
}
Here's a subquery with a single aggregate function expression in its projection. This query counts all tags for each item.
SELECT
p.name,
(SELECT VALUE COUNT(1) FROM i IN p.inventory) AS locationCount
FROM
products p
[
{
"name": "Snow coat",
"locationCount": 3
}
]
Here's the same subquery with a filter.
SELECT
p.name,
(SELECT VALUE COUNT(1) FROM i IN p.inventory WHERE ENDSWITH(i.location, "WA")) AS washingtonLocationCount
FROM
products p
[
{
"name": "Snow coat",
"washingtonLocationCount": 2
}
]
Here's another subquery with multiple aggregate function expressions:
SELECT
p.name,
(SELECT
COUNT(1) AS locationCount,
SUM(i.quantity) AS totalQuantity
FROM i IN p.inventory) AS inventoryData
FROM
products p
[
{
"name": "Snow coat",
"inventoryData": {
"locationCount": 2,
"totalQuantity": 75
}
}
]
Finally, here's a query with an aggregate subquery in both the projection and the filter:
SELECT
p.name,
(SELECT VALUE AVG(q.quantity) FROM q IN p.inventory WHERE q.quantity > 10) AS averageInventory
FROM
products p
WHERE
(SELECT VALUE COUNT(1) FROM i IN p.inventory WHERE i.quantity > 10) >= 1
[
{
"name": "Snow coat",
"averageInventory": 35
}
]
A more optimal way to write this query is to join on the subquery and reference the subquery alias in both the SELECT and WHERE clauses. This query is more efficient because you need to execute the subquery only within the join statement, and not in both the projection and filter.
SELECT
p.name,
inventoryData.inventoryAverage
FROM
products p
JOIN
(SELECT
COUNT(1) AS inventoryCount,
AVG(i.quantity) as inventoryAverage
FROM i IN p.inventory
WHERE i.quantity > 10) AS inventoryData
WHERE
inventoryData.inventoryCount >= 1
EXISTS expression
Azure Cosmos DB for NoSQL's query engine supports EXISTS
expressions. This expression is an aggregate scalar subquery built into the Azure Cosmos DB for NoSQL. EXISTS
takes a subquery expression and returns true
if the subquery returns any rows. Otherwise, it returns false
.
Because the query engine doesn't differentiate between boolean expressions and any other scalar expressions, you can use EXISTS
in both SELECT
and WHERE
clauses. This behavior is unlike T-SQL, where a boolean expression is restricted to only filters.
If the EXISTS
subquery returns a single value that's undefined
, EXISTS
evaluates to false. For example, consider the following query that returns nothing.
SELECT VALUE
undefined
If you use the EXISTS
expression and the preceding query as a subquery, the expression returns false
.
SELECT
EXISTS (SELECT VALUE undefined)
[
{
"$1": false
}
]
If the VALUE keyword in the preceding subquery is omitted, the subquery evaluates to an array with a single empty object.
SELECT
undefined
[
{}
]
At that point, the EXISTS
expression evaluates to true
since the object ({}
) technically exits.
SELECT
EXISTS (SELECT undefined)
[
{
"$1": true
}
]
A common use case of ARRAY_CONTAINS
is to filter an item by the existence of an item in an array. In this case, we're checking to see if the tags
array contains an item named "outerwear."
SELECT
p.name,
p.tags
FROM
products p
WHERE
ARRAY_CONTAINS(p.tags, "outerwear")
The same query can use EXISTS
as an alternative option.
SELECT
p.name,
p.tags
FROM
products p
WHERE
EXISTS (SELECT VALUE t FROM t IN p.tags WHERE t = "outerwear")
Additionally, ARRAY_CONTAINS
can only check if a value is equal to any element within an array. If you need more complex filters on array properties, use JOIN
instead.
Consider this example item in a set with multiple items each containing an accessories
array.
{
"name": "Unobtani road bike",
"accessories": [
{
"name": "Front/rear tire",
"type": "tire",
"quantityOnHand": 5
},
{
"name": "9-speed chain",
"type": "chains",
"quantityOnHand": 25
},
{
"name": "Clip-in pedals",
"type": "pedals",
"quantityOnHand": 15
}
]
}
Now, consider the following query that filters based on the type
and quantityOnHand
properties in the array within each item.
SELECT
p.name,
a.name AS accessoryName
FROM
products p
JOIN
a IN p.accessories
WHERE
a.type = "chains" AND
a.quantityOnHand >= 10
[
{
"name": "Unobtani road bike",
"accessoryName": "9-speed chain"
}
]
For each of the items in the collection, a cross-product is performed with its array elements. This JOIN
operation makes it possible to filter on properties within the array. However, this query's RU consumption is significant. For instance, if 1,000 items had 100 items in each array, it expands to 1,000 x 100
(that is, 100,000) tuples.
Using EXISTS
can help to avoid this expensive cross-product. In this next example, the query filters on array elements within the EXISTS
subquery. If an array element matches the filter, then you project it and EXISTS
evaluates to true.
SELECT VALUE
p.name
FROM
products p
WHERE
EXISTS (SELECT VALUE
a
FROM
a IN p.accessories
WHERE
a.type = "chains" AND
a.quantityOnHand >= 10)
[
"Unobtani road bike"
]
Queries can also alias EXISTS
and reference the alias in the projection:
SELECT
p.name,
EXISTS (SELECT VALUE
a
FROM
a IN p.accessories
WHERE
a.type = "chains" AND
a.quantityOnHand >= 10) AS chainAccessoryAvailable
FROM
products p
[
{
"name": "Unobtani road bike",
"chainAccessoryAvailable": true
}
]
ARRAY expression
You can use the ARRAY
expression to project the results of a query as an array. You can use this expression only within the SELECT
clause of the query.
For these examples, let's assume there's a container with at least this item.
{
"name": "Radimer mountain bike",
"tags": [
{
"name": "road"
},
{
"name": "bike"
},
{
"name": "competitive"
}
]
}
In this first example, the expression is used within the SELECT
clause.
SELECT
p.name,
ARRAY (SELECT VALUE t.name FROM t in p.tags) AS tagNames
FROM
products p
[
{
"name": "Radimer mountain bike",
"tagNames": [
"road",
"bike",
"competitive"
]
}
]
As with other subqueries, filters with the ARRAY
expression are possible.
SELECT
p.name,
ARRAY (SELECT VALUE t.name FROM t in p.tags) AS tagNames,
ARRAY (SELECT VALUE t.name FROM t in p.tags WHERE CONTAINS(t.name, "bike")) AS bikeTagNames
FROM
products p
[
{
"name": "Radimer mountain bike",
"tagNames": [
"road",
"bike",
"competitive"
],
"bikeTagNames": [
"bike"
]
}
]
Array expressions can also come after the FROM
clause in subqueries.
SELECT
p.name,
n.t.name AS nonBikeTagName
FROM
products p
JOIN
n IN (SELECT VALUE ARRAY(SELECT t FROM t in p.tags WHERE t.name NOT LIKE "%bike%"))
[
{
"name": "Radimer mountain bike",
"nonBikeTagName": "road"
},
{
"name": "Radimer mountain bike",
"nonBikeTagName": "competitive"
}
]