Query JSON strings

This article describes the Databricks SQL operators you can use to query and transform semi-structured data stored as JSON strings.

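Note

This feature lets you read semi-structured data without flattening the files. However, for optimal read query performance, Databricks recommends that you extract nested columns with the correct data types.

You extract a column from a field containing JSON strings using the syntax <column-name>:<extraction-path>, where <column-name> is the string column name and <extraction-path> is the path to the field to extract. The returned results are strings.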

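Create a table with highly nested data

Run the following query to create a table with highly nested data. The examples in this article all reference this table.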

CREATE TABLE store_data AS SELECT
'{
   "store":{
      "fruit": [
        {"weight":8,"type":"apple"},
        {"weight":9,"type":"pear"}
      ],
      "basket":[
        [1,2,{"b":"y","a":"x"}],
        [3,4],
        [5,6]
      ],
      "book":[
        {
          "author":"Nigel Rees",
          "title":"Sayings of the Century",
          "category":"reference",
          "price":8.95
        },
        {
          "author":"Herman Melville",
          "title":"Moby Dick",
          "category":"fiction",
          "price":8.99,
          "isbn":"0-553-21311-3"
        },
        {
          "author":"J. R. R. Tolkien",
          "title":"The Lord of the Rings",
          "category":"fiction",
          "reader":[
            {"age":25,"name":"bob"},
            {"age":26,"name":"jack"}
          ],
          "price":22.99,
          "isbn":"0-395-19395-8"
        }
      ],
      "bicycle":{
        "price":19.95,
        "color":"red"
      }
    },
    "owner":"amy",
    "zip code":"94025",
    "fb:testid":"1234"
 }' as raw

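Extract a top-level column

To extract a column, specify the name of the JSON field in your extraction path.

You can provide column names within brackets. Columns referenced inside brackets are matched case sensitively. The column name itself (raw in these examples) is referenced case insensitively, and field names written without brackets are matched case insensitively as well.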

SELECT raw:owner, RAW:owner FROM store_data

+-------+-------+
| owner | owner |
+-------+-------+
| amy   | amy   |
+-------+-------+

-- References are case sensitive when you use brackets
SELECT raw:OWNER case_insensitive, raw:['OWNER'] case_sensitive FROM store_data

+------------------+----------------+
| case_insensitive | case_sensitive |
+------------------+----------------+
| amy              | null           |
+------------------+----------------+

Use backticks to escape spaces and special characters. Field names in backticks are matched case insensitively.

-- Use backticks to escape special characters. References are case insensitive when you use backticks.
-- Use brackets to make them case sensitive.
SELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data

+----------+----------+-----------+
| zip code | Zip Code | fb:testid |
+----------+----------+-----------+
| 94025    | 94025    | 1234      |
+----------+----------+-----------+

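Note

If a JSON record contains multiple columns that match your extraction path because of case-insensitive matching, you receive an error asking you to use brackets. If the matching columns occur in different rows, you do not receive an error. For example, {"foo":"bar", "Foo":"bar"} raises an error, but the following does not: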

{"foo":"bar"}
{"Foo":"bar"}
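As a minimal sketch of the bracket workaround that the error message asks for, the query below reads both keys from a single record that contains foo and Foo. The inline JSON literal and the column aliases are illustrative assumptions, not data from store_data. Because bracket references are matched case sensitively, each reference should resolve to exactly one key.

```sql
-- Hypothetical record with two keys that differ only by case.
-- Bracket references are case sensitive, so neither reference is ambiguous.
SELECT
  '{"foo":"bar", "Foo":"baz"}':['foo'] AS lower_foo,
  '{"foo":"bar", "Foo":"baz"}':['Foo'] AS upper_foo
```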

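Extract nested fields

You specify nested fields through dot notation or using brackets. When you use brackets, columns are matched case sensitively.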

-- Use dot notation
SELECT raw:store.bicycle FROM store_data
-- the column returned is a string

+------------------+
| bicycle          |
+------------------+
| {                |
|   "price":19.95, |
|   "color":"red"  |
| }                |
+------------------+

-- Use brackets
SELECT raw:store['bicycle'], raw:store['BICYCLE'] FROM store_data

+------------------+---------+
| bicycle          | BICYCLE |
+------------------+---------+
| {                | null    |
|   "price":19.95, |         |
|   "color":"red"  |         |
| }                |         |
+------------------+---------+
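The two notations can also be chained to reach a leaf value. The following is a small sketch against the store_data table; the column aliases are assumptions added for readability. All three expressions are expected to address the same color field, and each result is returned as a string.

```sql
-- Reach a leaf value with dot notation, brackets, or a mix of both.
-- Bracket segments are case sensitive; dot segments are not.
SELECT
  raw:store.bicycle.color       AS color_dot,
  raw:store['bicycle']['color'] AS color_brackets,
  raw:store['bicycle'].color    AS color_mixed
FROM store_data
```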

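Extract values from arrays

You index elements in arrays with brackets. Indices are 0-based. You can use an asterisk (*) followed by dot or bracket notation to extract subfields from all elements in an array.

Note

The [*] syntax is valid only in JSON path expressions, after the : operator on a JSON string column. It is not supported on the following:

Native ARRAY columns. Applying [*] to an array column returns the error [INVALID_USAGE_OF_STAR_OR_REGEX]. To extract a field from each element of an array of structs, use array_column.field_name, transform, or explode instead.
VARIANT columns. See the article on querying variant data.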
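To illustrate the alternatives for native ARRAY columns, the sketch below first parses the fruit array into an array of structs with from_json and then extracts the type field from every element without [*]. The CTE name parsed, the schema string, and the aliases are assumptions made for this example.

```sql
-- Parse the JSON array into a native ARRAY<STRUCT<...>> column first.
WITH parsed AS (
  SELECT from_json(
    raw:store.fruit,
    'array<struct<weight:int, type:string>>'
  ) AS fruit
  FROM store_data
)
SELECT
  fruit.type                    AS types_by_field_access, -- array_column.field_name
  transform(fruit, f -> f.type) AS types_by_transform     -- higher-order function
FROM parsed
```

Use explode(fruit) instead when you want one output row per array element rather than a single array value.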

-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data

+------------------+-----------------+
| fruit            | fruit           |
+------------------+-----------------+
| {                | {               |
|   "weight":8,    |   "weight":9,   |
|   "type":"apple" |   "type":"pear" |
| }                | }               |
+------------------+-----------------+

-- Extract subfields from arrays
SELECT raw:store.book[*].isbn FROM store_data

+--------------------+
| isbn               |
+--------------------+
| [                  |
|   null,            |
|   "0-553-21311-3", |
|   "0-395-19395-8"  |
| ]                  |
+--------------------+

-- Access arrays within arrays or structs within arrays
SELECT
    raw:store.basket[*],
    raw:store.basket[*][0] first_of_baskets,
    raw:store.basket[0][*] first_basket,
    raw:store.basket[*][*] all_elements_flattened,
    raw:store.basket[0][2].b subfield
FROM store_data

+----------------------------+------------------+---------------------+---------------------------------+----------+
| basket                     | first_of_baskets | first_basket        | all_elements_flattened          | subfield |
+----------------------------+------------------+---------------------+---------------------------------+----------+
| [                          | [                | [                   | [1,2,{"b":"y","a":"x"},3,4,5,6] | y        |
|   [1,2,{"b":"y","a":"x"}], |   1,             |   1,                |                                 |          |
|   [3,4],                   |   3,             |   2,                |                                 |          |
|   [5,6]                    |   5              |   {"b":"y","a":"x"} |                                 |          |
| ]                          | ]                | ]                   |                                 |          |
+----------------------------+------------------+---------------------+---------------------------------+----------+
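Indexing and [*] can be combined with nested field access in one path. The following sketch, with aliases chosen only for illustration, picks a field from a single array element, collects one field across all elements, and descends into an array nested inside an element. Each column is returned as a string.

```sql
SELECT
  raw:store.book[0].title          AS first_title,   -- one element, one field
  raw:store.book[*].price          AS all_prices,    -- one field from every element
  raw:store.book[2].reader[*].name AS reader_names   -- array nested inside an element
FROM store_data
```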

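Cast values

You can use :: to cast values to basic data types. Use the from_json function to cast nested results into more complex data types, such as arrays or structs.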

-- price is returned as a double, not a string
SELECT raw:store.bicycle.price::double FROM store_data

+------------------+
| price            |
+------------------+
| 19.95            |
+------------------+

-- use from_json to cast into more complex types
SELECT from_json(raw:store.bicycle, 'price double, color string') bicycle FROM store_data
-- the column returned is a struct containing the columns price and color

+------------------+
| bicycle          |
+------------------+
| {                |
|   "price":19.95, |
|   "color":"red"  |
| }                |
+------------------+

SELECT from_json(raw:store.basket[*], 'array<array<string>>') baskets FROM store_data
-- the column returned is an array of string arrays

+------------------------------------------+
| baskets                                  |
+------------------------------------------+
| [                                        |
|   ["1","2","{\"b\":\"y\",\"a\":\"x\"}"], |
|   ["3","4"],                             |
|   ["5","6"]                              |
| ]                                        |
+------------------------------------------+
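Casting is also how you follow the recommendation to extract nested columns with correct data types. The sketch below materializes a few typed columns from store_data in one statement. The table name store_data_typed, the column aliases, and the choice of fields are hypothetical and only meant to show the pattern.

```sql
-- store_data_typed is a hypothetical table name used only for illustration.
CREATE TABLE store_data_typed AS
SELECT
  raw:owner                       AS owner,              -- already a string
  raw:`zip code`                  AS zip_code,           -- kept as a string
  raw:store.bicycle.price::double AS bicycle_price,
  raw:store.fruit[0].weight::int  AS first_fruit_weight,
  from_json(
    raw:store.book,
    'array<struct<author:string, title:string, category:string, price:double, isbn:string>>'
  )                               AS books
FROM store_data
```

Queries against the typed table can then filter and aggregate on these columns without re-parsing the JSON string on every read.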

NULL behavior

When a JSON field exists with a null value, you receive a SQL null value for that column, not the text value null.

select '{"key":null}':key is null sql_null, '{"key":null}':key == 'null' text_null

+-------------+-----------+
| sql_null    | text_null |
+-------------+-----------+
| true        | null      |
+-------------+-----------+
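Because the result is a SQL null, standard null handling applies to extracted values. The following sketch uses the same inline literal as above; the missing field name, the aliases, and the 'n/a' default are assumptions for illustration. A field that is absent from the record is also expected to come back as a SQL null, as the missing isbn and BICYCLE results earlier in this article suggest.

```sql
SELECT
  '{"key":null}':key IS NULL          AS json_null_is_sql_null,
  '{"key":null}':missing IS NULL      AS missing_field_is_sql_null,
  coalesce('{"key":null}':key, 'n/a') AS with_default
```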

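Transform nested data using Spark SQL operators

Apache Spark has a number of built-in functions for working with complex and nested data. The following notebook contains examples.

Additionally, higher-order functions provide many additional options when the built-in Spark operators cannot transform the data the way you want.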
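As a rough sketch of what higher-order functions look like on this data, the query below parses the book array into structs and then applies filter and transform with lambda expressions. The CTE name parsed, the schema string, the price threshold, and the aliases are assumptions for this example rather than part of the original notebook.

```sql
-- Parse the JSON array once, then work on the native array of structs.
WITH parsed AS (
  SELECT from_json(
    raw:store.book,
    'array<struct<title:string, category:string, price:double>>'
  ) AS books
  FROM store_data
)
SELECT
  -- Keep only fiction books, then project each remaining struct to its title.
  transform(filter(books, b -> b.category = 'fiction'), b -> b.title) AS fiction_titles,
  -- Keep the full structs for books cheaper than 10.
  filter(books, b -> b.price < 10) AS books_under_10
FROM parsed
```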

Complex nested data notebook


Last updated on 2026-06-29