JSON files

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel.

Single-line mode

In this example, there is one JSON object per line:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use code like the following:

val df = spark.read.json("example.json")

Spark infers the schema automatically.

df.printSchema
root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)
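
If you prefer not to rely on inference, you can also pass an explicit schema to the reader; a minimal sketch, where the schema simply mirrors the inferred one above:

import org.apache.spark.sql.types._

// Explicit schema matching the example records above.
val schema = StructType(Seq(
  StructField("array", ArrayType(LongType)),
  StructField("dict", StructType(Seq(
    StructField("extra_key", StringType),
    StructField("key", StringType)
  ))),
  StructField("int", LongType),
  StructField("string", StringType)
))

val typedDf = spark.read.schema(schema).json("example.json")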

Multi-line mode

If a JSON object occupies multiple lines, you must enable multi-line mode for Spark to load the file. Files are then loaded as a whole entity and cannot be split.

[
    {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
    {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    {
        "string": "string3",
        "int": 3,
        "array": [
            3,
            6,
            9
        ],
        "dict": {
            "key": "value3",
            "extra_key": "extra_value3"
        }
    }
]

To read this JSON data, enable multi-line mode:

val mdf = spark.read.option("multiline", "true").json("multi.json")
mdf.show(false)
+---------+---------------------+---+-------+
|array    |dict                 |int|string |
+---------+---------------------+---+-------+
|[1, 2, 3]|[null,value1]        |1  |string1|
|[2, 4, 6]|[null,value2]        |2  |string2|
|[3, 6, 9]|[extra_value3,value3]|3  |string3|
+---------+---------------------+---+-------+
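
Conversely, loading this file without the multiline option makes Spark parse each physical line as a separate JSON document, so the pretty-printed lines land in the _corrupt_record column (a minimal sketch, assuming the default PERMISSIVE parse mode):

val bad = spark.read.json("multi.json")  // multiline not enabled
bad.printSchema
root
 |-- _corrupt_record: string (nullable = true)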

Charset auto-detection

By default, Spark detects the charset of input files automatically, but you can always specify the charset explicitly via this option:

spark.read.option("charset", "UTF-16BE").json("fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, and UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
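
The charset option can be combined with the other reader options shown above; a minimal sketch, assuming a hypothetical UTF-16BE file saved in multi-line form:

val utf16Mdf = spark.read
  .option("multiline", "true")
  .option("charset", "UTF-16BE")
  .json("multiInUTF16.json")  // hypothetical file name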

Read JSON files notebook

Get notebook