JSON files
You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel.
Single-line mode
In this example, there is one JSON object per line:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
To read the JSON data, use code like the following:
val df = spark.read.json("example.json")
Spark infers the schema automatically.
df.printSchema
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
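If you want to skip the inference pass or pin down column types yourself, you can pass an explicit schema to the reader instead. A minimal sketch that mirrors the inferred schema above (field names and the file name are taken from the example; dfTyped is an illustrative name):

import org.apache.spark.sql.types._

// Declare the schema up front so Spark does not need to scan the data to infer it.
val schema = StructType(Seq(
  StructField("array", ArrayType(LongType)),
  StructField("dict", StructType(Seq(
    StructField("extra_key", StringType),
    StructField("key", StringType)
  ))),
  StructField("int", LongType),
  StructField("string", StringType)
))

val dfTyped = spark.read.schema(schema).json("example.json")

Supplying a schema also guards against type drift: records that do not match the declared types surface as nulls instead of silently changing the inferred column types.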
Multi-line mode
If a JSON object occupies multiple lines, you must enable multi-line mode for Spark to load the file. Files are loaded as a whole entity and cannot be split.
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
To read this JSON data, enable multi-line mode:
val mdf = spark.read.option("multiline", "true").json("multi.json")
mdf.show(false)
+---------+---------------------+---+-------+
|array |dict |int|string |
+---------+---------------------+---+-------+
|[1, 2, 3]|[null,value1] |1 |string1|
|[2, 4, 6]|[null,value2] |2 |string2|
|[3, 6, 9]|[extra_value3,value3]|3 |string3|
+---------+---------------------+---+-------+
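The dict column comes back as a struct, so its fields can be addressed with dot notation. A short follow-on sketch using the column names from the example above:

// Select individual fields out of the nested struct column.
mdf.select("string", "dict.key", "dict.extra_key").show(false)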
Charset auto-detection
By default, Spark detects the charset of input files automatically, but you can always specify the charset explicitly via this option:
spark.read.option("charset", "UTF-16BE").json("fileInUTF16.json")
Some supported charsets are: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, and UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
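Reader options can be chained, so an explicit charset can be combined with multi-line mode when both apply. A sketch, assuming a multi-line JSON file encoded in UTF-16BE (the file name is illustrative):

// Combine an explicit charset with multi-line mode for a single read.
val df16 = spark.read
  .option("charset", "UTF-16BE")
  .option("multiline", "true")
  .json("multiInUTF16.json")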