Data types
Supported data types
Apache Spark SQL and DataFrames support the following data types:
- Numeric types
  - ByteType: Represents 1-byte signed integer numbers. The range of numbers is from `-128` to `127`.
  - ShortType: Represents 2-byte signed integer numbers. The range of numbers is from `-32768` to `32767`.
  - IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from `-2147483648` to `2147483647`.
  - LongType: Represents 8-byte signed integer numbers. The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
  - FloatType: Represents 4-byte single-precision floating point numbers.
  - DoubleType: Represents 8-byte double-precision floating point numbers.
  - DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
- String type
  - StringType: Represents character string values.
- Binary type
  - BinaryType: Represents byte sequence values.
- Boolean type
  - BooleanType: Represents Boolean values.
- Datetime types
  - TimestampType: Represents values comprising the fields year, month, day, hour, minute, and second, with the session local time zone. The timestamp value represents an absolute point in time.
  - DateType: Represents values comprising the fields year, month, and day, without a time zone.
- Complex types
  - ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. `containsNull` indicates whether elements in an ArrayType value can have `null` values.
  - MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have `null` values. `valueContainsNull` indicates whether values of a MapType value can have `null` values.
  - StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
    - StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by `name`. The data type of a field is indicated by dataType. `nullable` indicates whether values of these fields can have `null` values.
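The integer ranges listed above follow directly from the byte widths: an n-byte signed (two's-complement) integer spans -2^(8n-1) to 2^(8n-1)-1. The following is a minimal sketch in plain Python, not Spark code; the names `INTEGER_WIDTHS`, `signed_range`, and `fits` are hypothetical helpers introduced only for illustration.

```python
# Byte widths of the signed integer types, as listed in the section above.
INTEGER_WIDTHS = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(type_name, value):
    """Hypothetical helper: True if value lies within the range of type_name."""
    low, high = signed_range(INTEGER_WIDTHS[type_name])
    return low <= value <= high


# Print the range of each integer type.
for name, width in INTEGER_WIDTHS.items():
    print(name, signed_range(width))
```

A check such as `fits("ShortType", 32768)` evaluates to `False`, since 32768 is one past the ShortType maximum.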
Language mappings
Scala
Spark SQL data types are defined in the package org.apache.spark.sql.types. You access them by importing the package:
import org.apache.spark.sql.types._
Data type | Value type | API to access or create data type |
---|---|---|
ByteType | Byte | ByteType |
ShortType | Short | ShortType |
IntegerType | Int | IntegerType |
LongType | Long | LongType |
FloatType | Float | FloatType |
DoubleType | Double | DoubleType |
DecimalType | java.math.BigDecimal | DecimalType |
StringType | String | StringType |
BinaryType | Array[Byte] | BinaryType |
BooleanType | Boolean | BooleanType |
TimestampType | java.sql.Timestamp | TimestampType |
DateType | java.sql.Date | DateType |
ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is true. |
MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is true. |
StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields. Two fields with the same name are not allowed. |
StructField | The value type of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is true. |
Java
Spark SQL data types are defined in the package org.apache.spark.sql.types. To access or create a data type, use the factory methods provided in org.apache.spark.sql.types.DataTypes.
Data type | Value type | API to access or create data type |
---|---|---|
ByteType | byte or Byte | DataTypes.ByteType |
ShortType | short or Short | DataTypes.ShortType |
IntegerType | int or Integer | DataTypes.IntegerType |
LongType | long or Long | DataTypes.LongType |
FloatType | float or Float | DataTypes.FloatType |
DoubleType | double or Double | DataTypes.DoubleType |
DecimalType | java.math.BigDecimal | DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale) |
StringType | String | DataTypes.StringType |
BinaryType | byte[] | DataTypes.BinaryType |
BooleanType | boolean or Boolean | DataTypes.BooleanType |
TimestampType | java.sql.Timestamp | DataTypes.TimestampType |
DateType | java.sql.Date | DataTypes.DateType |
ArrayType | java.util.List | DataTypes.createArrayType(elementType) (Note: The value of containsNull is true.) or DataTypes.createArrayType(elementType, containsNull) |
MapType | java.util.Map | DataTypes.createMapType(keyType, valueType) (Note: The value of valueContainsNull is true.) or DataTypes.createMapType(keyType, valueType, valueContainsNull) |
StructType | org.apache.spark.sql.Row | DataTypes.createStructType(fields). Note: fields is a List or an array of StructFields. Two fields with the same name are not allowed. |
StructField | The value type of the data type of this field (for example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable) |
Python
Spark SQL data types are defined in the package pyspark.sql.types. You access them by importing the package:
from pyspark.sql.types import *
Data type | Value type | API to access or create data type |
---|---|---|
ByteType | int or long. Note: Numbers are converted to 1-byte signed integer numbers at runtime. Make sure that numbers are within the range of -128 to 127. | ByteType() |
ShortType | int or long. Note: Numbers are converted to 2-byte signed integer numbers at runtime. Make sure that numbers are within the range of -32768 to 32767. | ShortType() |
IntegerType | int or long | IntegerType() |
LongType | long. Note: Numbers are converted to 8-byte signed integer numbers at runtime. Make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. | LongType() |
FloatType | float. Note: Numbers are converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |
DoubleType | float | DoubleType() |
DecimalType | decimal.Decimal | DecimalType() |
StringType | string | StringType() |
BinaryType | bytearray | BinaryType() |
BooleanType | bool | BooleanType() |
TimestampType | datetime.datetime | TimestampType() |
DateType | datetime.date | DateType() |
ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is True. |
MapType | dict | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is True. |
StructType | list or tuple | StructType(fields). Note: fields is a list of StructFields. Two fields with the same name are not allowed. |
StructField | The value type of the data type of this field (for example, int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is True. |
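The LongType note in the table above says that numbers outside the 8-byte signed range should be converted to decimal.Decimal and used with DecimalType. The following is a minimal sketch of that check in plain Python 3 (where a single `int` type covers both of the `int` and `long` value types named in the table). The helper `long_or_decimal` is a hypothetical name for illustration and is not part of pyspark.sql.types.

```python
from decimal import Decimal

# Bounds of an 8-byte signed integer, matching the LongType range in the table.
LONG_MIN = -(2 ** 63)   # -9223372036854775808
LONG_MAX = 2 ** 63 - 1  # 9223372036854775807


def long_or_decimal(n):
    """Hypothetical helper: keep n as an int if it fits LongType, else return a Decimal."""
    if LONG_MIN <= n <= LONG_MAX:
        return n
    # Out of range for LongType: represent the value as decimal.Decimal instead.
    return Decimal(n)


print(type(long_or_decimal(42)).__name__)
print(type(long_or_decimal(2 ** 63)).__name__)
```

Values returned as `Decimal` would then go into a column declared with DecimalType rather than LongType, as the table note suggests.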
R
Data type | Value type | API to access or create data type |
---|---|---|
ByteType | integer. Note: Numbers are converted to 1-byte signed integer numbers at runtime. Make sure that numbers are within the range of -128 to 127. | "byte" |
ShortType | integer. Note: Numbers are converted to 2-byte signed integer numbers at runtime. Make sure that numbers are within the range of -32768 to 32767. | "short" |
IntegerType | integer | "integer" |
LongType | integer. Note: Numbers are converted to 8-byte signed integer numbers at runtime. Make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. | "long" |
FloatType | numeric. Note: Numbers are converted to 4-byte single-precision floating point numbers at runtime. | "float" |
DoubleType | numeric | "double" |
DecimalType | Not supported | Not supported |
StringType | character | "string" |
BinaryType | raw | "binary" |
BooleanType | logical | "bool" |
TimestampType | POSIXct | "timestamp" |
DateType | Date | "date" |
ArrayType | vector or list | list(type="array", elementType=elementType, containsNull=[containsNull]). Note: The default value of containsNull is TRUE. |
MapType | environment | list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]). Note: The default value of valueContainsNull is TRUE. |
StructType | named list | list(type="struct", fields=fields). Note: fields is a Seq of StructFields. Two fields with the same name are not allowed. |
StructField | The value type of the data type of this field (for example, integer for a StructField with the data type IntegerType) | list(name=name, type=dataType, nullable=[nullable]). Note: The default value of nullable is TRUE. |
SQL
The following table shows the type names and aliases used in the Spark SQL parser for each data type.
Data type | SQL name |
---|---|
BooleanType | BOOLEAN |
ByteType | BYTE, TINYINT |
ShortType | SHORT, SMALLINT |
IntegerType | INT, INTEGER |
LongType | LONG, BIGINT |
FloatType | FLOAT, REAL |
DoubleType | DOUBLE |
DateType | DATE |
TimestampType | TIMESTAMP |
StringType | STRING |
BinaryType | BINARY |
DecimalType | DECIMAL, DEC, NUMERIC |
CalendarIntervalType | INTERVAL |
ArrayType | ARRAY<element_type> |
StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, …> |
MapType | MAP<key_type, value_type> |
Special floating point values
Spark SQL supports several special floating point values in a case-insensitive manner:
- Inf, +Inf, Infinity, +Infinity: positive infinity
  - FloatType: equivalent to Scala `Float.PositiveInfinity`.
  - DoubleType: equivalent to Scala `Double.PositiveInfinity`.
- -Inf, -Infinity: negative infinity
  - FloatType: equivalent to Scala `Float.NegativeInfinity`.
  - DoubleType: equivalent to Scala `Double.NegativeInfinity`.
- NaN: not a number
  - FloatType: equivalent to Scala `Float.NaN`.
  - DoubleType: equivalent to Scala `Double.NaN`.
Positive and negative infinity semantics
Positive and negative infinity are handled specially. They have the following semantics:
- Positive infinity multiplied by any positive value returns positive infinity.
- Negative infinity multiplied by any positive value returns negative infinity.
- Positive infinity multiplied by any negative value returns negative infinity.
- Negative infinity multiplied by any negative value returns positive infinity.
- Positive/negative infinity multiplied by 0 returns NaN.
- Positive/negative infinity is equal to itself.
- In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are grouped together.
- Positive infinity and negative infinity are treated as normal values in join keys.
- Positive infinity sorts lower than NaN and higher than any other value.
- Negative infinity sorts lower than any other value.
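The multiplication and equality rules in the list above coincide with ordinary IEEE 754 floating point arithmetic, so they can be checked without Spark. The following is a minimal sketch using plain Python floats. It does not exercise Spark itself, and it covers only the arithmetic and equality rules, not aggregation, join keys, or sort order. The name `checks` is introduced here for illustration.

```python
import math

pos_inf = float("inf")
neg_inf = float("-inf")

# Each entry pairs a rule from the list above with whether it holds for Python floats.
checks = {
    "inf * positive = inf": pos_inf * 2.5 == pos_inf,
    "-inf * positive = -inf": neg_inf * 2.5 == neg_inf,
    "inf * negative = -inf": pos_inf * -2.5 == neg_inf,
    "-inf * negative = inf": neg_inf * -2.5 == pos_inf,
    "inf * 0 is NaN": math.isnan(pos_inf * 0),
    "-inf * 0 is NaN": math.isnan(neg_inf * 0),
    "inf = inf": pos_inf == pos_inf,
    "-inf = -inf": neg_inf == neg_inf,
}

for rule, holds in checks.items():
    print(rule, holds)
```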
NaN semantics
NaN is handled specially for `float` and `double` types, in ways that do not exactly match standard floating point semantics. Specifically:
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
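Because plain Python follows standard floating point semantics (NaN is not equal to itself), the rules above have to be modeled explicitly to see their effect outside Spark. The following sketch imitates the ordering and grouping behavior with a sort key and a normalized group label. It is an illustration only, not how Spark implements these rules; `spark_like_key` and `group_label` are hypothetical helpers.

```python
import math
from collections import Counter


def spark_like_key(x):
    """Sort key that places NaN after every other value, including +Infinity."""
    return (1, 0.0) if math.isnan(x) else (0, x)


def group_label(x):
    """Map every NaN to one shared label so that all NaN values form a single group."""
    return "NaN" if math.isnan(x) else x


values = [float("nan"), 1.0, float("inf"), float("-inf"), float("nan"), float("inf")]

# Ascending order under the modeled rules: -inf, finite values, +inf, then NaN.
ordered = sorted(values, key=spark_like_key)

# Grouping under the modeled rules: NaN values are counted together.
counts = Counter(group_label(v) for v in values)

print(ordered)
print(counts["NaN"], counts[float("inf")], counts[float("-inf")])
```

Using a tuple as the sort key keeps NaN out of any direct float comparison, which would otherwise give an unpredictable order.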
Examples
SELECT double('infinity') AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+
SELECT float('-inf') AS col;
+---------+
|      col|
+---------+
|-Infinity|
+---------+
SELECT float('NaN') AS col;
+---+
|col|
+---+
|NaN|
+---+
SELECT double('infinity') * 0 AS col;
+---+
|col|
+---+
|NaN|
+---+
SELECT double('-infinity') * (-1234567) AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+
SELECT double('infinity') < double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+
SELECT double('NaN') = double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+
SELECT double('inf') = double('infinity') AS col;
+----+
| col|
+----+
|true|
+----+
CREATE TABLE test (c1 int, c2 double);
INSERT INTO test VALUES (1, double('infinity'));
INSERT INTO test VALUES (2, double('infinity'));
INSERT INTO test VALUES (3, double('inf'));
INSERT INTO test VALUES (4, double('-inf'));
INSERT INTO test VALUES (5, double('NaN'));
INSERT INTO test VALUES (6, double('NaN'));
INSERT INTO test VALUES (7, double('-infinity'));
SELECT COUNT(*), c2 FROM test GROUP BY c2;
+---------+---------+
| count(1)|       c2|
+---------+---------+
|        2|      NaN|
|        2|-Infinity|
|        3| Infinity|
+---------+---------+