Data types

Supported data types

Apache Spark SQL and DataFrames support the following data types:

  • Numeric types
    • ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
    • ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
    • IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
    • LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
    • FloatType: Represents 4-byte single-precision floating point numbers.
    • DoubleType: Represents 8-byte double-precision floating point numbers.
    • DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
  • String type
    • StringType: Represents character string values.
  • Binary type
    • BinaryType: Represents byte sequence values.
  • Boolean type
    • BooleanType: Represents Boolean values.
  • Datetime types
    • TimestampType: Represents values comprising the fields year, month, day, hour, minute, and second, interpreted in the session local time zone. The timestamp value represents an absolute point in time.
    • DateType: Represents values comprising the fields year, month, and day, without a time zone.
  • Complex types
    • ArrayType(elementType, containsNull): Represents values comprising a sequence of elements of type elementType. containsNull indicates whether elements in an ArrayType value can be null.
    • MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. Keys in a MapType value cannot be null. valueContainsNull indicates whether the values of a MapType value can be null.
    • StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
      • StructField(name, dataType, nullable): Represents a field in a StructType. name gives the name of the field and dataType gives its data type. nullable indicates whether values of this field can be null.
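The integer ranges listed above follow directly from the byte widths, and the DecimalType description (an unscaled integer plus a scale) can be illustrated with Python's own decimal module. The following is a minimal sketch in plain Python, not Spark code; the helper names signed_range and unscaled_and_scale are made up for this illustration.

```python
from decimal import Decimal


def signed_range(num_bytes):
    """Return (min, max) of a two's-complement signed integer with num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Byte widths copied from the list of numeric types above.
INTEGER_WIDTHS = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}
RANGES = {name: signed_range(width) for name, width in INTEGER_WIDTHS.items()}


def unscaled_and_scale(value):
    """Split a finite Decimal into (unscaled, scale) so that
    value == unscaled * 10 ** -scale, mirroring the BigDecimal description."""
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


for name, (low, high) in RANGES.items():
    print(f"{name}: {low} to {high}")

print(unscaled_and_scale(Decimal("123.45")))
```

Here Decimal("123.45") splits into the unscaled value 12345 and the scale 2, which is the same decomposition the BigDecimal description refers to.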

Language mappings

Scala

Spark SQL data types are defined in the package org.apache.spark.sql.types. You access them by importing the package:

import org.apache.spark.sql.types._

| Data type | Value type | API to access or create data type |
|---|---|---|
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is true. |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is true. |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is true. |

Java

Spark SQL data types are defined in the package org.apache.spark.sql.types. To access or create a data type, use the factory methods provided in org.apache.spark.sql.types.DataTypes.

| Data type | Value type | API to access or create data type |
|---|---|---|
| ByteType | byte or Byte | DataTypes.ByteType |
| ShortType | short or Short | DataTypes.ShortType |
| IntegerType | int or Integer | DataTypes.IntegerType |
| LongType | long or Long | DataTypes.LongType |
| FloatType | float or Float | DataTypes.FloatType |
| DoubleType | double or Double | DataTypes.DoubleType |
| DecimalType | java.math.BigDecimal | DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale) |
| StringType | String | DataTypes.StringType |
| BinaryType | byte[] | DataTypes.BinaryType |
| BooleanType | boolean or Boolean | DataTypes.BooleanType |
| TimestampType | java.sql.Timestamp | DataTypes.TimestampType |
| DateType | java.sql.Date | DataTypes.DateType |
| ArrayType | java.util.List | DataTypes.createArrayType(elementType) (Note: The value of containsNull is true) or DataTypes.createArrayType(elementType, containsNull) |
| MapType | java.util.Map | DataTypes.createMapType(keyType, valueType) (Note: The value of valueContainsNull is true) or DataTypes.createMapType(keyType, valueType, valueContainsNull) |
| StructType | org.apache.spark.sql.Row | DataTypes.createStructType(fields). Note: fields is a List or an array of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type of the data type of this field (for example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable) |

Python

Spark SQL data types are defined in the package pyspark.sql.types. You access them by importing the package:

from pyspark.sql.types import *

| Data type | Value type | API to access or create data type |
|---|---|---|
| ByteType | int or long. Note: Numbers are converted to 1-byte signed integer numbers at runtime. Make sure that numbers are within the range of -128 to 127. | ByteType() |
| ShortType | int or long. Note: Numbers are converted to 2-byte signed integer numbers at runtime. Make sure that numbers are within the range of -32768 to 32767. | ShortType() |
| IntegerType | int or long | IntegerType() |
| LongType | long. Note: Numbers are converted to 8-byte signed integer numbers at runtime. Make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. | LongType() |
| FloatType | float. Note: Numbers are converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |
| DoubleType | float | DoubleType() |
| DecimalType | decimal.Decimal | DecimalType() |
| StringType | string | StringType() |
| BinaryType | bytearray | BinaryType() |
| BooleanType | bool | BooleanType() |
| TimestampType | datetime.datetime | TimestampType() |
| DateType | datetime.date | DateType() |
| ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is True. |
| MapType | dict | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is True. |
| StructType | list or tuple | StructType(fields). Note: fields is a list of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type of the data type of this field (for example, int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is True. |
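Because Python integers are unbounded, the range notes in the table above are easy to violate by accident. One possible way to check values before loading them is sketched below in plain Python (no Spark required). The helper names fits_signed, out_of_range and to_long_or_decimal are hypothetical and not part of pyspark.

```python
from decimal import Decimal

# Byte widths of the integer types, as given in the notes above.
WIDTHS = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}


def fits_signed(value, num_bytes):
    """True if value fits a signed integer that is num_bytes bytes wide."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


def out_of_range(values, type_name):
    """Return the non-null values that do not fit the named integer type."""
    width = WIDTHS[type_name]
    return [v for v in values if v is not None and not fits_signed(v, width)]


def to_long_or_decimal(value):
    """Keep value as int if it fits 8 bytes; otherwise convert it to
    decimal.Decimal, as the LongType note suggests (for use with DecimalType)."""
    return value if fits_signed(value, 8) else Decimal(value)


print(out_of_range([0, 127, 128, -129, None], "ByteType"))
print(repr(to_long_or_decimal(2 ** 63)))
```

Running a check like this before building a DataFrame makes out-of-range values visible up front rather than at runtime conversion.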

R

| Data type | Value type | API to access or create data type |
|---|---|---|
| ByteType | integer. Note: Numbers are converted to 1-byte signed integer numbers at runtime. Make sure that numbers are within the range of -128 to 127. | "byte" |
| ShortType | integer. Note: Numbers are converted to 2-byte signed integer numbers at runtime. Make sure that numbers are within the range of -32768 to 32767. | "short" |
| IntegerType | integer | "integer" |
| LongType | integer. Note: Numbers are converted to 8-byte signed integer numbers at runtime. Make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. | "long" |
| FloatType | numeric. Note: Numbers are converted to 4-byte single-precision floating point numbers at runtime. | "float" |
| DoubleType | numeric | "double" |
| DecimalType | Not supported | Not supported |
| StringType | character | "string" |
| BinaryType | raw | "binary" |
| BooleanType | logical | "bool" |
| TimestampType | POSIXct | "timestamp" |
| DateType | Date | "date" |
| ArrayType | vector or list | list(type="array", elementType=elementType, containsNull=[containsNull]). Note: The default value of containsNull is TRUE. |
| MapType | environment | list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]). Note: The default value of valueContainsNull is TRUE. |
| StructType | named list | list(type="struct", fields=fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type of the data type of this field (for example, integer for a StructField with the data type IntegerType) | list(name=name, type=dataType, nullable=[nullable]). Note: The default value of nullable is TRUE. |

SQL

The following table shows the type names and aliases that the Spark SQL parser uses for each data type.

| Data type | SQL name |
|---|---|
| BooleanType | BOOLEAN |
| ByteType | BYTE, TINYINT |
| ShortType | SHORT, SMALLINT |
| IntegerType | INT, INTEGER |
| LongType | LONG, BIGINT |
| FloatType | FLOAT, REAL |
| DoubleType | DOUBLE |
| DateType | DATE |
| TimestampType | TIMESTAMP |
| StringType | STRING |
| BinaryType | BINARY |
| DecimalType | DECIMAL, DEC, NUMERIC |
| CalendarIntervalType | INTERVAL |
| ArrayType | ARRAY<element_type> |
| StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, …> |
| MapType | MAP<key_type, value_type> |

Special floating point values

Spark SQL supports several special floating point values, which are matched case-insensitively:

  • Inf, +Inf, Infinity, +Infinity: positive infinity
    • FloatType: equivalent to Scala Float.PositiveInfinity.
    • DoubleType: equivalent to Scala Double.PositiveInfinity.
  • -Inf, -Infinity: negative infinity
    • FloatType: equivalent to Scala Float.NegativeInfinity.
    • DoubleType: equivalent to Scala Double.NegativeInfinity.
  • NaN: not a number
    • FloatType: equivalent to Scala Float.NaN.
    • DoubleType: equivalent to Scala Double.NaN.

Positive and negative infinity semantics

Positive and negative infinity are handled specially. They have the following semantics:

  • Positive infinity multiplied by any positive value returns positive infinity.
  • Negative infinity multiplied by any positive value returns negative infinity.
  • Positive infinity multiplied by any negative value returns negative infinity.
  • Negative infinity multiplied by any negative value returns positive infinity.
  • Positive or negative infinity multiplied by 0 returns NaN.
  • Positive or negative infinity is equal to itself.
  • In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are grouped together.
  • Positive infinity and negative infinity are treated as normal values in join keys.
  • Positive infinity sorts lower than NaN and higher than any other value.
  • Negative infinity sorts lower than any other value.
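The multiplication and equality rules above coincide with ordinary IEEE 754 behavior, so they can be tried out with plain Python floats. The sketch below is only an analogy to the Spark SQL behavior, not Spark code. The ordering of infinity relative to NaN is Spark-specific and is not shown here.

```python
import math

# Python's float() also accepts these spellings case-insensitively.
pos_inf = float("Infinity")
neg_inf = float("-inf")

products = {
    "inf * 2.5": pos_inf * 2.5,
    "-inf * 2.5": neg_inf * 2.5,
    "inf * -2.5": pos_inf * -2.5,
    "-inf * -2.5": neg_inf * -2.5,
    "inf * 0": pos_inf * 0,
}
for expression, value in products.items():
    print(f"{expression} = {value}")

# Infinity is equal to itself regardless of how it was spelled.
print(pos_inf == float("+INF"))

# Among ordinary numbers, -inf sorts first and +inf sorts last.
ordered = sorted([pos_inf, 1.0, neg_inf, 0.0])
print(ordered)
```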

NaN semantics

NaN is handled specially for float and double types, in ways that do not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
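To see how these rules differ from standard floating point semantics, compare them with plain Python floats, where NaN is not equal to itself and has no defined ordering. The sketch below emulates the equality and ordering rules listed above in plain Python. The helper names spark_equal, spark_sort_key and spark_less are hypothetical and only model the described behavior.

```python
import math

nan = float("nan")
inf = float("inf")


def spark_equal(a, b):
    """Equality as described above: NaN = NaN is true."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


def spark_sort_key(x):
    """Sort key that places NaN after every other value, including +inf."""
    is_nan = math.isnan(x)
    return (is_nan, 0.0 if is_nan else x)


def spark_less(a, b):
    """a < b under the ordering described above."""
    return spark_sort_key(a) < spark_sort_key(b)


print(nan == nan)             # standard floating point comparison
print(spark_equal(nan, nan))  # comparison under the rules above
print(spark_less(inf, nan))
print(sorted([nan, 1.0, inf, -inf, nan, 0.0], key=spark_sort_key))
```

The key function maps every NaN to the same tuple, which keeps the sort well defined even though NaN itself cannot be compared.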

Examples

SELECT double('infinity') AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+

SELECT float('-inf') AS col;
+---------+
|      col|
+---------+
|-Infinity|
+---------+

SELECT float('NaN') AS col;
+---+
|col|
+---+
|NaN|
+---+

SELECT double('infinity') * 0 AS col;
+---+
|col|
+---+
|NaN|
+---+

SELECT double('-infinity') * (-1234567) AS col;
+--------+
|     col|
+--------+
|Infinity|
+--------+

SELECT double('infinity') < double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+

SELECT double('NaN') = double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+

SELECT double('inf') = double('infinity') AS col;
+----+
| col|
+----+
|true|
+----+

CREATE TABLE test (c1 int, c2 double);
INSERT INTO test VALUES (1, double('infinity'));
INSERT INTO test VALUES (2, double('infinity'));
INSERT INTO test VALUES (3, double('inf'));
INSERT INTO test VALUES (4, double('-inf'));
INSERT INTO test VALUES (5, double('NaN'));
INSERT INTO test VALUES (6, double('NaN'));
INSERT INTO test VALUES (7, double('-infinity'));
SELECT COUNT(*), c2 FROM test GROUP BY c2;
+---------+---------+
| count(1)|       c2|
+---------+---------+
|        2|      NaN|
|        2|-Infinity|
|        3| Infinity|
+---------+---------+
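The grouping result in the last example can be mimicked in plain Python to show why NaN needs special treatment as a grouping key. This is a sketch of the semantics only, not of how Spark implements aggregation. The rows list repeats the values inserted into the test table, and the group_key helper is hypothetical.

```python
import math
from collections import Counter

# Same (c1, c2) rows as the INSERT statements in the example above.
rows = [
    (1, float("infinity")),
    (2, float("infinity")),
    (3, float("inf")),
    (4, float("-inf")),
    (5, float("NaN")),
    (6, float("NaN")),
    (7, float("-infinity")),
]


def group_key(x):
    """Map every NaN to one shared key so that all NaN rows land in one group."""
    return "NaN" if math.isnan(x) else x


# Without normalization, each NaN object is its own key (NaN != NaN).
naive = Counter(c2 for _, c2 in rows)

# With normalization, the counts match the GROUP BY output above.
counts = Counter(group_key(c2) for _, c2 in rows)

print(len(naive))
for key, count in counts.items():
    print(count, key)
```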