Datetime patterns

There are several common scenarios for datetime usage in Spark:

  • CSV and JSON data sources use the pattern string for parsing and formatting datetime content.
  • Datetime functions that convert StringType to and from DateType or TimestampType, for example unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp, from_utc_timestamp, and to_utc_timestamp.
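
As a minimal sketch of these scenarios in Spark SQL, the queries below pass a pattern string to a few of the functions listed above and to a CSV option. The sample values are made up for illustration, and the commented results assume an English-locale session.

```sql
-- Parse a string into a DateType value using an explicit pattern.
select to_date('28/07/2020', 'dd/MM/yyyy');
-- 2020-07-28

-- Parse a string into a TimestampType value.
select to_timestamp('28/07/2020 14:30', 'dd/MM/yyyy HH:mm');
-- 2020-07-28 14:30:00

-- Format a timestamp back into a string.
select date_format(timestamp '2020-07-28 14:30:55', 'yyyy/MM/dd HH:mm');
-- 2020/07/28 14:30

-- The CSV writer applies the same kind of pattern through the dateFormat option.
select to_csv(named_struct('date', date '2020-07-28'), map('dateFormat', 'dd/MM/yyyy'));
-- 28/07/2020
```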

Spark uses the pattern letters in the following table for date and timestamp parsing and formatting:

| Symbol | Meaning | Presentation | Examples |
|--------|---------|--------------|----------|
| G | era | text | AD; Anno Domini |
| y | year | year | 2020; 20 |
| D | day-of-year | number(3) | 189 |
| M/L | month-of-year | month | 7; 07; Jul; July |
| d | day-of-month | number(3) | 28 |
| Q/q | quarter-of-year | number/text | 3; 03; Q3; 3rd quarter |
| E | day-of-week | text | Tue; Tuesday |
| F | aligned day of week in month | number(1) | 3 |
| a | am-pm-of-day | am-pm | PM |
| h | clock-hour-of-am-pm (1-12) | number(2) | 12 |
| K | hour-of-am-pm (0-11) | number(2) | 0 |
| k | clock-hour-of-day (1-24) | number(2) | 24 |
| H | hour-of-day (0-23) | number(2) | 0 |
| m | minute-of-hour | number(2) | 30 |
| s | second-of-minute | number(2) | 55 |
| S | fraction-of-second | fraction | 978 |
| V | time-zone ID | zone-id | America/Los_Angeles; Z; -08:30 |
| z | time-zone name | zone-name | Pacific Standard Time; PST |
| O | localized zone-offset | offset-O | GMT+8; GMT+08:00; UTC-08:00 |
| X | zone-offset 'Z' for zero | offset-X | Z; -08; -0830; -08:30; -083015; -08:30:15 |
| x | zone-offset | offset-x | +0000; -08; -0830; -08:30; -083015; -08:30:15 |
| Z | zone-offset | offset-Z | +0000; -0800; -08:00 |
| ' | escape for text | delimiter | |
| '' | single quote | literal | ' |
| [ | optional section start | | |
| ] | optional section end | | |
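
To illustrate how several letters from the table combine, and how the quote symbols behave, here is a short Spark SQL sketch. The sample date is an arbitrary choice, and the commented results assume an English-locale session.

```sql
-- Literal text such as T is wrapped in single quotes inside the pattern.
select date_format(timestamp '2020-07-07 14:30:55.978', "yyyy-MM-dd'T'HH:mm:ss.SSS");
-- 2020-07-07T14:30:55.978

-- Two consecutive single quotes produce one literal single quote.
select date_format(date '2020-07-07', "yyyy''MM");
-- 2020'07

-- Day-of-year followed by era.
select date_format(date '2020-07-07', 'D G');
-- 189 AD
```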

The count of pattern letters determines the format.

  • Text: The text style is determined by the number of pattern letters used. Fewer than 4 pattern letters use the short text form, typically an abbreviation; for example, day-of-week Monday might output "Mon". Exactly 4 pattern letters use the full text form, typically the full description; for example, day-of-week Monday might output "Monday". 5 or more letters will fail.
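
    A small sketch of the text rule, using the day-of-week and era letters on an arbitrary sample date; the results assume an English locale.

    ```sql
    select date_format(date '2020-07-07', 'E');     -- Tue
    select date_format(date '2020-07-07', 'EEE');   -- Tue
    select date_format(date '2020-07-07', 'EEEE');  -- Tuesday
    select date_format(date '2020-07-07', 'G');     -- AD
    select date_format(date '2020-07-07', 'GGGG');  -- Anno Domini

    -- Five letters exceed the limit, so this query is expected to fail.
    select date_format(date '2020-07-07', 'EEEEE');
    ```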

  • Number(n): Here n is the maximum count of letters that this type of datetime pattern can use. If the count of letters is one, the value is output using the minimum number of digits and without padding. Otherwise, the count of letters is used as the width of the output field, with the value zero-padded as necessary.
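
    The padding behaviour can be sketched with the hour, minute, second, and day-of-year letters. The sample values are illustrative only.

    ```sql
    -- One letter: minimum number of digits, no padding.
    select date_format(timestamp '2020-01-05 09:05:03', 'H:m:s');
    -- 9:5:3

    -- Two letters: zero-padded to width 2.
    select date_format(timestamp '2020-01-05 09:05:03', 'HH:mm:ss');
    -- 09:05:03

    -- Day-of-year is number(3), so up to three letters are accepted.
    select date_format(date '2020-01-05', 'D');    -- 5
    select date_format(date '2020-01-05', 'DDD');  -- 005

    -- Hour-of-day is number(2), so three letters are expected to fail.
    select date_format(timestamp '2020-01-05 09:05:03', 'HHH');
    ```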

  • Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise, use the Number rules above.
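
    Quarter-of-year is a number/text field, so it makes a convenient illustration. This sketch assumes an English locale and an arbitrary sample date.

    ```sql
    -- One or two letters follow the Number rules.
    select date_format(date '2020-07-07', 'Q');     -- 3
    select date_format(date '2020-07-07', 'QQ');    -- 03

    -- Three or four letters follow the Text rules.
    select date_format(date '2020-07-07', 'QQQ');   -- Q3
    select date_format(date '2020-07-07', 'QQQQ');  -- 3rd quarter
    ```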

  • Fraction: Use one or more (up to 9) contiguous 'S' characters, for example SSSSSS, to parse and format the fraction of a second. For parsing, the acceptable fraction length ranges from 1 to the number of contiguous 'S' characters. For formatting, the fraction is padded with zeros to the number of contiguous 'S' characters. Spark supports datetimes with microsecond precision, which have up to 6 significant digits, but it can parse nanosecond values, truncating the digits beyond microseconds.
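
    A sketch of the fraction rules for both directions follows. The input strings are invented for illustration.

    ```sql
    -- Formatting pads the fraction with zeros up to the number of S letters.
    select date_format(timestamp '2020-07-07 14:30:55.12', 'ss.SSSS');
    -- 55.1200

    -- Parsing accepts fewer fraction digits than there are S letters.
    select to_timestamp('2020-07-07 14:30:55.1', 'yyyy-MM-dd HH:mm:ss.SSS');
    -- 2020-07-07 14:30:55.1

    -- Nanosecond input is parsed, but digits beyond microseconds are truncated.
    select to_timestamp('2020-07-07 14:30:55.123456789', 'yyyy-MM-dd HH:mm:ss.SSSSSSSSS');
    -- 2020-07-07 14:30:55.123456
    ```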

  • Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, a reduced two-digit form is used. For printing, this outputs the rightmost two digits. For parsing, this parses using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present. 7 or more letters will fail.
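
    The year rules can be sketched as follows, again with an arbitrary sample date.

    ```sql
    -- Two letters print the rightmost two digits.
    select date_format(date '2020-07-07', 'yy');     -- 20

    -- Other counts set the minimum width; shorter values are zero-padded.
    select date_format(date '2020-07-07', 'y');      -- 2020
    select date_format(date '2020-07-07', 'yyyy');   -- 2020
    select date_format(date '2020-07-07', 'yyyyy');  -- 02020

    -- Parsing a two-digit year uses the base value 2000.
    select to_date('20-07-07', 'yy-MM-dd');
    -- 2020-07-07
    ```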

  • Month: It follows the Number/Text rule. The text form depends on the letter: 'M' denotes the "standard" form, and 'L' denotes the "stand-alone" form. The two forms differ only in certain languages. For example, in Russian, "Июль" is the stand-alone form of July, and "Июля" is the standard form. Here are examples for all supported pattern letters:

    • 'M' or 'L': Month number in a year, starting from 1. There is no difference between 'M' and 'L'. Months 1 to 9 are printed without padding.

      select date_format(date '1970-01-01', "M");
      1
      select date_format(date '1970-12-01', "L");
      12
      
    • 'MM' or 'LL': Month number in a year, starting from 1. Zero padding is added for months 1 to 9.

      select date_format(date '1970-1-01', "LL");
      01
      select date_format(date '1970-09-01', "MM");
      09
      
    • 'MMM': Short textual representation in the standard form. The month pattern should be part of a date pattern, not just a stand-alone month, except in locales where there is no difference between the standard and stand-alone forms, such as English.

      select date_format(date '1970-01-01', "d MMM");
      1 Jan
      select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'dd MMM', 'locale', 'RU'));
      01 янв.
      
    • 'LLL': Short textual representation in the stand-alone form. It should be used to format/parse only months without any other date fields.

      select date_format(date '1970-01-01', "LLL");
      Jan
      select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'LLL', 'locale', 'RU'));
      янв.
      
    • 'MMMM': Full textual month representation in the standard form. It is used for parsing/formatting months as part of dates/timestamps.

      select date_format(date '1970-01-01', "d MMMM");
      1 January
      select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'd MMMM', 'locale', 'RU'));
      1 января
      
    • 'LLLL': Full textual month representation in the stand-alone form. The pattern can be used to format/parse only months.

      select date_format(date '1970-01-01', "LLLL");
      January
      select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'LLLL', 'locale', 'RU'));
      январь
      
  • am-pm: This outputs the am-pm-of-day. Pattern letter count must be 1.
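
    A brief sketch pairing the am-pm letter with the 12-hour clock letter; the result assumes an English locale.

    ```sql
    select date_format(timestamp '2020-07-07 14:30:00', 'h:mm a');
    -- 2:30 PM

    -- More than one letter is expected to fail.
    select date_format(timestamp '2020-07-07 14:30:00', 'h:mm aa');
    ```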

  • Zone ID(V): This outputs the time-zone ID. Pattern letter count must be 2.

  • Zone names(z): This outputs the display textual name of the time-zone ID. If the count of letters is one, two, or three, the short name is output. If the count of letters is four, the full name is output. Five or more letters will fail.
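
    The zone ID and zone name letters can be sketched as below. This assumes the session time zone is set through spark.sql.session.timeZone and that the locale is English; the chosen zone and date are illustrative.

    ```sql
    -- Assumption: date_format renders the zone of the current session.
    set spark.sql.session.timeZone=America/Los_Angeles;

    select date_format(timestamp '2020-01-01 12:00:00', 'VV');
    -- America/Los_Angeles

    select date_format(timestamp '2020-01-01 12:00:00', 'z');
    -- PST

    select date_format(timestamp '2020-01-01 12:00:00', 'zzzz');
    -- Pacific Standard Time
    ```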

  • Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the hour, such as '+01', unless the minute is non-zero, in which case the minute is also output, such as '+0130'. Two letters output the hour and minute, without a colon, such as '+0130'. Three letters output the hour and minute, with a colon, such as '+01:30'. Four letters output the hour, minute, and optional second, without a colon, such as '+013015'. Five letters output the hour, minute, and optional second, with a colon, such as '+01:30:15'. Six or more letters will fail. Pattern letter 'X' (upper case) outputs 'Z' when the offset to be output is zero, whereas pattern letter 'x' (lower case) outputs '+00', '+0000', or '+00:00'.
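
    A sketch of the X and x rules, assuming the session time zone is switched through spark.sql.session.timeZone. Asia/Kolkata is picked only because its offset has a non-zero minute part.

    ```sql
    set spark.sql.session.timeZone=Asia/Kolkata;
    -- One letter: the minute is non-zero, so it is output as well.
    select date_format(timestamp '2020-01-01 12:00:00', 'X');    -- +0530
    select date_format(timestamp '2020-01-01 12:00:00', 'XX');   -- +0530
    select date_format(timestamp '2020-01-01 12:00:00', 'XXX');  -- +05:30

    set spark.sql.session.timeZone=UTC;
    -- For a zero offset, X prints Z while x prints a numeric offset.
    select date_format(timestamp '2020-01-01 12:00:00', 'X');    -- Z
    select date_format(timestamp '2020-01-01 12:00:00', 'x');    -- +00
    select date_format(timestamp '2020-01-01 12:00:00', 'xx');   -- +0000
    select date_format(timestamp '2020-01-01 12:00:00', 'xxx');  -- +00:00
    ```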

  • Offset O: This formats the localized offset based on the number of pattern letters. One letter outputs the short form of the localized offset, which is the localized offset text, such as 'GMT', with the hour without a leading zero, an optional 2-digit minute and second if non-zero, and colons, for example 'GMT+8'. Four letters output the full form, which is the localized offset text, such as 'GMT', with 2-digit hour and minute fields, an optional second field if non-zero, and colons, for example 'GMT+08:00'. Any other count of letters will fail.
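
    The two accepted letter counts for O can be sketched as follows, assuming an English locale and a session time zone set through spark.sql.session.timeZone.

    ```sql
    set spark.sql.session.timeZone=America/Los_Angeles;

    -- Short form: no leading zero on the hour, minutes omitted when zero.
    select date_format(timestamp '2020-01-01 12:00:00', 'O');
    -- GMT-8

    -- Full form: 2-digit hour and minute with a colon.
    select date_format(timestamp '2020-01-01 12:00:00', 'OOOO');
    -- GMT-08:00

    -- Any other count, such as two letters, is expected to fail.
    select date_format(timestamp '2020-01-01 12:00:00', 'OO');
    ```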

  • Offset Z: This formats the offset based on the number of pattern letters. One, two, or three letters output the hour and minute, without a colon, such as '+0130'. The output is '+0000' when the offset is zero. Four letters output the full form of the localized offset, equivalent to four letters of Offset-O. The output is the corresponding localized offset text if the offset is zero. Five letters output the hour and minute, with an optional second if non-zero, with colons. The output is 'Z' if the offset is zero. Six or more letters will fail.
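
    A sketch of the Z rules under the same assumptions (English locale, session time zone set through spark.sql.session.timeZone):

    ```sql
    set spark.sql.session.timeZone=America/Los_Angeles;
    select date_format(timestamp '2020-01-01 12:00:00', 'Z');      -- -0800
    select date_format(timestamp '2020-01-01 12:00:00', 'ZZZZ');   -- GMT-08:00
    select date_format(timestamp '2020-01-01 12:00:00', 'ZZZZZ');  -- -08:00

    set spark.sql.session.timeZone=UTC;
    -- Zero offset: numeric, localized text, and Z respectively.
    select date_format(timestamp '2020-01-01 12:00:00', 'Z');      -- +0000
    select date_format(timestamp '2020-01-01 12:00:00', 'ZZZZ');   -- GMT
    select date_format(timestamp '2020-01-01 12:00:00', 'ZZZZZ');  -- Z
    ```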

  • Optional section start and end: Use [] to define an optional section, which may be nested. During formatting, all valid data is output even if it is in the optional section. During parsing, the whole section may be missing from the parsed string. An optional section is started by [ and ended by ] (or by the end of the pattern).
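
    As a sketch, a single pattern with an optional time section can parse strings with or without the time part. The input strings are invented for illustration.

    ```sql
    -- The optional section is absent, so the time fields default to zero.
    select to_timestamp('2020-07-07', 'yyyy-MM-dd[ HH:mm:ss]');
    -- 2020-07-07 00:00:00

    -- The optional section is present and is parsed.
    select to_timestamp('2020-07-07 14:30:55', 'yyyy-MM-dd[ HH:mm:ss]');
    -- 2020-07-07 14:30:55

    -- When formatting, the optional section is output because the data is available.
    select date_format(timestamp '2020-07-07 14:30:55', 'yyyy-MM-dd[ HH:mm:ss]');
    -- 2020-07-07 14:30:55
    ```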

  • The symbols 'E', 'F', 'q', and 'Q' can only be used for datetime formatting, for example with date_format. They are not allowed for datetime parsing, for example with to_timestamp.
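
    The asymmetry can be sketched with one formatting call and one parsing call. The result comment assumes an English locale, and the exact error raised by the second query is not shown because it may vary by version and configuration.

    ```sql
    -- Formatting with E and Q is allowed.
    select date_format(date '2020-07-07', 'EEEE, QQQ yyyy');
    -- Tuesday, Q3 2020

    -- Parsing with E is not allowed, so this query is expected to be rejected.
    select to_timestamp('Tuesday 2020-07-07', 'EEEE yyyy-MM-dd');
    ```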