Azure Stream Analytics custom blob output partitioning

Azure Stream Analytics supports custom blob output partitioning with custom fields or attributes and custom DateTime path patterns.

Custom fields or attributes

Custom fields or input attributes improve downstream data-processing and reporting workflows by allowing more control over the output.

Partition key options

The partition key, or column name, used to partition input data may contain alphanumeric characters with hyphens, underscores, and spaces. Nested fields cannot be used as a partition key unless they are aliased in the job query. The partition key must be of type NVARCHAR(MAX).
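
For instance, if the identifier you want to partition on sits inside a nested record, you can surface it with an alias in the job query. A minimal sketch in the Stream Analytics query language, assuming a hypothetical input whose events carry a nested client record (the input, output, and field names are illustrative):

    -- Alias a nested field so it can serve as a partition key,
    -- casting it to NVARCHAR(MAX) to satisfy the partition key type requirement.
    SELECT
        CAST(e.client.id AS NVARCHAR(MAX)) AS client_id,
        e.score AS score
    INTO [blob-output]
    FROM [game-input] e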

Example

Suppose a job takes input data from live user sessions connected to an external video game service, where the ingested data contains a column client_id that identifies each session. To partition the data by client_id, set the Blob Path Pattern field to include a partition token, {client_id}, in the blob output properties when creating the job. As data with various client_id values flows through the Stream Analytics job, the output data is saved into separate folders based on a single client_id value per folder.

(Image: path pattern containing the client ID)

Similarly, if the job input were sensor data from millions of sensors, where each sensor has a sensor_id, the Path Pattern would be {sensor_id} to partition each sensor's data into a different folder.

Using the REST API, the output section of a JSON file used for that request may look like the following:

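A sketch of what that output section might look like; the storage account placeholders and the serialization settings are assumptions, while the container (clients) and the path pattern ({client_id}) follow the example above:

    "outputs": [
        {
            "name": "blob-output",
            "properties": {
                "datasource": {
                    "type": "Microsoft.Storage/Blob",
                    "properties": {
                        "storageAccounts": [
                            {
                                "accountName": "<storage-account-name>",
                                "accountKey": "<storage-account-key>"
                            }
                        ],
                        "container": "clients",
                        "pathPattern": "{client_id}"
                    }
                },
                "serialization": {
                    "type": "Json",
                    "properties": {
                        "encoding": "UTF8",
                        "format": "LineSeparated"
                    }
                }
            }
        }
    ]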

Once the job starts running, the clients container may look like the following:

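A sketch of that folder layout; every folder name except 06000000, which is examined next, is illustrative:

    clients/
        02000000/
        06000000/
        09000000/
        ...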

Each folder may contain multiple blobs, where each blob contains one or more records. In the above example, there is a single blob in the folder labeled "06000000" with the following contents:

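A hypothetical sample of the blob's contents, assuming line-separated JSON output; every field except client_id is illustrative:

    {"client_id":"06000000","name":"Hannah","score":87}
    {"client_id":"06000000","name":"Ben","score":92}
    {"client_id":"06000000","name":"Vlad","score":77}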

Notice that each record in the blob has a client_id column matching the folder name, since the column used to partition the output in the output path was client_id.

Limitations

  1. Only one custom partition key is permitted in the Path Pattern blob output property. All of the following path patterns are valid:

    • cluster1/{date}/{aFieldInMyData}
    • cluster1/{time}/{aFieldInMyData}
    • cluster1/{aFieldInMyData}
    • cluster1/{date}/{time}/{aFieldInMyData}
  2. Partition keys are case insensitive, so partition keys like "John" and "john" are equivalent. Also, expressions cannot be used as partition keys; for example, {columnA + columnB} does not work.

  3. When an input stream consists of records with a partition key cardinality under 8000, records are appended to existing blobs, and new blobs are created only when necessary. When the cardinality is over 8000, there is no guarantee that existing blobs will be written to, and new blobs may be created for an arbitrary number of records with the same partition key.

Custom DateTime path patterns

Custom DateTime path patterns allow you to specify an output format that aligns with Hive Streaming conventions, giving Azure Stream Analytics the ability to send data to Azure HDInsight and Azure Databricks for downstream processing. Custom DateTime path patterns are easily implemented using the datetime keyword in the Path Prefix field of your blob output, along with a format specifier. For example, {datetime:yyyy}.

Supported tokens

The following format specifier tokens can be used alone or in combination to achieve custom DateTime formats:

Format specifier    Description                              Result on example time 2018-01-02T10:06:08
{datetime:yyyy}     The year as a four-digit number          2018
{datetime:MM}       Month from 01 to 12                      01
{datetime:M}        Month from 1 to 12                       1
{datetime:dd}       Day from 01 to 31                        02
{datetime:d}        Day from 1 to 31                         2
{datetime:HH}       Hour in 24-hour format, from 00 to 23    10
{datetime:mm}       Minutes from 00 to 59                    06
{datetime:m}        Minutes from 0 to 59                     6
{datetime:ss}       Seconds from 00 to 59                    08
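
These tokens can be combined into a full prefix. For example, for the same sample time 2018-01-02T10:06:08, a Path Prefix of:

    {datetime:yyyy}/{datetime:MM}/{datetime:dd}/{datetime:HH}

produces the folder path 2018/01/02/10.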

If you do not wish to use custom DateTime patterns, you can add the {date} and/or {time} token to the Path Prefix to generate a dropdown with built-in DateTime formats.

(Image: built-in DateTime formats in Stream Analytics)

Extensibility and restrictions

You can use as many {datetime:<specifier>} tokens as you like in the path pattern, up to the Path Prefix character limit. Format specifiers can't be combined within a single token beyond the combinations already listed by the date and time dropdowns.

For a path partition of logs/MM/dd:

Valid expression                     Invalid expression
logs/{datetime:MM}/{datetime:dd}     logs/{datetime:MM/dd}

You may use the same format specifier multiple times in the Path Prefix; the token must be repeated each time.
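
For example, the following Path Prefix repeats the month and day tokens and, for the sample time above, produces 2018/01/02/01-02:

    {datetime:yyyy}/{datetime:MM}/{datetime:dd}/{datetime:MM}-{datetime:dd}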

Hive Streaming conventions

Custom path patterns for blob storage can be used with the Hive Streaming convention, which expects folders to be labeled with column= in the folder name.

For example, year={datetime:yyyy}/month={datetime:MM}/day={datetime:dd}/hour={datetime:HH}.

Custom output eliminates the hassle of altering tables and manually adding partitions to port data between Azure Stream Analytics and Hive. Instead, many folders can be added automatically by running MSCK REPAIR TABLE while hive.exec.dynamic.partition is set to true:

SET hive.exec.dynamic.partition=true;
MSCK REPAIR TABLE <table_name>;
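
For context, a minimal sketch of a Hive external table matching the year=/month=/day=/hour= layout above, assuming CSV-serialized output; the table name, columns, and storage location are illustrative assumptions:

    -- Hypothetical external table over the partitioned blob output
    CREATE EXTERNAL TABLE game_events (
        client_id STRING,
        score INT
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'wasb://<container>@<account>.blob.core.windows.net/';

Running MSCK REPAIR TABLE game_events then registers each year=/month=/day=/hour= folder as a partition without adding them one by one.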

Example

Create a storage account, a resource group, a Stream Analytics job, and an input source according to the Azure Stream Analytics Azure portal quickstart guide. Use the same sample data used in the quickstart guide, which is also available on GitHub.

Create a blob output sink with the following configuration:

(Image: creating a blob output sink in Stream Analytics)

The full path pattern is as follows:

year={datetime:yyyy}/month={datetime:MM}/day={datetime:dd}

When you start the job, a folder structure based on the path pattern is created in your blob container. You can drill down to the day level.
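
For instance, for events timestamped on 2018-01-02 (an illustrative date), drilling down leads to:

    year=2018/month=01/day=02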

(Image: Stream Analytics blob output with custom path pattern)

Next steps