Data Storage

This article describes data storage in Azure Time Series Insights Gen2. It covers warm and cold storage, data availability, and best practices.

Provisioning

When you create an Azure Time Series Insights Gen2 environment, you have the following options:

  • Cold data storage:
    • Create a new Azure Storage resource in the subscription and region you've chosen for your environment.
    • Attach a pre-existing Azure Storage account. This option is only available when you deploy from an Azure Resource Manager template, and it isn't visible in the Azure portal (see the sketch after this list).
  • Warm data storage:
    • A warm store is optional, and it can be enabled or disabled during or after provisioning. If you decide to enable warm store at a later time and there is already data in your cold store, review the warm store behavior section below to understand the expected behavior. The warm store data retention period can be configured for 7 to 31 days and can be adjusted as needed.
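
The fragment below is a minimal sketch of how these choices appear when you deploy from an Azure Resource Manager template, expressed as a Python dictionary for illustration. The storage account name, key, and Time Series ID property are placeholders, and the key names reflect the commonly documented Gen2 environment schema rather than an authoritative template.

    # Illustrative ARM "properties" for a Gen2 environment, shown as a Python dict.
    # All values are placeholders; verify key names against the ARM template reference.
    gen2_environment_properties = {
        "storageConfiguration": {                 # cold store: storage account attached at creation
            "accountName": "<storage-account-name>",
            "managementKey": "<storage-account-key>",
        },
        "warmStoreConfiguration": {               # optional warm store
            "dataRetention": "P7D",               # ISO 8601 duration, 7 to 31 days
        },
        "timeSeriesIdProperties": [
            {"name": "deviceId", "type": "String"},   # hypothetical Time Series ID property
        ],
    }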

When an event is ingested, it is indexed in both warm store (if enabled) and cold store.

Storage overview

Warning

As the owner of the Azure Blob storage account where cold store data resides, you have full access to all data in the account. This access includes write and delete permissions. Don't edit or delete the data that Azure Time Series Insights Gen2 writes, because doing so can cause data loss.

Data availability

Azure Time Series Insights Gen2 partitions and indexes data for optimal query performance. Data becomes available to query from both warm store (if enabled) and cold store after it's indexed. The amount of data being ingested and the per-partition throughput rate can affect availability. Review the event source throughput limitations and best practices for best performance. You can also configure a lag alert to be notified if your environment is having trouble processing data.

Important

You might experience a period of up to 60 seconds before data becomes available. If you experience significant latency beyond 60 seconds, submit a support ticket through the Azure portal.

Warm store

Data in your warm store is available only via the Time Series Query APIs, the Azure Time Series Insights TSI Explorer, or the Power BI Connector. Warm store queries are free and have no quota, but there is a limit of 30 concurrent requests.

Warm store behavior

  • When enabled, all data streamed into your environment is routed to your warm store, regardless of the event timestamp. Note that the streaming ingestion pipeline is built for near-real-time streaming; ingesting historical events is not supported.

  • The retention period is calculated from when the event was indexed in warm store, not from the event timestamp. This means that data is no longer available in warm store after the retention period has elapsed, even if the event timestamp is in the future (see the sketch after this list).

    • Example: an event containing a 10-day weather forecast is ingested and indexed in a warm store configured with a 7-day retention period. After 7 days, the forecast is no longer accessible in warm store, but it can still be queried from cold store.
  • If you enable warm store on an existing environment that already has recent data indexed in cold storage, the warm store will not be backfilled with that data.

  • If you just enabled warm store and are having trouble viewing your recent data in the Explorer, you can temporarily toggle warm store queries off:

    Disable warm queries
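
The sketch below illustrates the retention rule described in this list: availability in warm store is computed from the time an event was indexed, not from the event's own timestamp. The retention value and timestamps are hypothetical.

    from datetime import datetime, timedelta, timezone

    WARM_RETENTION_DAYS = 7  # configurable between 7 and 31 days

    def available_in_warm_store(indexed_at: datetime, now: datetime) -> bool:
        """Warm store availability depends only on when the event was indexed."""
        return now < indexed_at + timedelta(days=WARM_RETENTION_DAYS)

    # A forecast event carrying a future timestamp still expires 7 days after it was indexed.
    indexed_at = datetime(2023, 1, 1, tzinfo=timezone.utc)
    print(available_in_warm_store(indexed_at, now=datetime(2023, 1, 5, tzinfo=timezone.utc)))   # True
    print(available_in_warm_store(indexed_at, now=datetime(2023, 1, 9, tzinfo=timezone.utc)))   # False: query cold store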

Cold store

This section describes the Azure Storage details relevant to Azure Time Series Insights Gen2.

For a thorough description of Azure Blob storage, read the Storage blobs introduction.

Your cold storage account

Azure Time Series Insights Gen2 retains up to two copies of each event in your Azure Storage account. One copy stores events ordered by ingestion time, which always allows access to events in a time-ordered sequence. Over time, Azure Time Series Insights Gen2 also creates a repartitioned copy of the data, optimized for performant queries.

All of your data is stored indefinitely in your Azure Storage account.

Writing and editing blobs

To ensure query performance and data availability, don't edit or delete any blobs that Azure Time Series Insights Gen2 creates.

Accessing cold store data

In addition to accessing your data from the Azure Time Series Insights Explorer and Time Series Query APIs, you may also want to access your data directly from the Parquet files stored in the cold store. For example, you can read, transform, and cleanse data in a Jupyter notebook, then use it to train your Azure Machine Learning model in the same Spark workflow.

To access data directly from your Azure Storage account, you need read access to the account used to store your Azure Time Series Insights Gen2 data. You can then read selected data based on the creation time of the Parquet files located in the PT=Time folder, described below in the Parquet file format section.
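
As a rough sketch of this kind of direct access, the snippet below lists blobs under the V=1/PT=Time/ prefix and loads each file into a pandas DataFrame. The container name and connection string are placeholders, and the choice of the azure-storage-blob, pandas, and pyarrow packages is an assumption for illustration; in a Spark workflow you would typically point spark.read.parquet at the same paths instead.

    import io
    import pandas as pd                            # requires pandas and pyarrow
    from azure.storage.blob import BlobServiceClient

    # Placeholder connection details: read access to the environment's storage account.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = service.get_container_client("<environment-container-name>")

    # Blobs written roughly in order of arrival live under the PT=Time folder.
    for blob in container.list_blobs(name_starts_with="V=1/PT=Time/"):
        data = container.download_blob(blob.name).readall()
        df = pd.read_parquet(io.BytesIO(data))
        print(blob.name, df.shape)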

Data deletion

Don't delete your Azure Time Series Insights Gen2 files. Manage related data only from within Azure Time Series Insights Gen2.

Parquet file format and folder structure

Parquet is an open-source columnar file format designed for efficient storage and performance. Azure Time Series Insights Gen2 uses Parquet to enable Time Series ID-based query performance at scale.

For more information about the Parquet file type, read the Parquet documentation.

Azure Time Series Insights Gen2 stores copies of your data as follows:

  • The first, initial copy is partitioned by ingestion time and stores data roughly in order of arrival. This data resides in the PT=Time folder:

    V=1/PT=Time/Y=<YYYY>/M=<MM>/<YYYYMMDDHHMMSSfff>_<TSI_INTERNAL_SUFFIX>.parquet

  • The second, repartitioned copy is grouped by Time Series IDs and resides in the PT=TsId folder:

    V=1/PT=TsId/<TSI_INTERNAL_NAME>.parquet

The timestamp in the blob names in the PT=Time folder corresponds to the time the data arrived in Azure Time Series Insights Gen2, not to the timestamps of the events.

Data in the PT=TsId folder is continually optimized for query and is not static. During repartitioning, some events might be present in multiple blobs, and the naming of the blobs in this folder is not guaranteed to remain the same.

In general, if you need to access data directly via Parquet files, use the PT=Time folder. Future functionality will enable efficient access to the PT=TsId folder.

Note

  • <YYYY> maps to a four-digit year representation.
  • <MM> maps to a two-digit month representation.
  • <YYYYMMDDHHMMSSfff> maps to a timestamp representation with a four-digit year (YYYY), two-digit month (MM), two-digit day (DD), two-digit hour (HH), two-digit minute (MM), two-digit second (SS), and three-digit millisecond (fff). A sketch for parsing this value follows this list.
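
Because the blob-name timestamp reflects arrival time rather than event time, you can select PT=Time blobs for an arrival window by parsing the file name. A minimal sketch, assuming a name that follows the pattern above:

    from datetime import datetime, timezone

    def arrival_time_from_blob_name(blob_name: str) -> datetime:
        """Parse the <YYYYMMDDHHMMSSfff> prefix of a PT=Time blob file name (arrival time, UTC)."""
        file_name = blob_name.rsplit("/", 1)[-1]     # e.g. "20230105211500123_<suffix>.parquet"
        stamp = file_name.split("_", 1)[0]           # "20230105211500123"
        return datetime.strptime(stamp, "%Y%m%d%H%M%S%f").replace(tzinfo=timezone.utc)

    # Hypothetical blob name following the documented pattern:
    print(arrival_time_from_blob_name("V=1/PT=Time/Y=2023/M=01/20230105211500123_example.parquet"))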

Azure Time Series Insights Gen2 events are mapped to Parquet file contents as follows:

  • Each event maps to a single row.
  • Every row includes the timestamp column with an event timestamp. The timestamp property is never null. It defaults to the event enqueued time if the timestamp property isn't specified in the event source. The stored timestamp is always in UTC.
  • Every row includes the Time Series ID (TSID) column(s) as defined when the Azure Time Series Insights Gen2 environment was created. The TSID property name includes the _string suffix.
  • All other properties sent as telemetry data are mapped to column names that end with _bool (Boolean), _datetime (timestamp), _long (long), _double (double), _string (string), or dynamic (dynamic), depending on the property type. For more information, read about supported data types. A sketch of this column naming follows this list.
  • This mapping schema applies to the first version of the file format, referred to as V=1 and stored in the base folder of the same name. As this feature evolves, the mapping schema might change and the reference name incremented.
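
To make the column naming concrete, here is a sketch of how a hypothetical telemetry event might surface as Parquet columns, plus a small helper that groups loaded columns by type suffix. The event shape, TSID name, and property names are invented for illustration.

    import pandas as pd

    # Hypothetical ingested event:
    #   {"deviceId": "sensor-42", "timestamp": "2023-01-05T21:15:00Z",
    #    "temperature": 21.7, "isOnline": true}
    # Expected Parquet columns under the mapping above:
    #   timestamp, deviceId_string, temperature_double, isOnline_bool

    def columns_by_type(df: pd.DataFrame) -> dict:
        """Group column names by their type suffix."""
        suffixes = ("_bool", "_datetime", "_long", "_double", "_string")
        grouped = {s: [c for c in df.columns if c.endswith(s)] for s in suffixes}
        grouped["other"] = [c for c in df.columns if not c.endswith(suffixes)]
        return grouped

    df = pd.DataFrame(columns=["timestamp", "deviceId_string", "temperature_double", "isOnline_bool"])
    print(columns_by_type(df))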

Next steps