Continuous data export overview

This article describes how to continuously export data from Kusto to an external table by using a periodically run query. The results are stored in the external table, which defines the destination, such as Azure Blob Storage, and the schema of the exported data. This process guarantees that all records are exported "exactly once", with some exceptions.

To enable continuous data export, create an external table and then create a continuous export definition that points to the external table.
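As a minimal sketch of these two steps, the commands below first define an external table over blob storage and then create a continuous export that targets it. The names MyExport, ExternalBlob, and T follow the example later in this article; the column list, the storage URI, and the one-hour interval are placeholders chosen for illustration, and T is assumed to have exactly the two columns shown.

```kusto
// Step 1: define the export target (destination and schema).
// The columns and the connection string are placeholders.
.create external table ExternalBlob (Timestamp:datetime, Message:string)
kind=storage
dataformat=csv
(
    h@'https://<storage-account>.blob.core.windows.net/<container>;<secret-key>'
)

// Step 2: define the continuous export that writes to that table.
// Run this as a separate command after step 1 succeeds.
.create-or-alter continuous-export MyExport
over (T)
to table ExternalBlob
with (intervalBetweenRuns=1h)
<| T
```

The query after `<|` is the export query; its output schema has to match the columns declared on the external table.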

Note

All continuous export commands require database admin permissions.

Continuous export guidelines

  • Output schema:
    • The output schema of the export query must match the schema of the external table to which you export.
  • Frequency:
    • Continuous export runs according to the time period configured for it in the intervalBetweenRuns property. The recommended value for this interval is at least several minutes, depending on the latencies you're willing to accept. The time interval can be as low as one minute, if the ingestion rate is high.
  • Distribution:
    • The default distribution in continuous export is per_node, in which all nodes export concurrently.
    • This setting can be overridden in the properties of the continuous export create command. Use per_shard distribution to increase concurrency.

      Note

      The per_shard distribution increases the load on the storage account(s) and may hit throttling limits.

    • Use single (or distributed=false) to disable distribution altogether. This setting may significantly slow down the continuous export process, and it affects the number of files created in each continuous export iteration.
  • Number of files:
    • The number of files exported in each continuous export iteration depends on how the external table is partitioned. For more information, see export to external table command. Each continuous export iteration always writes to new files, and never appends to existing ones. As a result, the number of exported files also depends on how frequently the continuous export runs, which is set by the intervalBetweenRuns property.
  • Location:
    • For best performance, the Azure Data Explorer cluster and the storage account(s) should be colocated in the same Azure region.
    • If the exported data volume is large, it's recommended to configure multiple storage accounts for the external table, to avoid storage throttling. See export data to storage.
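The frequency and distribution guidelines above map to properties of the continuous export create command. The following is a sketch only, reusing the MyExport, ExternalBlob, and T names from the example later in this article; the five-minute interval is an arbitrary illustrative value, and distributed=false is the setting the guidelines describe as equivalent to the single distribution.

```kusto
// Run every five minutes and disable distribution, trading export
// speed for fewer files per iteration.
.create-or-alter continuous-export MyExport
over (T)
to table ExternalBlob
with (intervalBetweenRuns=5m, distributed=false)
<| T
```

Leaving out the distributed property keeps the default per_node behavior.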

Exactly once export

To guarantee "exactly once" export, continuous export uses database cursors. The IngestionTime policy must be enabled on all tables referenced in the query that should be processed "exactly once" in the export. The policy is enabled by default on all newly created tables.

The guarantee for "exactly once" export applies only to files reported in the show exported artifacts command. Continuous export doesn't guarantee that each record is written only once to the external table. If a failure occurs after export has begun and some of the artifacts were already written to the external table, the external table may contain duplicates. If a write operation was aborted before completion, the external table may contain corrupted files. In such cases, artifacts aren't deleted from the external table, but they aren't reported in the show exported artifacts command. Consuming only the files listed by the show exported artifacts command guarantees no duplications and no corruptions.
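For illustration, the reported artifacts of a continuous export can be listed with a command along these lines. MyExport is the export name used later in this article, and the one-hour window and the Timestamp column used for filtering are assumptions for this sketch.

```kusto
// List the artifacts that MyExport reported during the last hour.
// Only files that appear in this output carry the "exactly once" guarantee.
.show continuous-export MyExport exported-artifacts
| where Timestamp > ago(1h)
```

A downstream consumer that reads only the paths returned here, rather than listing the storage container directly, avoids picking up duplicate or partially written files.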

Export to fact and dimension tables

By default, all tables referenced in the export query are assumed to be fact tables. As such, they're scoped to the database cursor. The syntax explicitly declares which tables are scoped (fact) and which aren't scoped (dimension). See the over parameter in the create command for details.

The export query includes only the fact table records that were ingested since the previous export execution. The export query may also reference dimension tables, in which case all records of the dimension table are included in every export query. When using joins between fact and dimension tables in continuous export, keep in mind that records in the fact table are processed only once. If the export runs while records in the dimension tables are missing for some keys, records for the respective keys will either be missed or include null values for the dimension columns in the exported files. Whether records are missed or exported with nulls depends on whether the query uses an inner or an outer join. The forcedLatency property in the continuous export definition can be useful in such cases, where the fact and dimension table records for matching keys are ingested at around the same time.

Note

Continuous export of only dimension tables isn't supported. The export query must include at least a single fact table.
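The fact and dimension distinction can be sketched as follows. FactEvents, DimDevices, and DeviceId are hypothetical names introduced for this example, and the one-hour interval and ten-minute latency are arbitrary values. Only FactEvents is listed in the over parameter, so it is the only table scoped to the database cursor; DimDevices is treated as a dimension table and is read in full on every run.

```kusto
// FactEvents is scoped (fact); DimDevices is not scoped (dimension).
// forcedLatency delays processing so that matching dimension records
// have a chance to arrive before the fact records are exported.
.create-or-alter continuous-export MyExport
over (FactEvents)
to table ExternalBlob
with (intervalBetweenRuns=1h, forcedLatency=10m)
<| FactEvents
   | join kind=leftouter DimDevices on DeviceId
```

With the left outer join shown here, fact records without a matching dimension record are exported with null dimension columns; with an inner join they would be dropped instead. The output schema of the join still has to match the schema of ExternalBlob.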

Exporting historical data

Continuous export starts exporting data only from the point of its creation. Records ingested before that time should be exported separately using the non-continuous export command.

Historical data may be too large to be exported in a single export command. If needed, partition the query into several smaller batches.

To avoid duplicates with data exported by continuous export, use the StartCursor value returned by the show continuous export command, and export only records where cursor_before_or_at the cursor value. For example:

.show continuous-export MyExport | project StartCursor

StartCursor
636751928823156645

Followed by:

.export async to table ExternalBlob
<| T | where cursor_before_or_at("636751928823156645")
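When the historical data is too large for one command, the same cursor filter can be combined with a range filter to split the export into batches. The sketch below assumes T has a datetime column named Timestamp, which is a hypothetical column for this example, and the date ranges are arbitrary.

```kusto
// Batch 1: January records ingested before the continuous export started.
.export async to table ExternalBlob
<| T
   | where cursor_before_or_at("636751928823156645")
   | where Timestamp >= datetime(2023-01-01) and Timestamp < datetime(2023-02-01)

// Batch 2: February records, run as a separate command with the same cursor.
.export async to table ExternalBlob
<| T
   | where cursor_before_or_at("636751928823156645")
   | where Timestamp >= datetime(2023-02-01) and Timestamp < datetime(2023-03-01)
```

The ranges are half-open (inclusive start, exclusive end) so that a record on a boundary lands in exactly one batch.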

Resource consumption

  • The impact of the continuous export on the cluster depends on the query the continuous export is running. Most resources, such as CPU and memory, are consumed by the query execution.
  • The number of export operations that can run concurrently is limited by the cluster's data export capacity. For more information, see throttling. If the cluster doesn't have sufficient capacity to handle all continuous exports, some will start lagging behind.
  • The show commands-and-queries command can be used to estimate resource consumption.
    • Filter on | where ClientActivityId startswith "RunContinuousExports" to view the commands and queries associated with continuous export.
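Put together, the filter above might be used as in the following sketch. The one-day window is arbitrary, and the projected column names (StartedOn, CommandType, State, Duration, TotalCpu, MemoryPeak) are assumptions about the output of show commands-and-queries; adjust them to the columns your cluster returns.

```kusto
// Recent commands and queries issued by continuous export runs,
// with the columns most relevant to resource usage.
.show commands-and-queries
| where ClientActivityId startswith "RunContinuousExports"
| where StartedOn > ago(1d)
| project StartedOn, CommandType, State, Duration, TotalCpu, MemoryPeak
| order by StartedOn desc
```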

Limitations

  • Continuous export doesn't work for data ingested using streaming ingestion.
  • Continuous export can't be configured on a table on which a Row Level Security policy is enabled.
  • Continuous export isn't supported for external tables with impersonate in their connection strings.
  • Continuous export doesn't support cross-database and cross-cluster calls.
  • Continuous export isn't designed for constantly streaming data out of Azure Data Explorer. Continuous export runs in a distributed mode, where all nodes export concurrently. If the range of data queried by each run is small, the output of the continuous export is many small artifacts. The number of artifacts depends on the number of nodes in the cluster.