Kusto ingest client library - Best practices

Select the right IngestClient flavor

Use KustoQueuedIngestClient; it's the recommended native data ingestion mode. Here's why:

  • Direct ingestion is impossible during engine downtime, such as during deployment. In queued ingestion mode, requests are persisted to an Azure queue, and the Data Management service retries them as needed.
  • The Data Management service keeps the engine from being overloaded with ingestion requests. Overriding this control, for example by using direct ingestion, can severely affect engine ingestion and query performance.
  • The Data Management service aggregates multiple ingestion requests. The aggregation optimizes the size of the initial shard (extent) to be created.
  • Getting feedback about each ingestion is easy.

Avoid tracking ingest operation status

Tracking ingest operation status is useful. However, for high-volume data streams, avoid turning on positive notifications for every ingestion request. Such tracking puts an extreme load on the underlying xStore resources, which can increase ingestion latency and even leave the cluster completely unresponsive.

Optimizing for throughput

Ingestion works best when done in large chunks:

  • It consumes the least resources.
  • It produces the most COGS-optimized data shards (COGS being the cost of goods sold) and results in the best data transactions.

We recommend that customers who ingest data with the Kusto.Ingest library, or directly into the engine, send data in batches of 100 MB to 1 GB (uncompressed).

  • The upper limit is important when working directly with the engine, because it helps reduce the amount of memory used by the ingestion process.

Note

When you use the KustoQueuedIngestClient class, overly large blocks of data are split into smaller chunks, and these chunks are aggregated, to a certain degree, before they reach the engine.

  • The lower limit on ingested data size is also important, although less critical. Ingesting data in small batches every now and then is perfectly fine, although slightly less efficient than using large batches. The KustoQueuedIngestClient class also solves the problem for customers who need to ingest large amounts of data and can't batch it into large chunks before sending it to the engine.
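Client-side batching by uncompressed size can be sketched as follows. This is a minimal illustration in plain Python, not part of the Kusto.Ingest library: the helper name batch_by_size and the constant TARGET_BATCH_BYTES are hypothetical, and reading 1 GB as 10^9 bytes is an assumption.

```python
from typing import Iterable, Iterator, List

# Upper end of the recommended 100 MB to 1 GB range.
# Assumption: decimal units (1 GB = 10**9 bytes).
TARGET_BATCH_BYTES = 1_000_000_000


def batch_by_size(
    records: Iterable[bytes], target_bytes: int = TARGET_BATCH_BYTES
) -> Iterator[List[bytes]]:
    """Group records into batches whose total uncompressed size is at most target_bytes.

    A single record larger than target_bytes is emitted as its own batch.
    """
    batch: List[bytes] = []
    batch_size = 0
    for record in records:
        # Close the current batch if adding this record would exceed the target.
        if batch and batch_size + len(record) > target_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(record)
        batch_size += len(record)
    if batch:
        yield batch


# Tiny target so the behavior is visible: three 4-byte rows, 8-byte batches.
rows = [b"a,1\n", b"b,2\n", b"c,3\n"]
batches = list(batch_by_size(rows, target_bytes=8))
# Two batches: the first holds two rows, the second holds one.
```

Each batch would then be sent as a single ingestion request. How the batch is handed to the ingest client is outside the scope of this sketch.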

Other factors that impact ingestion throughput

Multiple factors can affect ingestion throughput. When planning your ingestion pipeline, make sure to evaluate the following points, which can have significant implications for your COGS.

Factor for consideration | Description
Data format | CSV is the fastest format to ingest. JSON typically takes two to three times longer for the same volume of data.
Table width | Make sure that you ingest only the data you really need. The wider the table, the more columns need to be encoded and indexed, and the lower the throughput. You can control which fields get ingested by providing an ingestion mapping.
Source data location | Avoid cross-region reads to speed up ingestion.
Load on the cluster | When a cluster experiences a high query load, ingestions take longer to complete, reducing throughput.
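The first two rows of the table can be illustrated with a client-side pre-processing step. The sketch below is an assumption-laden alternative to an ingestion mapping, not a replacement for it: it converts JSON-lines records to CSV and keeps only the columns you name. The function name project_json_to_csv and the sample field names are hypothetical.

```python
import csv
import io
import json
from typing import Iterable, Sequence


def project_json_to_csv(json_lines: Iterable[str], columns: Sequence[str]) -> str:
    """Convert JSON-lines records to CSV text, keeping only the named columns, in order.

    Fields missing from a record are written as empty values.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        record = json.loads(line)
        writer.writerow([record.get(name, "") for name in columns])
    return out.getvalue()


# Hypothetical input: the wide "payload" field is dropped before ingestion.
source = ['{"ts": "2020-01-01", "level": "info", "payload": "xyz"}']
csv_text = project_json_to_csv(source, ["ts", "level"])
```

Whether this pre-processing pays off depends on where the conversion cost is cheaper for you: on the client or on the cluster.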

Optimizing for COGS

Using Kusto client libraries to ingest data into Azure Data Explorer remains the cheapest and most robust option. We urge customers to review their ingestion methods and to take advantage of Azure Storage pricing, which makes blob transactions significantly more cost-effective.

  • Limit the number of ingested data chunks. For better control of your Azure Data Explorer ingestion costs and to reduce your monthly bill, limit the number of ingested data chunks (files/blobs/streams).
  • Ingest large chunks of data (up to 1 GB of uncompressed data). Many teams attempt to achieve low latency by ingesting tens of millions of tiny chunks of data, which is inefficient and costly.
  • Batching. Any amount of batching on the client side improves optimization.
  • Provide the Kusto.Ingest client with an exact, uncompressed data size. Not doing so may cause extra storage transactions.
  • Avoid sending data for ingestion with the FlushImmediately flag set to true. Also, avoid sending small chunks with ingest-by/drop-by tags set. Both practices:
    • Prevent the service from properly aggregating the data during ingestion
    • Cause unnecessary storage transactions following the ingestion
    • Affect COGS
    • Are likely to degrade the ingestion or query performance of your cluster, if used excessively
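For data that is already compressed, one way to obtain the exact uncompressed size mentioned above is to stream through the file once before submitting it. The following is a minimal sketch using only the Python standard library. It assumes gzip-compressed input, and the function name uncompressed_size is hypothetical. Passing the resulting value to the ingest client is not shown here.

```python
import gzip


def uncompressed_size(path: str, chunk_bytes: int = 1024 * 1024) -> int:
    """Return the exact uncompressed size, in bytes, of a gzip file.

    The file is decompressed in fixed-size chunks, so memory use stays
    bounded regardless of how large the uncompressed data is.
    """
    total = 0
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Streaming costs one extra read of the file. If the producer of the data already knows the uncompressed size, recording it at write time avoids that pass.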