Azure Data Explorer data ingestion overview

Data ingestion is the process used to load data records from one or more sources into a table in Azure Data Explorer. Once ingested, the data becomes available for query.

The diagram below shows the end-to-end flow for working in Azure Data Explorer, including the different ingestion methods.

(Figure: overview of data ingestion and management.)

The Azure Data Explorer data management service, which is responsible for data ingestion, implements the following process:

  1. Azure Data Explorer pulls data from an external source and reads requests from a pending Azure queue.
  2. Data is batched or streamed to the Data Manager. Batch data flowing to the same database and table is optimized for ingestion throughput.
  3. Azure Data Explorer validates initial data and converts data formats where necessary.
  4. Further data manipulation includes matching schema, and organizing, indexing, encoding, and compressing the data.
  5. Data is persisted in storage according to the set retention policy.
  6. The Data Manager then commits the data ingest to the engine, where it's available for query.

Supported data formats, properties, and permissions

Batching vs streaming ingestion

  • Batching ingestion does data batching and is optimized for high ingestion throughput. This method is the preferred and most performant type of ingestion. Data is batched according to ingestion properties. Small batches of data are then merged and optimized for fast query results. The ingestion batching policy can be set on databases or tables. By default, the maximum batching value is 5 minutes, 1000 items, or a total size of 500 MB. A KQL sketch of configuring this policy follows this list.

  • Streaming ingestion is ongoing data ingestion from a streaming source. Streaming ingestion allows near real-time latency for small sets of data per table. Data is initially ingested to row store, then moved to column store extents. Streaming ingestion can be done using an Azure Data Explorer client library or one of the supported data pipelines.
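For illustration, here's a minimal sketch of setting both behaviors in KQL, assuming a placeholder table named MyTable (the batching values mirror the defaults described above):

```kusto
// Set a table-level ingestion batching policy; a batch is sealed as soon as
// any one of the three limits is reached (MyTable is a placeholder name).
.alter table MyTable policy ingestionbatching @'{"MaximumBatchingTimeSpan": "00:05:00", "MaximumNumberOfItems": 1000, "MaximumRawDataSizeMB": 500}'

// Enable streaming ingestion for the same table; streaming ingestion must
// also be enabled on the cluster itself.
.alter table MyTable policy streamingingestion enable
```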

Ingestion methods and tools

Azure Data Explorer supports several ingestion methods, each with its own target scenarios. These methods include ingestion tools, connectors and plugins for diverse services, managed pipelines, programmatic ingestion using SDKs, and direct access to ingestion.

Ingestion using managed pipelines

For organizations that want management (throttling, retries, monitors, alerts, and more) handled by an external service, using a connector is likely the most appropriate solution. Queued ingestion is appropriate for large data volumes. Azure Data Explorer supports Azure pipelines such as Event Hub and IoT Hub (see the comparison table below).

Ingestion using connectors and plugins

  • Apache Spark connector: An open-source project that can run on any Spark cluster. It implements a data source and a data sink for moving data between Azure Data Explorer and Spark clusters, letting you build fast, scalable applications that target data-driven scenarios. See Azure Data Explorer Connector for Apache Spark.

Programmatic ingestion using SDKs

Azure Data Explorer provides SDKs that can be used for query and data ingestion. Programmatic ingestion is optimized for reducing ingestion costs (COGs) by minimizing storage transactions during and following the ingestion process.

Available SDKs and open-source projects

Tools

  • LightIngest: A command-line utility for ad hoc data ingestion into Azure Data Explorer. The utility can pull source data from a local folder or from an Azure Blob Storage container.

Kusto Query Language ingest control commands

There are a number of methods by which data can be ingested directly into the engine using Kusto Query Language (KQL) commands, as the sketch after the following list shows. Because these methods bypass the Data Management services, they're only appropriate for exploration and prototyping. Don't use them in production or high-volume scenarios.

  • Inline ingestion: A control command, .ingest inline, is sent to the engine, with the data to be ingested being part of the command text itself. This method is intended for improvised testing purposes.

  • Ingest from query: A control command (.set, .append, .set-or-append, or .set-or-replace) is sent to the engine, with the data specified indirectly as the result of a query or a command.

  • Ingest from storage (pull): A control command, .ingest into, is sent to the engine, with the data stored in external storage (for example, Azure Blob Storage) that is accessible by the engine and pointed to by the command.
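A minimal sketch of all three variants, using placeholder table names, placeholder data, and a placeholder storage URI with an elided SAS token:

```kusto
// Inline ingestion: the records are part of the command text (testing only).
.ingest inline into table MyTable <|
"2016-01-01",123,"hello"
"2016-01-02",456,"world"

// Ingest from query: the data is the result of a query against another table.
.set-or-append MyTableCopy <| MyTable | where Value > 100

// Ingest from storage (pull): the engine reads from external storage.
// The URL and <SAS-token> below are placeholders, not real values.
.ingest into table MyTable ('https://mystorage.blob.core.windows.net/container/file.csv;<SAS-token>')
```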

Comparing ingestion methods and tools

| Ingestion name | Data type | Maximum file size | Streaming, batching, direct | Most common scenarios | Considerations |
| --- | --- | --- | --- | --- | --- |
| LightIngest | All formats supported | 1 GB uncompressed (see note) | Batching via DM or direct ingestion to engine | Data migration, historical data with adjusted ingestion timestamps, bulk ingestion (no size restriction) | Case-sensitive, space-sensitive |
| ADX Kafka | | | | | |
| ADX to Apache Spark | | | | | |
| LogStash | | | | | |
| IoT Hub | Supported data formats | N/A | Batching, streaming | IoT messages, IoT events, IoT properties | |
| Event Hub | Supported data formats | N/A | Batching, streaming | Messages, events | |
| Net Std | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Python | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Node.js | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Java | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| REST | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Go | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |

Note

When referenced in the table above, ingestion supports a maximum file size of 5 GB. The recommendation is to ingest files between 100 MB and 1 GB.

Ingestion process

Once you have chosen the most suitable ingestion method for your needs, do the following steps (a combined KQL sketch follows the list):

  1. Set retention policy

    Data ingested into a table in Azure Data Explorer is subject to the table's effective retention policy. Unless set on a table explicitly, the effective retention policy is derived from the database's retention policy. Hot retention is a function of cluster size and your retention policy. Ingesting more data than you have available space forces the first-in data to cold retention.

    Make sure that the database's retention policy is appropriate for your needs. If not, explicitly override it at the table level. For more information, see retention policy.

  2. Create a table

    In order to ingest data, a table needs to be created beforehand. Use one of the following options:

    Note

    If a record is incomplete or a field cannot be parsed as the required data type, the corresponding table columns will be populated with null values.

  3. Create schema mapping

    Schema mapping helps bind source data fields to destination table columns. Mapping allows you to take data from different sources into the same table, based on the defined attributes. Different types of mappings are supported, both row-oriented (CSV, JSON, and AVRO) and column-oriented (Parquet). In most methods, mappings can also be pre-created on the table and referenced from the ingest command parameter.

  4. Set update policy (optional)

    Some of the data format mappings (Parquet, JSON, and Avro) support simple and useful ingest-time transformations. Where the scenario requires more complex processing at ingest time, use an update policy, which allows lightweight processing defined with Kusto Query Language commands. The update policy automatically runs extractions and transformations on ingested data on the original table, and ingests the resulting data into one or more destination tables. Set your update policy.
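The following minimal sketch walks through these four steps in KQL. All names (MyDatabase, RawEvents, CleanEvents, RawEventsMapping, ExpandEvents) and the JSON shape are hypothetical placeholders, not values from this article:

```kusto
// 1. Retention: keep database data for 365 days (soft delete).
.alter-merge database MyDatabase policy retention softdelete = 365d

// 2. Create a landing table and a destination table.
.create table RawEvents (Payload: dynamic)
.create table CleanEvents (Timestamp: datetime, DeviceId: string, Value: real)

// 3. Schema mapping: route each incoming JSON document into the Payload column.
.create table RawEvents ingestion json mapping "RawEventsMapping" '[{"column": "Payload", "path": "$", "datatype": "dynamic"}]'

// 4. Update policy: a function transforms rows landing in RawEvents
//    and the policy ingests the results into CleanEvents.
.create function ExpandEvents() {
    RawEvents
    | project Timestamp = todatetime(Payload.ts),
              DeviceId = tostring(Payload.id),
              Value = toreal(Payload.value)
}
.alter table CleanEvents policy update @'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ExpandEvents()", "IsTransactional": false}]'
```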

Next steps