Use LightIngest to ingest data into Azure Data Explorer

LightIngest is a command-line utility for ad-hoc data ingestion into Azure Data Explorer. The utility can pull source data from a local folder or from an Azure Blob Storage container. LightIngest is most useful when you want to ingest a large amount of data, because there is no time constraint on ingestion duration. It's also useful when you want to later query records according to the time they were created, and not the time they were ingested.

Prerequisites

Install LightIngest

  1. Navigate to the location on your computer where you downloaded LightIngest.
  2. Using WinRAR, extract the tools directory to your computer.

Run LightIngest

  1. Navigate to the extracted tools directory on your computer.

  2. Delete the existing location information from the location bar.


  3. Enter cmd, and then press Enter.

  4. At the command prompt, enter LightIngest.exe followed by the relevant command-line arguments.

    Tip

    To see a list of the supported command-line arguments, enter LightIngest.exe /help.


  5. Enter ingest- followed by the connection string to the Azure Data Explorer cluster that will manage the ingestion. Enclose the connection string in double quotes and follow the Kusto connection strings specification.

    For example:

    "ingest-{Cluster name and region}.kusto.chinacloudapi.cn;AAD Federated Security=True" -db:{Database} -table:Trips -source:"https://{Account}.blob.core.chinacloudapi.cn/{ROOT_CONTAINER};{StorageAccountKey}" -pattern:"*.csv.gz" -format:csv -limit:2 -ignoreFirst:true -cr:10.0 -dontWait:true
    

Recommendations

  • The recommended method is for LightIngest to work with the ingestion endpoint at https://ingest-{yourClusterNameAndRegion}.kusto.chinacloudapi.cn. This way, the Azure Data Explorer service can manage the ingestion load, and you can easily recover from transient errors. However, you can also configure LightIngest to work directly with the engine endpoint (https://{yourClusterNameAndRegion}.kusto.chinacloudapi.cn).

    Note

    If you ingest directly against the engine endpoint, you don't need to include ingest-. However, there won't be a Data Management (DM) feature to protect the engine and improve the ingestion success rate. A minimal sketch of a direct-to-engine invocation appears after this list.

  • For optimal ingestion performance, the raw data size is needed so that LightIngest can estimate the uncompressed size of local files. However, LightIngest might not be able to correctly estimate the raw size of compressed blobs without first downloading them. Therefore, when ingesting compressed blobs, set the rawSizeBytes property on the blob metadata to the uncompressed data size in bytes (see the Azure CLI sketch after this list).
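
The following is a minimal sketch of a direct-to-engine invocation, per the note above; the cluster, database, table, and source path values are placeholders for illustration:

LightIngest.exe "https://{yourClusterNameAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"PATH"
  -pattern:*.csv
  -format:csv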
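
To set rawSizeBytes before ingestion, one option is the Azure CLI's blob metadata command. The following is a hedged sketch in which the account, container, and blob names are placeholders and 1073741824 stands for a 1-GB uncompressed size:

az storage blob metadata update --account-name ACCOUNT --container-name CONT --name DIR/blobname.csv.gz --metadata rawSizeBytes=1073741824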

Command-line arguments

| Argument name | Type | Description | Mandatory/Optional |
|---|---|---|---|
|  | string | Azure Data Explorer connection string specifying the Kusto endpoint that will handle the ingestion. Should be enclosed in double quotes | Mandatory |
| -database, -db | string | Target Azure Data Explorer database name | Optional |
| -table | string | Target Azure Data Explorer table name | Mandatory |
| -sourcePath, -source | string | Path to source files or root URI of the blob container. If the data is in blobs, must contain the storage account key or SAS. Recommended to enclose in double quotes | Mandatory |
| -prefix | string | When the source data to ingest resides on blob storage, this URL prefix is shared by all blobs, excluding the container name. For example, if the data is in MyContainer/Dir1/Dir2, then the prefix should be Dir1/Dir2. Recommended to enclose in double quotes | Optional |
| -pattern | string | Pattern by which source files/blobs are picked. Supports wildcards. For example, "*.csv". Recommended to enclose in double quotes | Optional |
| -zipPattern | string | Regular expression to use when selecting which files in a ZIP archive to ingest. All other files in the archive are ignored. For example, "*.csv". Recommended to enclose in double quotes | Optional |
| -format, -f | string | Source data format. Must be one of the supported formats | Optional |
| -ingestionMappingPath, -mappingPath | string | Path to a local file for ingestion column mapping. Mandatory for Json and Avro formats. See data mappings | Optional |
| -ingestionMappingRef, -mappingRef | string | Name of an ingestion column mapping previously created on the table. Mandatory for Json and Avro formats. See data mappings | Optional |
| -creationTimePattern | string | When set, is used to extract the CreationTime property from the file or blob path. See How to ingest data using CreationTime | Optional |
| -ignoreFirstRow, -ignoreFirst | bool | If set, the first record of each file/blob is ignored (for example, if the source data has headers) | Optional |
| -tag | string | Tags to associate with the ingested data. Multiple occurrences are permitted | Optional |
| -dontWait | bool | If set to 'true', doesn't wait for ingestion completion. Useful when ingesting large numbers of files/blobs | Optional |
| -compression, -cr | double | Compression ratio hint. Useful when ingesting compressed files/blobs to help Azure Data Explorer assess the raw data size. Calculated as original size divided by compressed size | Optional |
| -limit, -l | integer | If set, limits the ingestion to the first N files | Optional |
| -listOnly, -list | bool | If set, only displays the items that would have been selected for ingestion | Optional |
| -ingestTimeout | integer | Timeout in minutes for completion of all ingest operations. Defaults to 60 | Optional |
| -forceSync | bool | If set, forces synchronous ingestion. Defaults to false | Optional |
| -dataBatchSize | integer | Sets the total size limit (MB, uncompressed) of each ingest operation | Optional |
| -filesInBatch | integer | Sets the file/blob count limit of each ingest operation | Optional |
| -devTracing, -trace | string | If set, diagnostic logs are written to a local directory (by default, RollingLogs in the current directory; can be modified by setting the switch value) | Optional |
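
For example, before launching a large ingestion, the batching switches can be combined with -listOnly to preview which items would be selected. This is a sketch using placeholder names and the switch syntax shown in the usage examples below:

LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"https://ACCOUNT.blob.core.chinacloudapi.cn/CONT;{StorageAccountKey}"
  -pattern:*.csv.gz
  -format:csv
  -dataBatchSize:1024
  -filesInBatch:100
  -listOnly:true

Note that the -compression hint is simple arithmetic: for instance, 1 GB of raw data that compresses down to 100 MB corresponds to -cr:10.0.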

Azure blob-specific capabilities

When used with Azure blobs, LightIngest will use certain blob metadata properties to augment the ingestion process.

| Metadata property | Usage |
|---|---|
| rawSizeBytes, kustoUncompressedSizeBytes | If set, interpreted as the uncompressed data size |
| kustoCreationTime, kustoCreationTimeUtc | Interpreted as a UTC timestamp. If set, used to override the creation time in Kusto. Useful for backfilling scenarios |
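
As an illustration of a backfill scenario, the following hedged Azure CLI sketch stamps a blob with a creation time before ingestion; the account, container, blob name, and timestamp are placeholders, and the ISO 8601 UTC value format is an assumption:

az storage blob metadata update --account-name ACCOUNT --container-name CONT --name DIR/blobname.csv.gz --metadata kustoCreationTimeUtc="2002-12-01T00:00:00Z"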

Usage examples

How to ingest data using CreationTime

When you load historical data from an existing system to Azure Data Explorer, all records receive the same ingestion date. To enable partitioning your data by creation time rather than ingestion time, you can use the -creationTimePattern argument, which extracts the CreationTime property from the file or blob path. The pattern doesn't need to reflect the entire item path, just the section enclosing the timestamp you want to use.

The argument values must include:

  • Constant text immediately preceding the timestamp format, enclosed in single quotes (prefix)
  • The timestamp format, in standard .NET DateTime notation
  • Constant text immediately following the timestamp (suffix)

Examples

  • A blob name that contains the datetime as follows: historicalvalues19840101.parquet (the timestamp is four digits for the year, two digits for the month, and two digits for the day of the month)

    The value for the -creationTimePattern argument is part of the file name: "'historicalvalues'yyyyMMdd'.parquet'"

    "ingest-{Cluster name and region}.kusto.chinacloudapi.cn;AAD Federated Security=True" -db:{Database} -table:Trips -source:"https://{Account}.blob.core.chinacloudapi.cn/{ROOT_CONTAINER};{StorageAccountKey}" -creationTimePattern:"'historicalvalues'yyyyMMdd'.parquet'"
     -pattern:"*.csv.gz" -format:csv -limit:2 -ignoreFirst:true -cr:10.0 -dontWait:true
    
  • For a blob URI that refers to a hierarchical folder structure, like https://storageaccount/container/folder/2002/12/01/blobname.extension

    The value for the -creationTimePattern argument is part of the folder structure: "'folder/'yyyy/MM/dd'/blob'"

      "ingest-{Cluster name and region}.kusto.chinacloudapi.cn;AAD Federated Security=True" -db:{Database} -table:Trips -source:"https://{Account}.blob.core.chinacloudapi.cn/{ROOT_CONTAINER};{StorageAccountKey}" -creationTimePattern:"'folder/'yyyy/MM/dd'/blob'"
       -pattern:"*.csv.gz" -format:csv -limit:2 -ignoreFirst:true -cr:10.0 -dontWait:true
    

Ingesting blobs using a storage account key or a SAS token

  • Ingest 10 blobs under the specified storage account ACCOUNT, in folder DIR, under container CONT, matching the pattern *.csv.gz
  • The destination is database DB, table TABLE, and the ingestion mapping MAPPING is precreated on the destination
  • The tool waits until the ingest operations complete
  • Note the different options for specifying the target database, and for using a storage account key vs. a SAS token
LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"https://ACCOUNT.blob.core.chinacloudapi.cn/{ROOT_CONTAINER};{StorageAccountKey}"
  -prefix:"DIR"
  -pattern:*.csv.gz
  -format:csv
  -mappingRef:MAPPING
  -limit:10

LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True;Initial Catalog=DB"
  -table:TABLE
  -source:"https://ACCOUNT.blob.core.chinacloudapi.cn/{ROOT_CONTAINER}?{SAS token}"
  -prefix:"DIR"
  -pattern:*.csv.gz
  -format:csv
  -mappingRef:MAPPING
  -limit:10

Ingesting all blobs in a container, not including header rows

  • Ingest all blobs under the specified storage account ACCOUNT, in folder DIR1/DIR2, under container CONT, matching the pattern *.csv.gz
  • The destination is database DB, table TABLE, and the ingestion mapping MAPPING is precreated on the destination
  • The source blobs contain a header line, so the tool is instructed to drop the first record of each blob
  • The tool posts the data for ingestion and doesn't wait for the ingest operations to complete
LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"https://ACCOUNT.blob.core.chinacloudapi.cn/{ROOT_CONTAINER}?{SAS token}"
  -prefix:"DIR1/DIR2"
  -pattern:*.csv.gz
  -format:csv
  -mappingRef:MAPPING
  -ignoreFirstRow:true

Ingesting all JSON files from a path

  • Ingest all files under path PATH, matching the pattern *.json
  • The destination is database DB, table TABLE, and the ingestion mapping is defined in local file MAPPING_FILE_PATH
  • The tool posts the data for ingestion and doesn't wait for the ingest operations to complete
LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"PATH"
  -pattern:*.json
  -format:json
  -mappingPath:"MAPPING_FILE_PATH"

Ingesting files and writing diagnostic trace files

  • Ingest all files under path PATH, matching the pattern *.json
  • The destination is database DB, table TABLE, and the ingestion mapping is defined in local file MAPPING_FILE_PATH
  • The tool posts the data for ingestion and doesn't wait for the ingest operations to complete
  • Diagnostic trace files are written locally under folder LOGS_PATH
LightIngest.exe "https://ingest-{ClusterAndRegion}.kusto.chinacloudapi.cn;Fed=True"
  -database:DB
  -table:TABLE
  -source:"PATH"
  -pattern:*.json
  -format:json
  -mappingPath:"MAPPING_FILE_PATH"
  -trace:"LOGS_PATH"