Query data in Azure Data Lake using Azure Data Explorer

Azure Data Lake Storage is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you reduce your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads.

Azure Data Explorer integrates with Azure Blob Storage and Azure Data Lake Storage (Gen2), providing fast, cached, and indexed access to data stored in external storage. You can analyze and query data without ingesting it into Azure Data Explorer first. You can also query across ingested and uningested external data simultaneously.

Tip

For the best query performance, ingest the data into Azure Data Explorer. The ability to query external data without prior ingestion should be reserved for historical data or data that is rarely queried. Optimize your external data query performance for best results.

Create an external table

Let's say you have many CSV files containing historical information about products stored in a warehouse, and you want to run a quick analysis to find the five most popular products from last year. In this example, the CSV files look like this:

Timestamp           | ProductId | ProductDescription
2019-01-01 11:21:00 | TO6050    | 3.5in DS/HD Floppy Disk
2019-01-01 11:30:55 | YDX1      | Yamaha DX1 Synthesizer
...                 | ...       | ...

The files are stored in the Azure Blob storage account mycompanystorage under a container named archivedproducts, partitioned by date:

https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00000-7e967c99-cf2b-4dbb-8c53-ce388389470d.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00001-ba356fa4-f85f-430a-8b5a-afd64f128ca4.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00002-acb644dc-2fc6-467c-ab80-d1590b23fc31.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/01/part-00003-cd5fad16-a45e-4f8c-a2d0-5ea5de2f4e02.csv.gz
https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts/2019/01/02/part-00000-ffc72d50-ff98-423c-913b-75482ba9ec86.csv.gz
...

To run a KQL query directly on these CSV files, use the .create external table command to define an external table in Azure Data Explorer. For more information on the external table creation command options, see external table commands.

.create external table ArchivedProducts(Timestamp:datetime, ProductId:string, ProductDescription:string)   
kind=blob            
partition by (Date:datetime = bin(Timestamp, 1d))   
dataformat=csv   
(   
  h@'https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts;StorageSecretKey'
)    

The external table is now visible in the left pane of the Web UI:

(Screenshot: the external table in the Web UI)

External table permissions

  • A database user can create an external table. The table creator automatically becomes the table administrator.
  • A cluster, database, or table administrator can edit an existing table.
  • Any database user or reader can query an external table.
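For example, the table admin role can be managed with the external table security role commands. The following is a minimal sketch that grants the admin role on the ArchivedProducts table defined above; the user principal shown is a hypothetical placeholder:

```kusto
// Grant the external table admin role to a (hypothetical) AAD user.
.add external table ArchivedProducts admins ('aaduser=user@contoso.com')
```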

Querying an external table

Once an external table is defined, the external_table() function can be used to refer to it. The rest of the query is standard Kusto Query Language.

external_table("ArchivedProducts")
| where Timestamp > ago(365d)
| summarize Count=count() by ProductId
| top 5 by Count

Querying external and ingested data together

You can query both external tables and ingested data tables within the same query. You can join or union the external table with additional data from Azure Data Explorer, SQL Server, or other sources. Use a let statement to assign a shorthand name to an external table reference.

In the example below, Products is an ingested data table and ArchivedProducts is the external table we defined previously:

let T1 = external_table("ArchivedProducts") | where Timestamp > ago(100d);
let T = Products; // T is an internal table
T1 | join T on ProductId | take 10
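Since union is mentioned as well as join, here is a minimal sketch of combining both tables into one result set, assuming the ingested Products table shares the Timestamp and ProductId columns with ArchivedProducts:

```kusto
// Combine recent external rows with the ingested table, then rank products.
let T1 = external_table("ArchivedProducts") | where Timestamp > ago(100d);
Products
| union T1
| summarize Count=count() by ProductId
| top 5 by Count
```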

Querying hierarchical data formats

Azure Data Explorer allows querying hierarchical formats, such as JSON, Parquet, Avro, and ORC. To map a hierarchical data schema to an external table schema (if it's different), use the external table mappings commands. For instance, if you want to query JSON log files with the following format:

{
  "timestamp": "2019-01-01 10:00:00.238521",   
  "data": {    
    "tenant": "e1ef54a6-c6f2-4389-836e-d289b37bcfe0",   
    "method": "RefreshTableMetadata"   
  }   
}   
{
  "timestamp": "2019-01-01 10:00:01.845423",   
  "data": {   
    "tenant": "9b49d0d7-b3e6-4467-bb35-fa420a25d324",   
    "method": "GetFileList"   
  }   
}
...

The external table definition looks like this:

.create external table ApiCalls(Timestamp: datetime, TenantId: guid, MethodName: string)
kind=blob
dataformat=multijson
( 
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;StorageSecretKey'
)

Define a JSON mapping that maps data fields to external table definition fields:

.create external table ApiCalls json mapping 'MyMapping' '[{"Column":"Timestamp","Properties":{"Path":"$.timestamp"}},{"Column":"TenantId","Properties":{"Path":"$.data.tenant"}},{"Column":"MethodName","Properties":{"Path":"$.data.method"}}]'

When you query the external table, the mapping is invoked, and the relevant data is mapped to the external table columns:

external_table('ApiCalls') | take 10

For more information on mapping syntax, see data mappings.

Query the TaxiRides external table in the help cluster

Use the test cluster called help to try out different Azure Data Explorer capabilities. The help cluster contains an external table definition for a New York City taxi dataset containing billions of taxi rides.

Create the external table TaxiRides

This section shows the query used to create the TaxiRides external table in the help cluster. Since this table has already been created, you can skip this section and go directly to Query TaxiRides external table data.

.create external table TaxiRides
(
  trip_id: long,
  vendor_id: string, 
  pickup_datetime: datetime,
  dropoff_datetime: datetime,
  store_and_fwd_flag: string,
  rate_code_id: int,
  pickup_longitude: real,
  pickup_latitude: real,
  dropoff_longitude: real,
  dropoff_latitude: real,
  passenger_count: int,
  trip_distance: real,
  fare_amount: real,
  extra: real,
  mta_tax: real,
  tip_amount: real,
  tolls_amount: real,
  ehail_fee: real,
  improvement_surcharge: real,
  total_amount: real,
  payment_type: string,
  trip_type: int,
  pickup: string,
  dropoff: string,
  cab_type: string,
  precipitation: int,
  snow_depth: int,
  snowfall: int,
  max_temperature: int,
  min_temperature: int,
  average_wind_speed: int,
  pickup_nyct2010_gid: int,
  pickup_ctlabel: string,
  pickup_borocode: int,
  pickup_boroname: string,
  pickup_ct2010: string,
  pickup_boroct2010: string,
  pickup_cdeligibil: string,
  pickup_ntacode: string,
  pickup_ntaname: string,
  pickup_puma: string,
  dropoff_nyct2010_gid: int,
  dropoff_ctlabel: string,
  dropoff_borocode: int,
  dropoff_boroname: string,
  dropoff_ct2010: string,
  dropoff_boroct2010: string,
  dropoff_cdeligibil: string,
  dropoff_ntacode: string,
  dropoff_ntaname: string,
  dropoff_puma: string
)
kind=blob 
partition by bin(pickup_datetime, 1d)
dataformat=csv
( 
    h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey'
)

You can find the created TaxiRides table by looking at the left pane of the Web UI:

(Screenshot: the TaxiRides external table)

Query TaxiRides external table data

Sign in to https://dataexplorer.azure.cn/clusters/help/databases/Samples.

Query the TaxiRides external table without partitioning

Run this query on the external table TaxiRides to show rides for each day of the week, across the entire dataset.

external_table("TaxiRides")
| summarize count() by dayofweek(pickup_datetime)
| render columnchart

This query shows the busiest day of the week. Since the data isn't partitioned, the query may take several minutes to return results.

(Chart: rendering of the unpartitioned query)

Query the TaxiRides external table with partitioning

Run this query on the external table TaxiRides to show the taxi cab types (yellow or green) used in January 2017.

external_table("TaxiRides")
| where pickup_datetime between (datetime(2017-01-01) .. datetime(2017-02-01))
| summarize count() by cab_type
| render piechart

This query uses partitioning, which optimizes query time and performance. The query filters on a partitioned column (pickup_datetime) and returns results in a few seconds.

(Chart: rendering of the partitioned query)

You can write additional queries to run on the external table TaxiRides and learn more about the data.
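For example, here is one more query you could try. It reuses the partition filter from the previous query so the query stays fast, and breaks January 2017 rides down by passenger count (column names are taken from the table definition earlier in this section):

```kusto
external_table("TaxiRides")
| where pickup_datetime between (datetime(2017-01-01) .. datetime(2017-02-01))
| summarize count() by passenger_count
| render columnchart
```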

Optimize your query performance

Optimize your query performance in the lake by using the following best practices for querying external data.

Data format

  • Use a columnar format for analytical queries, for the following reasons:
    • Only the columns relevant to a query need to be read.
    • Column encoding techniques can reduce data size significantly.
  • Azure Data Explorer supports the Parquet and ORC columnar formats. Parquet format is recommended because of its optimized implementation.
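As a sketch, the ArchivedProducts table from earlier could be defined over Parquet files simply by changing the dataformat; the container name here is hypothetical:

```kusto
.create external table ArchivedProductsParquet(Timestamp:datetime, ProductId:string, ProductDescription:string)
kind=blob
dataformat=parquet
(
  h@'https://mycompanystorage.blob.core.chinacloudapi.cn/archivedproducts-parquet;StorageSecretKey'
)
```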

Azure region

Check that the external data is in the same Azure region as your Azure Data Explorer cluster. This setup reduces cost and data fetch time.

File size

The optimal file size is hundreds of MB (up to 1 GB) per file. Avoid many small files, which add unneeded overhead, such as a slower file enumeration process and limited use of the columnar format. The number of files should be greater than the number of CPU cores in your Azure Data Explorer cluster.

Compression

Use compression to reduce the amount of data fetched from remote storage. For the Parquet format, use the internal Parquet compression mechanism, which compresses column groups separately and allows you to read them separately. To validate the use of the compression mechanism, check that the files are named <filename>.gz.parquet or <filename>.snappy.parquet, not <filename>.parquet.gz.

Partitioning

Organize your data using "folder" partitions that enable the query to skip irrelevant paths. When planning partitioning, consider file size and the common filters in your queries, such as timestamp or tenant ID.
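As a sketch, an external table can be partitioned by more than one column. The following hypothetical Logs table is partitioned by both tenant ID and day, so queries that filter on either column can skip irrelevant folders (the table name, columns, and storage URL are placeholders):

```kusto
.create external table Logs(Timestamp:datetime, TenantId:string, Message:string)
kind=blob
partition by (TenantId:string = TenantId, Date:datetime = startofday(Timestamp))
dataformat=csv
(
  h@'https://storageaccount.blob.core.chinacloudapi.cn/logs;StorageSecretKey'
)
```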

VM size

Select VM SKUs with more cores and higher network throughput (memory is less important). For more information, see Select the correct VM SKU for your Azure Data Explorer cluster.

Next steps

  • Query your data in Azure Data Lake using Azure Data Explorer. Learn to write queries and derive additional insights from your data.