在 Azure 存储或 Azure Data Lake 中创建和更改外部表Create and alter external tables in Azure Storage or Azure Data Lake

以下命令介绍了如何创建位于 Azure Blob 存储、Azure Data Lake Store Gen1 或 Azure Data Lake Store Gen2 中的外部表。The following command describes how to create an external table located in Azure Blob Storage, Azure Data Lake Store Gen1, or Azure Data Lake Store Gen2.

有关外部 Azure 存储表功能的简介,请参阅使用 Azure 数据资源管理器查询 Azure Data Lake 中的数据For an introduction to the external Azure Storage tables feature, see Query data in Azure Data Lake using Azure Data Explorer.

.create 或 .alter external table.create or .alter external table

语法Syntax

(.create | .alter | .create-or-alter) external table TableName ( Schema )(.create | .alter | .create-or-alter) external table TableName ( Schema )
kind = (blob | adl)kind = (blob | adl)
[partition by ( Partitions ) [pathformat = ( PathFormat )]][partition by ( Partitions ) [pathformat = ( PathFormat )]]
dataformat = Formatdataformat = Format
( StorageConnectionString [, ...] ) ( StorageConnectionString [, ...] )
[with (PropertyName = Value , ... )][with (PropertyName = Value , ... )]

在执行命令的数据库中创建或更改新的外部表。Creates or alters a new external table in the database in which the command is executed.

备注

  • 如果该表存在,则 .create 命令将失败并显示错误。If the table exists, .create command will fail with an error. 使用 .create-or-alter.alter 来修改现有表。Use .create-or-alter or .alter to modify existing tables.
  • 不支持更改外部 blob 表的架构、格式或分区定义。Altering the schema, format, or the partition definition of an external blob table isn't supported.
  • 此操作需要数据库用户权限(对于 .create)和表管理员权限(对于 .alter)。The operation requires database user permission for .create and table admin permission for .alter.

ParametersParameters

TableNameTableName

遵循实体名称规则的外部表名称。External table name that adheres to entity names rules. 外部表不能与同一数据库中的常规表具有相同的名称。An external table can't have the same name as a regular table in the same database.

SchemaSchema

外部数据架构是使用以下格式描述的:External data schema is described using the following format:

  ColumnName : ColumnType [, ColumnName : ColumnType ...]  ColumnName : ColumnType [, ColumnName : ColumnType ...]

其中,ColumnName 遵循实体命名规则,ColumnType 是受支持的数据类型之一。where ColumnName adheres to entity naming rules, and ColumnType is one of supported data types.

提示

如果外部数据架构未知,请使用 infer_storage_schema 插件,该插件有助于根据外部文件内容推断架构。If the external data schema is unknown, use the infer_storage_schema plug-in, which helps infer the schema based on external file contents.

PartitionsPartitions

对外部表进行分区时所依据的列的逗号分隔列表。Comma-separated list of columns by which an external table is partitioned. 分区列可以存在于数据文件本身或文件路径的 sa 部分中(有关详细信息,请参阅虚拟列)。Partition column can exist in the data file itself, or sa part of the file path (read more on virtual columns).

分区列表是分区列的任意组合,使用以下形式之一指定:Partitions list is any combination of partition columns, specified using one of the following forms:

  • 表示虚拟列的分区。Partition, representing a virtual column.

    PartitionName : (datetime | string)PartitionName : (datetime | string)

  • 基于字符串列值的分区。Partition, based on a string column value.

    PartitionName : string = ColumnNamePartitionName : string = ColumnName

  • 基于字符串列值哈希且模数为 Number 的分区。Partition, based on a string column value hash, modulo Number.

    PartitionName : long = hash ( ColumnName , Number )PartitionName : long = hash ( ColumnName , Number )

  • 基于日期/时间列的截断值的分区。Partition, based on truncated value of a datetime column. 请参阅有关 startofyearstartofmonthstartofweekstartofdaybin 函数的文档。See documentation on startofyear, startofmonth, startofweek, startofday or bin functions.

    PartitionName : datetime = (startofyear | startofmonth | startofweek | startofday) ( ColumnName )PartitionName : datetime = (startofyear | startofmonth | startofweek | startofday) ( ColumnName )
    PartitionName : datetime = bin ( ColumnName , TimeSpan )PartitionName : datetime = bin ( ColumnName , TimeSpan )

若要检查分区定义的正确性,请在创建外部表时使用属性 sampleUrisfilesPreviewTo check partitioning definition correctness, use the property sampleUris or filesPreview when creating an external table.

PathFormatPathFormat

除分区之外还可以指定的外部数据文件夹 URI 路径格式。External data folder URI path format, which can be specified in addition to partitions. 路径格式是分区元素和文本分隔符的序列:Path format is a sequence of partition elements and text separators:

  [StringSeparator] Partition [StringSeparator] [Partition [StringSeparator] ...]  [StringSeparator] Partition [StringSeparator] [Partition [StringSeparator] ...]

其中,Partition 指的是在 partition by 子句中声明的分区,StringSeparator 是括在引号中的任何文本。where Partition refers to a partition declared in partition by clause, and StringSeparator is any text enclosed in quotes. 必须使用 StringSeparator 分隔连续的分区元素。Consecutive partition elements must be set apart using StringSeparator.

可以使用呈现为字符串并以相应的文本分隔符分隔的分区元素来构造原始文件路径前缀。Original file path prefix can be constructed using partition elements rendered as strings and separated with corresponding text separators. 若要指定用于呈现日期/时间分区值的格式,可以使用以下宏:To specify format used for rendering a datetime partition value, the following macro can be used:

  datetime_pattern ( DateTimeFormat , PartitionName )  datetime_pattern ( DateTimeFormat , PartitionName )

其中,DateTimeFormat 遵循 .NET 格式规范,并可扩展为允许将格式说明符括在大括号中。where DateTimeFormat adheres to the .NET format specification, with an extension allowing to enclose format specifiers into curly brackets. 例如,以下两种格式是等效的:For example, the following two formats are equivalent:

  'year='yyyy'/month='MMyear={yyyy}/month={MM}  'year='yyyy'/month='MM and year={yyyy}/month={MM}

默认情况下,使用以下格式呈现日期/时间值:By default, datetime values are rendered using the following formats:

分区函数Partition function 默认格式Default format
startofyear yyyy
startofmonth yyyy/MM
startofweek yyyy/MM/dd
startofday yyyy/MM/dd
bin(Column, 1d)bin(Column, 1d) yyyy/MM/dd
bin(Column, 1h)bin(Column, 1h) yyyy/MM/dd/HH
bin(Column, 1m)bin(Column, 1m) yyyy/MM/dd/HH/mm

如果在外部表定义中省略了 PathFormat,则会假定使用 / 分隔符分隔所有分区,分区的顺序与定义它们的顺序完全相同。If PathFormat is omitted from the external table definition, it's assumed that all partitions, in exactly the same order as they're defined, are separated using / separator. 分区是使用其默认字符串表示形式呈现的。Partitions are rendered using their default string presentation.

若要检查路径格式定义的正确性,请在创建外部表时使用属性 sampleUrisfilesPreviewTo check path format definition correctness, use the property sampleUris or filesPreview when creating an external table.

备注

PathFormat 只能描述存储“文件夹”URI 路径。PathFormat can only describe the storage "folder" URI path. 若要按文件名进行筛选,请使用 NamePrefix 和/或 FileExtension 外部表属性。To filter by file name, use NamePrefix and/or FileExtension external table properties.

FormatFormat

数据格式,可以是任意引入格式The data format, any of the ingestion formats.

备注

将外部表用于导出方案仅限于以下格式:CSVTSVJSONParquetUsing external table for export scenario is limited to the following formats: CSV, TSV, JSON and Parquet.

StorageConnectionStringStorageConnectionString

指向 Azure Blob 存储 Blob 容器或 Azure Data Lake Store 文件系统(虚拟目录或文件夹)的一个或多个路径,包括凭据。One or more paths to Azure Blob Storage blob containers or Azure Data Lake Store file systems (virtual directories or folders), including credentials. 有关详细信息,请参阅存储连接字符串See storage connection strings for details.

提示

请提供多个存储帐户,以避免将大量数据导出到外部表时出现存储限制。Provide more than a single storage account to avoid storage throttling while exporting large amounts of data to the external table. 导出会将写入分布到提供的所有帐户之间。Export will distribute the writes between all accounts provided.

可选属性Optional Properties

属性Property 类型Type 说明Description
folder string 表的文件夹Table's folder
docString string 用来记录表的字符串String documenting the table
compressed bool 如果设置了此项,则表示文件是否压缩为 .gz 文件(仅用在导出方案中)If set, indicates whether the files are compressed as .gz files (used in export scenario only)
includeHeaders string 对于带分隔符的文本格式(CSV、TSV、...),指示文件是否包含标头。For delimited text formats (CSV, TSV, ...), indicates whether files contain a header. 可能的值包括:All(所有文件都包含标头)、FirstFile(文件夹中的第一个文件包含标头)、None(无文件包含标头)。Possible values are: All (all files contain a header), FirstFile (first file in a folder contains a header), None (no files contain a header).
namePrefix string 如果设置了此项,则表示文件的前缀。If set, indicates the prefix of the files. 在写入操作中,所有文件都将用此前缀来写入。On write operations, all files will be written with this prefix. 在读取操作中,将只读取具有此前缀的文件。On read operations, only files with this prefix are read.
fileExtension string 如果设置了此项,则表示文件的文件扩展名。If set, indicates file extensions of the files. 写入时,文件名将以此后缀结尾。On write, files names will end with this suffix. 读取时,将只读取具有此文件扩展名的文件。On read, only files with this file extension will be read.
encoding string 表示文本编码方式:UTF8NoBOM(默认值)或 UTF8BOMIndicates how the text is encoded: UTF8NoBOM (default) or UTF8BOM.
sampleUris bool 如果设置了此属性,命令结果就会提供外部表定义所需的模拟外部数据文件 URI 的几个示例。If set, the command result provides several examples of simulated external data files URI as they are expected by the external table definition. 此选项有助于验证是否正确定义了 PartitionsPathFormat 参数 。This option helps validate whether the Partitions and PathFormat parameters are defined properly.
filesPreview bool 如果设置了此属性,其中某一个命令结果表就会包含 .show external table artifacts 命令的预览。If set, one of the command result tables contains a preview of .show external table artifacts command. sampleUri 类似,该选项有助于验证外部表定义的 PartitionsPathFormat 参数 。Like sampleUri, the option helps validate the Partitions and PathFormat parameters of external table definition.
validateNotEmpty bool 如果设置了,则验证连接字符串是否有内容。If set, the connection strings are validated for having content in them. 如果指定的 URI 位置不存在,或者没有足够的访问它的权限,则该命令会失败。The command will fail if the specified URI location doesn't exist, or if there are insufficient permissions to access it.
dryRun bool 如果设置了此属性,则不会保留外部表定义。If set, the external table definition is not persisted. 此选项对于验证外部表定义很有用,特别是在与 filesPreviewsampleUris 参数一起使用的情况下。This option is useful for validating the external table definition, especially in conjunction with the filesPreview or sampleUris parameter.

提示

若要详细了解 namePrefixfileExtension 属性在查询期间在数据文件筛选中所起的作用,请参阅文件筛选逻辑部分。To learn more about the role namePrefix and fileExtension properties play in data file filtering during query, see file filtering logic section.

示例Examples

一个未分区的外部表。A non-partitioned external table. 数据文件应直接放置在所定义的容器下:Data files are expected to be placed directly under the container(s) defined:

.create external table ExternalTable (x:long, s:string)  
kind=blob 
dataformat=csv 
( 
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey' 
) 

一个按日期分区的外部表。An external table partitioned by date. 数据文件应放置在默认日期/时间格式为 yyyy/MM/dd 的目录中:Date files are expected to be placed in directories of default datetime format yyyy/MM/dd:

.create external table ExternalTable (Timestamp:datetime, x:long, s:string) 
kind=adl
partition by (Date:datetime = bin(Timestamp, 1d)) 
dataformat=csv 
( 
   h@'abfss://filesystem@storageaccount.dfs.core.chinacloudapi.cn/path;secretKey'
)

一个按月份分区的外部表,目录格式为 year=yyyy/month=MMAn external table partitioned by month, with a directory format of year=yyyy/month=MM:

.create external table ExternalTable (Timestamp:datetime, x:long, s:string) 
kind=blob 
partition by (Month:datetime = startofmonth(Timestamp)) 
pathformat = (datetime_pattern("'year='yyyy'/month='MM", Month)) 
dataformat=csv 
( 
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey' 
) 

一个先按客户名称分区然后按日期分区的外部表。An external table partitioned first by customer name, then by date. 例如,预期的目录结构是 customer_name=Softworks/2019/02/01Expected directory structure is, for example, customer_name=Softworks/2019/02/01:

.create external table ExternalTable (Timestamp:datetime, CustomerName:string) 
kind=blob 
partition by (CustomerNamePart:string = CustomerName, Date:datetime = startofday(Timestamp)) 
pathformat = ("customer_name=" CustomerNamePart "/" Date)
dataformat=csv 
(  
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey' 
)

一个先按客户名称哈希(模数为 10)分区然后按日期分区的外部表。An external table partitioned first by customer name hash (modulo ten), then by date. 例如,预期的目录结构是 customer_id=5/dt=20190201Expected directory structure is, for example, customer_id=5/dt=20190201. 数据文件名以 .txt 扩展名结尾:Data file names end with .txt extension:

.create external table ExternalTable (Timestamp:datetime, CustomerName:string) 
kind=blob 
partition by (CustomerId:long = hash(CustomerName, 10), Date:datetime = startofday(Timestamp)) 
pathformat = ("customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd", Date)) 
dataformat=csv 
( 
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey'
)
with (fileExtension = ".txt")

若要在查询中按分区列进行筛选,请在查询谓词中指定原始列名:To filter by partition columns in a query, specify original column name in query predicate:

external_table("ExternalTable")
 | where Timestamp between (datetime(2020-01-01) .. datetime(2020-02-01))
 | where CustomerName in ("John.Doe", "Ivan.Ivanov")

示例输出Sample Output

TableNameTableName TableTypeTableType 文件夹Folder DocStringDocString 属性Properties ConnectionStringsConnectionStrings 分区Partitions PathFormatPathFormat
ExternalTableExternalTable BlobBlob ExternalTablesExternalTables DocsDocs {"Format":"Csv","Compressed":false,"CompressionType":null,"FileExtension":null,"IncludeHeaders":"None","Encoding":null,"NamePrefix":null}{"Format":"Csv","Compressed":false,"CompressionType":null,"FileExtension":null,"IncludeHeaders":"None","Encoding":null,"NamePrefix":null} ["https://storageaccount.blob.core.chinacloudapi.cn/container1;***"]["https://storageaccount.blob.core.chinacloudapi.cn/container1;***"] [{"Mod":10,"Name":"CustomerId","ColumnName":"CustomerName","Ordinal":0},{"Function":"StartOfDay","Name":"Date","ColumnName":"Timestamp","Ordinal":1}][{"Mod":10,"Name":"CustomerId","ColumnName":"CustomerName","Ordinal":0},{"Function":"StartOfDay","Name":"Date","ColumnName":"Timestamp","Ordinal":1}] "customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd",Date)"customer_id=" CustomerId "/dt=" datetime_pattern("yyyyMMdd",Date)

虚拟列Virtual columns

从 Spark 导出数据时,不会将分区列(在数据帧编写器的 partitionBy 方法中指定)写入到数据文件中。When data is exported from Spark, partition columns (that are specified in dataframe writer's partitionBy method) are not written to data files. 此过程可避免数据重复,因为数据已存在于“文件夹“名称中。This process avoids data duplication because the data already present in "folder" names. 例如 column1=<value>/column2=<value>/,Spark 在读取时可以识别它。For example, column1=<value>/column2=<value>/, and Spark can recognize it upon read.

外部表支持用于指定虚拟列的以下语法:External tables support the following syntax for specifying virtual columns:

.create external table ExternalTable (EventName:string, Revenue:double)  
kind=blob  
partition by (CustomerName:string, Date:datetime)  
pathformat = ("customer=" CustomerName "/date=" datetime_pattern("yyyyMMdd", Date))  
dataformat=parquet
( 
   h@'https://storageaccount.blob.core.chinacloudapi.cn/container1;secretKey'
)

若要在查询中按虚拟列进行筛选,请在查询谓词中指定分区名:To filter by virtual columns in a query, specify partition names in query predicate:

external_table("ExternalTable")
 | where Date between (datetime(2020-01-01) .. datetime(2020-02-01))
 | where CustomerName in ("John.Doe", "Ivan.Ivanov")

文件筛选逻辑File filtering logic

查询外部表时,查询引擎通过筛选掉无关的外部存储文件来提高性能。When querying an external table, the query engine improves performance by filtering out irrelevant external storage files. 下面介绍了循环访问文件和确定是否应处理某个文件的过程。The process of iterating on files and deciding whether a file should be processed is described below.

  1. 生成一个 URI 模式,用于表示找到文件的位置。Build a URI pattern that represents a place where files are found. 最初,URI 模式等于作为外部表定义的一部分提供的连接字符串。Initially, the URI pattern equals a connection string provided as part of the external table definition. 如果定义了任何分区,则它们将使用 PathFormat 呈现,然后追加到 URI 模式。If there are any partitions defined, they are rendered using PathFormat, then appended to the URI pattern.

  2. 对于在创建的 URI 模式下找到的所有文件,请检查:For all files found under the URI patterns(s) created, check:

    • 分区值是否与查询中使用的谓词匹配。Partition values match predicates used in a query.
    • 如果定义了这样的属性,则 Blob 名称以 NamePrefix 开头。Blob name starts with NamePrefix, if such a property is defined.
    • 如果定义了这样的属性,则 Blob 名称以 FileExtension 结尾。Blob name ends with FileExtension, if such a property is defined.

满足所有条件时,查询引擎将提取并处理该文件。Once all the conditions are met, the file is fetched and processed by the query engine.

备注

初始 URI 模式是使用查询谓词值构建的。Initial URI pattern is built using query predicate values. 这最适用于有限的字符串值集以及闭合时间范围。This works best for a limited set of string values as well as for a closed time ranges.

.show external table artifacts.show external table artifacts

返回查询给定外部表时将处理的所有文件的列表。Returns a list of all files that will be processed when querying a given external table.

备注

此操作需要数据库用户权限The operation requires database user permission.

语法:Syntax:

.show external table TableName artifacts [limit MaxResults].show external table TableName artifacts [limit MaxResults]

其中,MaxResults 是可选参数,可设置此参数来限制结果数。where MaxResults is an optional parameter, which can be set to limit the number of results.

输出Output

输出参数Output parameter 类型Type 描述Description
URIUri stringstring 外部存储数据文件的 URIURI of external storage data file
大小Size longlong 文件长度(以字节为单位)File length in bytes
分区Partition 动态dynamic 描述分区外部表的文件分区的动态对象Dynamic object describing file partitions for partitioned external table

提示

循环访问外部表引用的所有文件可能开销会非常高,具体取决于文件的数量。Iterating on all files referenced by an external table can be quite costly, depending on the number of files. 如果只想查看某些 URI 示例,请确保使用 limit 参数。Make sure to use limit parameter if you just want to see some URI examples.

示例:Examples:

.show external table T artifacts

输出:Output:

UriUri 大小Size 分区Partition
https://storageaccount.blob.core.chinacloudapi.cn/container1/folder/file.csv 1074310743 {}

对于已分区的表,Partition 列将包含提取的分区值:For partitioned table, Partition column will contain extracted partition values:

输出:Output:

UriUri 大小Size 分区Partition
https://storageaccount.blob.core.chinacloudapi.cn/container1/customer=john.doe/dt=20200101/file.csv 1074310743 {"Customer": "john.doe", "Date": "2020-01-01T00:00:00.0000000Z"}

.create external table mapping.create external table mapping

.create external table ExternalTableName mapping MappingName MappingInJsonFormat.create external table ExternalTableName mapping MappingName MappingInJsonFormat

创建新映射。Creates a new mapping. 有关详细信息,请参阅数据映射For more information, see Data Mappings.

示例Example

.create external table MyExternalTable mapping "Mapping1" '[{"Column": "rownumber", "Properties": {"Path": "$.rownumber"}}, {"Column": "rowguid", "Properties": {"Path": "$.rowguid"}}]'

示例输出Example output

名称Name 种类Kind 映射Mapping
mapping1mapping1 JSONJSON [{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}][{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}]

.alter external table mapping.alter external table mapping

.alter external table ExternalTableName mapping MappingName MappingInJsonFormat.alter external table ExternalTableName mapping MappingName MappingInJsonFormat

更改现有映射。Alters an existing mapping.

示例Example

.alter external table MyExternalTable mapping "Mapping1" '[{"Column": "rownumber", "Properties": {"Path": "$.rownumber"}}, {"Column": "rowguid", "Properties": {"Path": "$.rowguid"}}]'

示例输出Example output

名称Name 种类Kind 映射Mapping
mapping1mapping1 JSONJSON [{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}][{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}]

.show external table mappings.show external table mappings

.show external table ExternalTableName mapping MappingName.show external table ExternalTableName mapping MappingName

.show external table ExternalTableName mappings.show external table ExternalTableName mappings

显示映射(所有映射或按名称指定的某个映射)。Show the mappings (all or the one specified by name).

示例Example

.show external table MyExternalTable mapping "Mapping1" 

.show external table MyExternalTable mappings 

示例输出Example output

名称Name 种类Kind 映射Mapping
mapping1mapping1 JSONJSON [{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}][{"ColumnName":"rownumber","Properties":{"Path":"$.rownumber"}},{"ColumnName":"rowguid","Properties":{"Path":"$.rowguid"}}]

.drop external table mapping.drop external table mapping

.drop external table ExternalTableName mapping MappingName.drop external table ExternalTableName mapping MappingName

从数据库中删除映射。Drops the mapping from the database.

示例Example

.drop external table MyExternalTable mapping "Mapping1" 

后续步骤Next steps