Delimited text format in Azure Data Factory
Applies to: Azure Data Factory, Azure Synapse Analytics
Follow this article when you want to parse delimited text files or write data into delimited text format.
Delimited text format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the delimited text dataset.
Property | Description | Required |
---|---|---|
type | The type property of the dataset must be set to DelimitedText. | Yes |
location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under `location`. | Yes |
columnDelimiter | The character(s) used to separate columns in a file. The default value is comma `,`. When the column delimiter is defined as empty string (meaning no delimiter), the whole line is taken as a single column. Currently, column delimiter as empty string or multi-char is only supported for mapping data flow but not Copy activity. | No |
rowDelimiter | The single character or "\r\n" used to separate rows in a file. The default value is any of the following values on read: ["\r\n", "\r", "\n"]; and "\n" or "\r\n" on write by mapping data flow and Copy activity respectively. When the row delimiter is set to no delimiter (empty string), the column delimiter must be set as no delimiter (empty string) as well, which means to treat the entire content as a single value. Currently, row delimiter as empty string is only supported for mapping data flow but not Copy activity. | No |
quoteChar | The single character used to quote column values if they contain the column delimiter. The default value is double quotes `"`. When quoteChar is defined as empty string, it means there is no quote char and column values are not quoted, and escapeChar is used to escape the column delimiter and itself. | No |
escapeChar | The single character used to escape quotes inside a quoted value. The default value is backslash `\`. When escapeChar is defined as empty string, quoteChar must be set as empty string as well, in which case make sure all column values don't contain delimiters. | No |
firstRowAsHeader | Specifies whether to treat/make the first row as a header line with names of columns. Allowed values are true and false (default). When first row as header is false, note that UI data preview and Lookup activity output auto-generate column names as Prop_{n} (starting from 0), Copy activity requires explicit mapping from source to sink and locates columns by ordinal (starting from 1), and mapping data flow lists and locates columns with the name Column_{n} (starting from 1). | No |
nullValue | Specifies the string representation of null value. The default value is empty string. | No |
encodingName | The encoding type used to read/write text files. Allowed values are as follows: "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". Note that mapping data flow doesn't support UTF-7 encoding. | No |
compressionCodec | The compression codec used to read/write text files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. The default is not compressed. Note that currently Copy activity doesn't support "snappy" and "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip", and "Tar". Note that when using the copy activity to decompress ZipDeflate/TarGzip/Tar file(s) and write to a file-based sink data store, by default files are extracted to the folder `<path specified in dataset>/<folder named as source compressed file>/`; use preserveZipFileNameAsFolder/preserveCompressionFileNameAsFolder on the copy activity source to control whether to preserve the name of the compressed file(s) as folder structure. | No |
compressionLevel | The compression ratio. Allowed values are Optimal or Fastest. - Fastest: The compression operation completes as quickly as possible, even if the resulting file is not optimally compressed. - Optimal: The compression operation is optimally compressed, even if it takes a longer time to complete. For more information, see the Compression Level topic. | No |
Below is an example of a delimited text dataset on Azure Blob Storage:
```json
{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "escapeChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}
```
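The example above relies on the default row delimiter, encoding, and null handling. As a minimal sketch (an illustrative variant, not taken from the product documentation), the following dataset overrides more of the defaults; the dataset name, delimiter, null marker, encoding, and compression values are hypothetical choices, placed in typeProperties alongside the other properties listed in the table above:

```json
{
    "name": "DelimitedTextDatasetCustomized",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ";",
            "rowDelimiter": "\n",
            "quoteChar": "\"",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "nullValue": "NULL",
            "encodingName": "UTF-8",
            "compressionCodec": "gzip",
            "compressionLevel": "Optimal"
        }
    }
}
```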
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the delimited text source and sink.
Delimited text as source
The following properties are supported in the copy activity **source** section.
Property | Description | Required |
---|---|---|
type | The type property of the copy activity source must be set to DelimitedTextSource. | Yes |
formatSettings | A group of properties. Refer to the Delimited text read settings table below. | No |
storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under `storeSettings`. | No |

Supported delimited text read settings under `formatSettings`:
Property | Description | Required |
---|---|---|
type | The type of formatSettings must be set to DelimitedTextReadSettings. | Yes |
skipLineCount | Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. | No |
compressionProperties | A group of properties on how to decompress data for a given compression codec. | No |
preserveZipFileNameAsFolder (under `compressionProperties` -> `type` as `ZipDeflateReadSettings`) | Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy. - When set to true (default), Data Factory writes unzipped files to `<path specified in dataset>/<folder named as source zip file>/`. - When set to false, Data Factory writes unzipped files directly to `<path specified in dataset>`. Make sure you don't have duplicated file names in different source zip files to avoid racing or unexpected behavior. | No |
preserveCompressionFileNameAsFolder (under `compressionProperties` -> `type` as `TarGZipReadSettings` or `TarReadSettings`) | Applies when the input dataset is configured with TarGzip/Tar compression. Indicates whether to preserve the source compressed file name as folder structure during copy. - When set to true (default), Data Factory writes decompressed files to `<path specified in dataset>/<folder named as source compressed file>/`. - When set to false, Data Factory writes decompressed files directly to `<path specified in dataset>`. Make sure you don't have duplicated file names in different source files to avoid racing or unexpected behavior. | No |

```json
"activities": [
    {
        "name": "CopyFromDelimitedText",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                },
                "formatSettings": {
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 3,
                    "compressionProperties": {
                        "type": "ZipDeflateReadSettings",
                        "preserveZipFileNameAsFolder": false
                    }
                }
            },
            ...
        }
        ...
    }
]
```
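For comparison, here is a hedged sketch of the same formatSettings block when the input dataset is configured with TarGzip compression instead of ZipDeflate, using the preserveCompressionFileNameAsFolder setting from the read-settings table above (the surrounding source and storeSettings are assumed to be the same as in the example):

```json
"formatSettings": {
    "type": "DelimitedTextReadSettings",
    "compressionProperties": {
        "type": "TarGZipReadSettings",
        "preserveCompressionFileNameAsFolder": false
    }
}
```

With preserveCompressionFileNameAsFolder set to false, the decompressed files are written directly to the path specified in the dataset rather than to a subfolder named after the source archive.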
Delimited text as sink
The following properties are supported in the copy activity **sink** section.
Property | Description | Required |
---|---|---|
type | The type property of the copy activity sink must be set to DelimitedTextSink. | Yes |
formatSettings | A group of properties. Refer to the Delimited text write settings table below. | No |
storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under `storeSettings`. | No |

Supported delimited text write settings under `formatSettings`:
Property | Description | Required |
---|---|---|
type | The type of formatSettings must be set to DelimitedTextWriteSettings. | Yes |
fileExtension | The file extension used to name the output files, for example, `.csv`, `.txt`. It must be specified when fileName is not specified in the output DelimitedText dataset. When a file name is configured in the output dataset, it is used as the sink file name and the file extension setting is ignored. | Yes when file name is not specified in the output dataset |
maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the maximum rows per file. | No |
fileNamePrefix | Applicable when maxRowsPerFile is configured. Specify the file name prefix when writing data to multiple files, resulting in this pattern: `<fileNamePrefix>_00000.<fileExtension>`. If not specified, the file name prefix is auto-generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No |
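This article does not include a sink-side JSON example, so the following is a minimal, illustrative sketch of a copy activity sink that writes delimited text to Azure Blob Storage and splits the output by row count; the file prefix and row limit are hypothetical values chosen only to show where the write settings from the table above sit:

```json
"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "maxRowsPerFile": 1000000,
        "fileNamePrefix": "output"
    }
}
```

Following the fileNamePrefix pattern described above, the output files would be named output_00000.csv, output_00001.csv, and so on.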
Mapping data flow properties
In mapping data flows, you can read and write to delimited text format in the following data stores: Azure Blob Storage and Azure Data Lake Storage Gen2.
Source properties
The table below lists the properties supported by a delimited text source. You can edit these properties in the Source options tab.
Name | Description | Required | Allowed values | Data flow script property |
---|---|---|---|---|
Wildcard paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | no | String[] | wildcardPaths |
Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | no | String | partitionRootPath |
List of files | Whether your source is pointing to a text file that lists the files to process. | no | true or false | fileList |
Multiline rows | Does the source file contain rows that span multiple lines. Multiline values must be in quotes. | no | true or false | multiLineRow |
Column to store file name | Create a new column with the source file name and path. | no | String | rowUrlColumn |
After completion | Delete or move the files after processing. The file path starts from the container root. | no | Delete: true or false; Move: `['<from>', '<to>']` | purgeFiles, moveFiles |
Filter by last modified | Choose to filter files based upon when they were last altered. | no | Timestamp | modifiedAfter, modifiedBefore |
Allow no files found | If true, an error is not thrown if no files are found. | no | true or false | ignoreNoFilesFound |
Source example
The image below is an example of a delimited text source configuration in mapping data flows.
The associated data flow script is:
```
source(
    allowSchemaDrift: true,
    validateSchema: false,
    multiLineRow: true,
    wildcardPaths:['*.csv']) ~> CSVSource
```
Note: Data flow sources support a limited set of Linux globbing that is supported by Hadoop file systems.
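Building on the example above, here is a hedged sketch of a source that also captures the originating file path in a new column and tolerates an empty folder; the column name and wildcard path are hypothetical, and the property names come from the source-properties table above:

```
source(
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: true,
    rowUrlColumn: 'sourceFile',
    wildcardPaths:['data/*.csv']) ~> CSVSource
```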
Sink properties
The table below lists the properties supported by a delimited text sink. You can edit these properties in the Settings tab.
Name | Description | Required | Allowed values | Data flow script property |
---|---|---|---|---|
Clear the folder | Whether the destination folder is cleared prior to the write. | no | true or false | truncate |
File name option | The naming format of the data written. By default, one file per partition in the format `part-#####-tid-<guid>`. | no | Pattern: String; Per partition: String[]; As data in column: String; Output to single file: `['<fileName>']` | filePattern, partitionFileNames, rowUrlColumn, partitionFileNames |
Quote all | Enclose all values in quotes. | no | true or false | quoteAll |
Sink example
The image below is an example of a delimited text sink configuration in mapping data flows.
The associated data flow script is:
```
CSVSource sink(allowSchemaDrift: true,
    validateSchema: false,
    truncate: true,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> CSVSink
```
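Likewise, a hedged sketch of a sink that quotes every value and writes the result to a single named file via partitionFileNames; the file name is illustrative, and single-file output generally assumes the data has been coalesced into a single partition:

```
CSVSource sink(allowSchemaDrift: true,
    validateSchema: false,
    truncate: true,
    quoteAll: true,
    partitionFileNames:['result.csv'],
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> CSVSink
```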