Delimited text format in Azure Data Factory

Applies to: Azure Data Factory, Azure Synapse Analytics (Preview)

Follow this article when you want to parse delimited text files or write data in delimited text format.

Delimited text format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section lists the properties supported by the delimited text dataset.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to DelimitedText. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. | Yes |
| columnDelimiter | The character(s) used to separate columns in a file. The default value is comma (,). When the column delimiter is defined as an empty string, which means no delimiter, the whole line is taken as a single column. | No |
| rowDelimiter | The single character or "\r\n" used to separate rows in a file. On read, the default is any of the following values: ["\r\n", "\r", "\n"]. On write by Copy activity, the default is "\n" or "\r\n". | No |
| quoteChar | The single character used to quote a column value if it contains the column delimiter. The default value is double quotes ("). For Copy activity, when quoteChar is defined as an empty string, there is no quote character and column values are not quoted; escapeChar is then used to escape the column delimiter and itself. | No |
| escapeChar | The single character used to escape quotes inside a quoted value. The default value is backslash (\). For Copy activity, when escapeChar is defined as an empty string, quoteChar must be set to an empty string as well; in that case, make sure no column value contains delimiters. | No |
| firstRowAsHeader | Specifies whether to treat/make the first row as a header line with names of columns. Allowed values are true and false (default). | No |
| nullValue | Specifies the string representation of a null value. The default value is an empty string. | No |
| encodingName | The encoding type used to read/write text files. Allowed values are as follows: "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". | No |
| compressionCodec | The compression codec used to read/write text files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, snappy, or lz4. The default is no compression. Note: Copy activity currently doesn't support "snappy" and "lz4". Note: when you use Copy activity to decompress ZipDeflate/TarGzip file(s) and write to a file-based sink data store, files are extracted by default to the folder <path specified in dataset>/<folder named as source compressed file>/. Use preserveZipFileNameAsFolder/preserveCompressionFileNameAsFolder on the copy activity source to control whether the name of the compressed file(s) is preserved as folder structure. | No |
| compressionLevel | The compression ratio. Allowed values are Optimal or Fastest. Fastest: the compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. Optimal: the compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see the Compression Level topic. | No |

Below is an example of a delimited text dataset on Azure Blob Storage:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "escapeChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}
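
As a further sketch, the quoteChar and escapeChar rules in the table above can be combined to describe a tab-delimited file that uses no quoting at all. The dataset name, the "NULL" marker, and the container and folder values are hypothetical placeholders; only the property names come from the table. With quoteChar set to an empty string, escapeChar (here a backslash) escapes the column delimiter and itself.

```json
{
    "name": "TabDelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": "\t",
            "rowDelimiter": "\n",
            "quoteChar": "",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "nullValue": "NULL",
            "encodingName": "UTF-8"
        }
    }
}
```

The schema section is omitted here because it is optional.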

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section lists the properties supported by the delimited text source and sink.

Delimited text as source

The following properties are supported in the copy activity *source* section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to DelimitedTextSource. | Yes |
| formatSettings | A group of properties. Refer to the Delimited text read settings table below. | No |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. | No |

Supported delimited text read settings under formatSettings:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of formatSettings must be set to DelimitedTextReadSettings. | Yes |
| skipLineCount | Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. | No |
| compressionProperties | A group of properties on how to decompress data for a given compression codec. | No |
| preserveZipFileNameAsFolder (under compressionProperties->type as ZipDeflateReadSettings) | Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy. When set to true (default), Data Factory writes unzipped files to <path specified in dataset>/<folder named as source zip file>/. When set to false, Data Factory writes unzipped files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source zip files, to avoid race conditions or unexpected behavior. | No |
| preserveCompressionFileNameAsFolder (under compressionProperties->type as TarGZipReadSettings) | Applies when the input dataset is configured with TarGzip compression. Indicates whether to preserve the source compressed file name as folder structure during copy. When set to true (default), Data Factory writes decompressed files to <path specified in dataset>/<folder named as source compressed file>/. When set to false, Data Factory writes decompressed files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source files, to avoid race conditions or unexpected behavior. | No |

"activities": [
    {
        "name": "CopyFromDelimitedText",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                },
                "formatSettings": {
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 3,
                    "compressionProperties": {
                        "type": "ZipDeflateReadSettings",
                        "preserveZipFileNameAsFolder": false
                    }
                }
            },
            ...
        }
        ...
    }
]
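
For a TarGzip-compressed input, the read settings table suggests the source section would take a similar shape, with TarGZipReadSettings in place of ZipDeflateReadSettings. The fragment below is a sketch under that assumption, showing only the source object; the storeSettings values are carried over from the example above.

```json
{
    "source": {
        "type": "DelimitedTextSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true
        },
        "formatSettings": {
            "type": "DelimitedTextReadSettings",
            "compressionProperties": {
                "type": "TarGZipReadSettings",
                "preserveCompressionFileNameAsFolder": true
            }
        }
    }
}
```

With preserveCompressionFileNameAsFolder left at true, decompressed files land under a folder named after the source compressed file, which avoids name collisions when several archives contain files with the same name.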

Delimited text as sink

The following properties are supported in the copy activity *sink* section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity sink must be set to DelimitedTextSink. | Yes |
| formatSettings | A group of properties. Refer to the Delimited text write settings table below. | No |
| storeSettings | A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under storeSettings. | No |

Supported delimited text write settings under formatSettings:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of formatSettings must be set to DelimitedTextWriteSettings. | Yes |
| fileExtension | The file extension used to name the output files, for example, .csv or .txt. It must be specified when fileName is not specified in the output DelimitedText dataset. When a file name is configured in the output dataset, it is used as the sink file name and the file extension setting is ignored. | Yes when file name is not specified in output dataset |
| maxRowsPerFile | When writing data into a folder, you can choose to write to multiple files and specify the maximum number of rows per file. | No |
| fileNamePrefix | Applicable when maxRowsPerFile is configured. Specifies the file name prefix when writing data to multiple files, resulting in this pattern: <fileNamePrefix>_00000.<fileExtension>. If not specified, the file name prefix is auto-generated. This property does not apply when the source is a file-based store or a partition-option-enabled data store. | No |
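
A sink section that uses these write settings might look like the sketch below. It mirrors the structure of the source example; the storeSettings type AzureBlobStorageWriteSettings is an assumption (the write-side counterpart of the read settings shown earlier), and the row limit and prefix are hypothetical values.

```json
{
    "sink": {
        "type": "DelimitedTextSink",
        "storeSettings": {
            "type": "AzureBlobStorageWriteSettings"
        },
        "formatSettings": {
            "type": "DelimitedTextWriteSettings",
            "fileExtension": ".csv",
            "maxRowsPerFile": 1000000,
            "fileNamePrefix": "part"
        }
    }
}
```

Because maxRowsPerFile is set, the output is split across several files whose names follow the fileNamePrefix pattern described in the table, provided the source is not a file-based store or a partition-option-enabled data store.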

Next steps