Supported file formats and compression codecs in Azure Data Factory (legacy)

APPLIES TO: Azure Data Factory • Azure Synapse Analytics (Preview)

This article applies to the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

Important

Data Factory introduced a new format-based dataset model; see the corresponding format article for details:
- Avro format
- Binary format
- Delimited text format
- JSON format
- ORC format
- Parquet format
The remaining configurations mentioned in this article are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

Text format (legacy)

Note

Learn about the new model from the Delimited text format article. The following configurations on file-based data store datasets are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

If you want to read from or write to a text file, set the type property in the format section of the dataset to TextFormat. You can also specify the following optional properties in the format section. See the TextFormat example section for how to configure them.

columnDelimiter
  Description: The character used to separate columns in a file. Consider using a rare, unprintable character that is unlikely to exist in your data; for example, specify "\u0001", which represents Start of Heading (SOH).
  Allowed values: Only one character is allowed. The default value is comma (','). To use a Unicode character, refer to Unicode Characters to get the corresponding code for it.
  Required: No

rowDelimiter
  Description: The character used to separate rows in a file.
  Allowed values: Only one character is allowed. The default value is any of ["\r\n", "\r", "\n"] on read and "\r\n" on write.
  Required: No

escapeChar
  Description: The special character used to escape a column delimiter in the content of the input file. You cannot specify both escapeChar and quoteChar for a table.
  Allowed values: Only one character is allowed. No default value. Example: if you have comma (',') as the column delimiter but want the comma character in the text (for example, "Hello, world"), you can define '$' as the escape character and use the string "Hello$, world" in the source.
  Required: No

quoteChar
  Description: The character used to quote a string value. The column and row delimiters inside the quote characters are treated as part of the string value. This property applies to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table.
  Allowed values: Only one character is allowed. No default value. Example: if you have comma (',') as the column delimiter but want the comma character in the text (for example, <Hello, world>), you can define " (double quote) as the quote character and use the string "Hello, world" in the source.
  Required: No

nullValue
  Description: One or more characters used to represent a null value.
  Allowed values: One or more characters. The default values are "\N" and "NULL" on read and "\N" on write.
  Required: No

encodingName
  Description: Specify the encoding name.
  Allowed values: A valid encoding name; see the Encoding.EncodingName property. Example: windows-1250 or shift_jis. The default value is UTF-8.
  Required: No

firstRowAsHeader
  Description: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
  Allowed values: True, False (default)
  Required: No

skipLineCount
  Description: Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first, and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
  Allowed values: Integer
  Required: No

treatEmptyAsNull
  Description: Specifies whether to treat a null or empty string as a null value when reading data from an input file.
  Allowed values: True (default), False
  Required: No
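For instance, a format section that uses the SOH character as the column delimiter would contain the following (a minimal sketch; the full dataset context appears in the TextFormat example below):

"format":
{
    "type": "TextFormat",
    "columnDelimiter": "\u0001"
}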

TextFormat example

In the following JSON definition for a dataset, some of the optional properties are specified.

"typeProperties":
{
    "folderPath": "mycontainer/myfolder",
    "fileName": "myblobname",
    "format":
    {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": ";",
        "quoteChar": "\"",
        "NullValue": "NaN",
        "firstRowAsHeader": true,
        "skipLineCount": 0,
        "treatEmptyAsNull": true
    }
},

To use an escapeChar instead of quoteChar, replace the quoteChar line with the following escapeChar:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount

  • You are copying from a non-file source to a text file and would like to add a header line containing the schema metadata (for example, a SQL schema). Specify firstRowAsHeader as true in the output dataset for this scenario.
  • You are copying from a text file containing a header line to a non-file sink and would like to drop that line. Specify firstRowAsHeader as true in the input dataset.
  • You are copying from a text file and want to skip a few lines at the beginning that contain no data or header information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file contains a header line, you can also specify firstRowAsHeader. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first, and then the header information is read from the input file (see the sketch after this list).
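A minimal sketch of the last scenario, assuming a hypothetical input file whose first two lines are a free-text preamble followed by a header line:

"format":
{
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 2,
    "firstRowAsHeader": true
}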

JSON format (legacy)

Note

Learn about the new model from the JSON format article. The following configurations on file-based data store datasets are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

To import/export a JSON file as-is into/from Azure Cosmos DB, see the Import/export JSON documents section in the Move data to/from Azure Cosmos DB article.

If you want to parse JSON files or write data in JSON format, set the type property in the format section to JsonFormat. You can also specify the following optional properties in the format section. See the JsonFormat example section for how to configure them.

filePattern
  Description: Indicates the pattern of data stored in each JSON file. Allowed values are setOfObjects and arrayOfObjects. The default value is setOfObjects. See the JSON file patterns section for details about these patterns.
  Required: No

jsonNodeReference
  Description: If you want to iterate and extract data from objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files.
  Required: No

jsonPathDefinition
  Description: Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from objects or arrays. For fields under the root object, start with root $; for fields inside the array chosen by the jsonNodeReference property, start from the array element. See the JsonFormat example section for how to configure it.
  Required: No

encodingName
  Description: Specify the encoding name. For the list of valid encoding names, see the Encoding.EncodingName property. For example: windows-1250 or shift_jis. The default value is UTF-8.
  Required: No

nestingSeparator
  Description: The character used to separate nesting levels. The default value is '.' (dot).
  Required: No

Note

For the case of cross-applying data in an array into multiple rows (case 1 -> sample 2 in the JsonFormat examples), you can only choose to expand a single array, using the property jsonNodeReference.

JSON file patterns

Copy activity can parse the following patterns of JSON files:

  • Type I: setOfObjects

    Each file contains a single object, or multiple line-delimited or concatenated objects. When this option is chosen in an output dataset, the copy activity produces a single JSON file with one object per line (line-delimited).

    • single object JSON example

      {
          "time": "2015-04-29T07:12:20.9100000Z",
          "callingimsi": "466920403025604",
          "callingnum1": "678948008",
          "callingnum2": "567834760",
          "switch1": "China",
          "switch2": "Germany"
      }
      
    • line-delimited JSON example

      {"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","switch1":"China","switch2":"Germany"}
      {"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","switch1":"US","switch2":"UK"}
      {"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","switch1":"Germany","switch2":"UK"}
      
    • concatenated JSON example

      {
          "time": "2015-04-29T07:12:20.9100000Z",
          "callingimsi": "466920403025604",
          "callingnum1": "678948008",
          "callingnum2": "567834760",
          "switch1": "China",
          "switch2": "Germany"
      }
      {
          "time": "2015-04-29T07:13:21.0220000Z",
          "callingimsi": "466922202613463",
          "callingnum1": "123436380",
          "callingnum2": "789037573",
          "switch1": "US",
          "switch2": "UK"
      }
      {
          "time": "2015-04-29T07:13:21.4370000Z",
          "callingimsi": "466923101048691",
          "callingnum1": "678901578",
          "callingnum2": "345626404",
          "switch1": "Germany",
          "switch2": "UK"
      }
      
  • Type II: arrayOfObjects

    Each file contains an array of objects.

    [
        {
            "time": "2015-04-29T07:12:20.9100000Z",
            "callingimsi": "466920403025604",
            "callingnum1": "678948008",
            "callingnum2": "567834760",
            "switch1": "China",
            "switch2": "Germany"
        },
        {
            "time": "2015-04-29T07:13:21.0220000Z",
            "callingimsi": "466922202613463",
            "callingnum1": "123436380",
            "callingnum2": "789037573",
            "switch1": "US",
            "switch2": "UK"
        },
        {
            "time": "2015-04-29T07:13:21.4370000Z",
            "callingimsi": "466923101048691",
            "callingnum1": "678901578",
            "callingnum2": "345626404",
            "switch1": "Germany",
            "switch2": "UK"
        }
    ]
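To read or write files in this pattern rather than the default, set filePattern in the format section (a minimal sketch):

"format":
{
    "type": "JsonFormat",
    "filePattern": "arrayOfObjects"
}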
    

JsonFormat example

Case 1: Copying data from JSON files

Sample 1: extract data from object and array

In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file with the following content:

{
    "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
    "context": {
        "device": {
            "type": "PC"
        },
        "custom": {
            "dimensions": [
                {
                    "TargetResourceType": "Microsoft.Compute/virtualMachines"
                },
                {
                    "ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
                },
                {
                    "OccurrenceTime": "1/13/2017 11:24:37 AM"
                }
            ]
        }
    }
}

and you want to copy it into an Azure SQL table in the following format, by extracting data from both the objects and the array:

ID | deviceType | targetResourceType | resourceManagementProcessRunId | occurrenceTime
ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically:

  • The structure section defines the customized column names and the corresponding data types used while converting to tabular data. This section is optional unless you need to do column mapping. For more information, see Map source dataset columns to destination dataset columns.
  • jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To copy data from an array, you can use array[x].property to extract the value of the given property from the xth object, or use array[*].property to find the value from any object containing such a property (see the variant after the example below).
"properties": {
    "structure": [
        {
            "name": "id",
            "type": "String"
        },
        {
            "name": "deviceType",
            "type": "String"
        },
        {
            "name": "targetResourceType",
            "type": "String"
        },
        {
            "name": "resourceManagementProcessRunId",
            "type": "String"
        },
        {
            "name": "occurrenceTime",
            "type": "DateTime"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat",
            "filePattern": "setOfObjects",
            "jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType": "$.context.custom.dimensions[0].TargetResourceType", "resourceManagementProcessRunId": "$.context.custom.dimensions[1].ResourceManagementProcessRunId", "occurrenceTime": " $.context.custom.dimensions[2].OccurrenceTime"}
        }
    }
}
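If the position of the properties inside the dimensions array were not fixed, you could use the array[*].property form mentioned above instead of indexing. A sketch of just the path mapping, assuming each property name appears in only one array element:

"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType": "$.context.custom.dimensions[*].TargetResourceType", "resourceManagementProcessRunId": "$.context.custom.dimensions[*].ResourceManagementProcessRunId", "occurrenceTime": "$.context.custom.dimensions[*].OccurrenceTime"}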

Sample 2: cross apply multiple objects with the same pattern from array

In this sample, you expect to transform one root JSON object into multiple records in the tabular result. If you have a JSON file with the following content:

{
    "ordernumber": "01",
    "orderdate": "20170122",
    "orderlines": [
        {
            "prod": "p1",
            "price": 23
        },
        {
            "prod": "p2",
            "price": 13
        },
        {
            "prod": "p3",
            "price": 231
        }
    ],
    "city": [ { "sanmateo": "No 1" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and cross-joining it with the common root info:

ordernumber | orderdate | order_pd | order_price | city
01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}]
01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}]
01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}]

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically:

  • The structure section defines the customized column names and the corresponding data types used while converting to tabular data. This section is optional unless you need to do column mapping. For more information, see Map source dataset columns to destination dataset columns.
  • jsonNodeReference indicates iterating over and extracting data from the objects with the same pattern under the array orderlines.
  • jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. In this example, ordernumber, orderdate, and city are under the root object, with JSON paths starting with $., while order_pd and order_price are defined with paths derived from the array element, without $..
"properties": {
    "structure": [
        {
            "name": "ordernumber",
            "type": "String"
        },
        {
            "name": "orderdate",
            "type": "String"
        },
        {
            "name": "order_pd",
            "type": "String"
        },
        {
            "name": "order_price",
            "type": "Int64"
        },
        {
            "name": "city",
            "type": "String"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat",
            "filePattern": "setOfObjects",
            "jsonNodeReference": "$.orderlines",
            "jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd": "prod", "order_price": "price", "city": " $.city"}
        }
    }
}

Note the following points:

  • If structure and jsonPathDefinition are not defined in the Data Factory dataset, the copy activity detects the schema from the first object and flattens the whole object.
  • If the JSON input has an array, by default the copy activity converts the entire array value into a string. You can choose to extract data from it by using jsonNodeReference and/or jsonPathDefinition, or skip it by not specifying it in jsonPathDefinition.
  • If there are duplicate names at the same level, the copy activity picks the last one.
  • Property names are case-sensitive. Two properties with the same name but different casing are treated as two separate properties.

Case 2: Writing data to JSON file

If you have the following table in SQL Database:

ID | order_date | order_price | order_by
1 | 20170119 | 2000 | David
2 | 20170120 | 3500 | Patrick
3 | 20170121 | 4000 | Jason

and for each record, you expect to write to a JSON object in the following format:

{
    "id": "1",
    "order": {
        "date": "20170119",
        "price": 2000,
        "customer": "David"
    }
}

The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More specifically, the structure section defines the customized property names in the destination file, and nestingSeparator (default is ".") is used to identify the nesting level from the name. This section is optional unless you want to change the property names compared with the source column names, or to nest some of the properties.

"properties": {
    "structure": [
        {
            "name": "id",
            "type": "String"
        },
        {
            "name": "order.date",
            "type": "String"
        },
        {
            "name": "order.price",
            "type": "Int64"
        },
        {
            "name": "order.customer",
            "type": "String"
        }
    ],
    "typeProperties": {
        "folderPath": "mycontainer/myfolder",
        "format": {
            "type": "JsonFormat"
        }
    }
}

Parquet format (legacy)

Note

Learn about the new model from the Parquet format article. The following configurations on file-based data store datasets are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

If you want to parse Parquet files or write data in Parquet format, set the format type property to ParquetFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

"format":
{
    "type": "ParquetFormat"
}

Note the following points:

  • Complex data types are not supported (MAP, LIST).
  • White space in column names is not supported.
  • A Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory supports reading data from a Parquet file in any of these compressed formats except LZO; it uses the compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory chooses SNAPPY, which is the default for the Parquet format. Currently, there is no option to override this behavior.

Important

For copies empowered by the self-hosted integration runtime, for example between on-premises and cloud data stores, if you are not copying Parquet files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine. See the following paragraph for more details.

For copies running on a self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE and, if not found, then checking the system variable JAVA_HOME for OpenJDK.

  • To use JRE: The 64-bit IR requires 64-bit JRE. You can find it here.
  • To use OpenJDK: It's supported since IR version 3.13. Package jvm.dll with all other required assemblies of OpenJDK into the self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.

Tip

If you copy data to/from Parquet format using the self-hosted integration runtime and hit an error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable _JAVA_OPTIONS on the machine that hosts the self-hosted IR to adjust the minimum/maximum heap size for the JVM to empower such copies, and then rerun the pipeline.

(Screenshot: Set JVM heap size on self-hosted IR)

Example: set the variable _JAVA_OPTIONS to the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for a Java virtual machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM starts with Xms amount of memory and can use a maximum of Xmx amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.

Data type mapping for Parquet files

Data Factory interim data type | Parquet primitive type | Parquet original type (deserialize) | Parquet original type (serialize)
Boolean | Boolean | N/A | N/A
SByte | Int32 | Int8 | Int8
Byte | Int32 | UInt8 | Int16
Int16 | Int32 | Int16 | Int16
UInt16 | Int32 | UInt16 | Int32
Int32 | Int32 | Int32 | Int32
UInt32 | Int64 | UInt32 | Int64
Int64 | Int64 | Int64 | Int64
UInt64 | Int64/Binary | UInt64 | Decimal
Single | Float | N/A | N/A
Double | Double | N/A | N/A
Decimal | Binary | Decimal | Decimal
String | Binary | Utf8 | Utf8
DateTime | Int96 | N/A | N/A
TimeSpan | Int96 | N/A | N/A
DateTimeOffset | Int96 | N/A | N/A
ByteArray | Binary | N/A | N/A
Guid | Binary | Utf8 | Utf8
Char | Binary | Utf8 | Utf8
CharArray | Not supported | N/A | N/A

ORC format (legacy)

Note

Learn about the new model from the ORC format article. The following configurations on file-based data store datasets are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

If you want to parse ORC files or write data in ORC format, set the format type property to OrcFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

"format":
{
    "type": "OrcFormat"
}

Note the following points:

  • Complex data types are not supported (STRUCT, MAP, LIST, UNION).
  • White space in column names is not supported.
  • An ORC file has three compression-related options: NONE, ZLIB, SNAPPY. Data Factory supports reading data from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC. Currently, there is no option to override this behavior.

Important

For copies empowered by the self-hosted integration runtime, for example between on-premises and cloud data stores, if you are not copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine. See the following paragraph for more details.

For copies running on a self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE and, if not found, then checking the system variable JAVA_HOME for OpenJDK.

  • To use JRE: The 64-bit IR requires 64-bit JRE. You can find it here.
  • To use OpenJDK: It's supported since IR version 3.13. Package jvm.dll with all other required assemblies of OpenJDK into the self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.

Data type mapping for ORC files

Data Factory interim data type | ORC types
Boolean | Boolean
SByte | Byte
Byte | Short
Int16 | Short
UInt16 | Int
Int32 | Int
UInt32 | Long
Int64 | Long
UInt64 | String
Single | Float
Double | Double
Decimal | Decimal
String | String
DateTime | Timestamp
DateTimeOffset | Timestamp
TimeSpan | Timestamp
ByteArray | Binary
Guid | String
Char | Char(1)

AVRO format (legacy)

Note

Learn about the new model from the Avro format article. The following configurations on file-based data store datasets are still supported as-is for backward compatibility. We recommend that you use the new model going forward.

If you want to parse Avro files or write data in Avro format, set the format type property to AvroFormat. You do not need to specify any properties in the format section within the typeProperties section. Example:

"format":
{
    "type": "AvroFormat",
}

To use Avro format in a Hive table, you can refer to Apache Hive's tutorial.

Note the following points:

  • Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).

Compression support (legacy)

Azure Data Factory supports compressing and decompressing data during copy. When you specify the compression property in an input dataset, the copy activity reads the compressed data from the source and decompresses it; when you specify the property in an output dataset, the copy activity compresses the data and then writes it to the sink. Here are a few sample scenarios:

  • Read GZIP-compressed data from an Azure blob, decompress it, and write the resulting data to Azure SQL Database. You define the input Azure Blob dataset with the compression type property set to GZIP.
  • Read data from a plain-text file on an on-premises file system, compress it using GZIP format, and write the compressed data to an Azure blob. You define the output Azure Blob dataset with the compression type property set to GZIP.
  • Read GZIP-compressed data from an Azure blob, decompress it, compress it using BZIP2, and write the resulting data to an Azure blob. You define the input Azure Blob dataset with the compression type set to GZIP and the output dataset with the compression type set to BZIP2 (see the sketch after the compression property list below).

To specify compression for a dataset, use the compression property in the dataset JSON, as in the following example:

{
    "name": "AzureBlobDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "fileName": "pagecounts.csv.gz",
            "folderPath": "compression/file/",
            "format": {
                "type": "TextFormat"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}

The compression section has two properties:

  • type: The compression codec, which can be GZIP, Deflate, BZIP2, or ZipDeflate. Note that when you use the copy activity to decompress ZipDeflate file(s) and write to a file-based sink data store, the files are extracted to the folder <path specified in dataset>/<folder named as source zip file>/.

  • level: The compression ratio, which can be Optimal or Fastest.

    • Fastest: The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed.

    • Optimal: The compression operation should compress optimally, even if it takes longer to complete.

      For more information, see the Compression Level topic.
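As a sketch of the third scenario above, the output dataset carries its own compression section. The file and folder names here are illustrative, and the codec value casing follows the GZip example above:

"typeProperties": {
    "fileName": "pagecounts.csv.bz2",
    "folderPath": "compression/file/",
    "format": {
        "type": "TextFormat"
    },
    "compression": {
        "type": "BZip2",
        "level": "Fastest"
    }
}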

Note

Compression settings are not supported for data in AvroFormat, OrcFormat, or ParquetFormat. When reading files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in these formats, Data Factory chooses the default compression codec for that format; for example, ZLIB for OrcFormat and SNAPPY for ParquetFormat.

Unsupported file types and compression formats

You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options include Azure Functions and custom tasks by using Azure Batch.

You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see the Azure Functions activity.

You can also build this functionality by using a custom .NET activity. Further information is available here.

Next steps

Learn about the latest supported file formats and compressions from Supported file formats and compressions.