XML format in Azure Data Factory

Applies to: Azure Data Factory, Azure Synapse Analytics

Follow this article when you want to parse XML files.

XML format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP. It is supported as a source but not as a sink.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the XML dataset.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to Xml. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. | Yes |
| encodingName | The encoding type used to read/write text files. Allowed values are as follows: "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". | No |
| nullValue | Specifies the string representation of a null value. The default value is an empty string. | No |
| compression | Group of properties to configure file compression. Configure this section when you want to do compression/decompression during activity execution. | No |
| type (under compression) | The compression codec used to read/write XML files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. Default is not compressed. Note: currently, Copy activity doesn't support "snappy" and "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip", and "Tar". Note: when you use copy activity to decompress ZipDeflate/TarGzip/Tar file(s) and write to a file-based sink data store, files are extracted by default to the folder <path specified in dataset>/<folder named as source compressed file>/. Use preserveZipFileNameAsFolder/preserveCompressionFileNameAsFolder on the copy activity source to control whether to preserve the name of the compressed file(s) as folder structure. | No |
| level (under compression) | The compression ratio. Allowed values are Optimal or Fastest. Fastest: the compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. Optimal: the compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see the Compression Level topic. | No |
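
The codec restrictions in the compression rows can be checked before a dataset is deployed. The following is a minimal Python sketch, not part of any Data Factory SDK: the names `check_compression`, `ALLOWED_CODECS`, `UNSUPPORTED`, and the activity labels "copy" and "dataflow" are hypothetical, and the sketch assumes codec and level names are matched case-sensitively.

```python
# Codec and level names are taken from the dataset properties table.
# Function, constant, and activity names are illustrative assumptions.
ALLOWED_CODECS = {"bzip2", "gzip", "deflate", "ZipDeflate",
                  "TarGzip", "Tar", "snappy", "lz4"}
ALLOWED_LEVELS = {"Optimal", "Fastest"}
UNSUPPORTED = {
    "copy": {"snappy", "lz4"},                     # Copy activity
    "dataflow": {"ZipDeflate", "TarGzip", "Tar"},  # mapping data flow
}


def check_compression(codec, activity, level=None):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if codec not in ALLOWED_CODECS:
        problems.append(f"unknown codec: {codec}")
    elif codec in UNSUPPORTED[activity]:
        problems.append(f"{codec} is not supported by {activity}")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")
    return problems


print(check_compression("ZipDeflate", "copy", level="Fastest"))
print(check_compression("lz4", "copy"))
print(check_compression("Tar", "dataflow"))
```

An empty list means the combination is allowed by the rules stated in the table; anything else lists the reason it would be rejected.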

Below is an example of an XML dataset on Azure Blob Storage:

{
    "name": "XMLDataset",
    "properties": {
        "type": "Xml",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the XML source.

Learn how to map XML data to the sink data store/format from schema mapping. When previewing XML files, data is shown with a JSON hierarchy, and you use a JSON path to point to the fields.

XML as source

The following properties are supported in the copy activity *source* section. Learn more from XML connector behavior.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to XmlSource. | Yes |
| formatSettings | A group of properties. Refer to the XML read settings table below. | No |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. | No |
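
To show how these three properties fit together, here is a hedged Python sketch that assembles a copy activity source as a dictionary and prints it as JSON. The helper `build_xml_source` is hypothetical, the store settings type `AzureBlobStorageReadSettings` is assumed from the Azure Blob connector rather than stated in this article, and the namespace URI is a placeholder. The keys under formatSettings are the XML read settings that this article lists for formatSettings.

```python
import json


def build_xml_source(store_settings_type, namespace_prefixes=None):
    """Assemble the 'source' section of a copy activity that reads XML."""
    format_settings = {
        "type": "XmlReadSettings",
        "validationMode": "none",   # none | xsd | dtd
        "namespaces": True,
        "detectDataType": True,
    }
    if namespace_prefixes:
        # Namespace URI -> prefix mapping used to name fields.
        format_settings["namespacePrefixes"] = dict(namespace_prefixes)
    return {
        "type": "XmlSource",
        "storeSettings": {"type": store_settings_type},
        "formatSettings": format_settings,
    }


source = build_xml_source(
    "AzureBlobStorageReadSettings",            # assumed connector-specific type
    {"http://www.example.com/xml": "ex"},      # placeholder URI and prefix
)
print(json.dumps(source, indent=4))
```

Only type is required; formatSettings and storeSettings can be left out when the defaults are acceptable.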

Supported XML read settings under formatSettings:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of formatSettings must be set to XmlReadSettings. | Yes |
| validationMode | Specifies whether to validate the XML schema. Allowed values are none (default, no validation), xsd (validate using XSD), and dtd (validate using DTD). | No |
| namespaces | Whether to enable namespace when parsing the XML files. Allowed values are: true (default), false. | No |
| namespacePrefixes | Namespace URI to prefix mapping, which is used to name fields when parsing the XML file. If an XML file has a namespace and namespace is enabled, by default, the field name is the same as it is in the XML document. If there is an item defined for the namespace URI in this map, the field name is prefix:fieldName. | No |
| detectDataType | Whether to detect integer, double, and Boolean data types. Allowed values are: true (default), false. | No |
| compressionProperties | A group of properties on how to decompress data for a given compression codec. | No |
| preserveZipFileNameAsFolder (under compressionProperties->type as ZipDeflateReadSettings) | Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy. When set to true (default), Data Factory writes unzipped files to <path specified in dataset>/<folder named as source zip file>/. When set to false, Data Factory writes unzipped files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source zip files, to avoid race conditions or unexpected behavior. | No |
| preserveCompressionFileNameAsFolder (under compressionProperties->type as TarGZipReadSettings or TarReadSettings) | Applies when the input dataset is configured with TarGzip/Tar compression. Indicates whether to preserve the source compressed file name as folder structure during copy. When set to true (default), Data Factory writes decompressed files to <path specified in dataset>/<folder named as source compressed file>/. When set to false, Data Factory writes decompressed files directly to <path specified in dataset>. Make sure you don't have duplicated file names in different source files, to avoid race conditions or unexpected behavior. | No |
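
The effect of preserveZipFileNameAsFolder and preserveCompressionFileNameAsFolder on the output location can be sketched in a few lines of Python. This is an illustration of the two rules above, not Data Factory code: `extraction_folder` and `find_name_collisions` are hypothetical helpers, and the sketch assumes the folder is named exactly like the source file (the article does not say whether the extension is kept).

```python
import posixpath
from collections import defaultdict


def extraction_folder(dataset_path, source_file_name, preserve_name_as_folder=True):
    """Folder that decompressed files are written to for one source archive."""
    if preserve_name_as_folder:
        # <path specified in dataset>/<folder named as source compressed file>/
        return posixpath.join(dataset_path, source_file_name) + "/"
    # Files land directly in <path specified in dataset>.
    return dataset_path


def find_name_collisions(archives):
    """Map each file name found in more than one archive to those archives.

    archives: dict of archive name -> list of file names it contains.
    """
    owners = defaultdict(list)
    for archive, names in archives.items():
        for name in names:
            owners[name].append(archive)
    return {name: srcs for name, srcs in owners.items() if len(srcs) > 1}


print(extraction_folder("out/data", "archive.zip"))          # out/data/archive.zip/
print(extraction_folder("out/data", "archive.zip", False))   # out/data
print(find_name_collisions({"a.zip": ["x.xml", "y.xml"], "b.zip": ["x.xml"]}))
```

When the preserve flag is false, any name reported by `find_name_collisions` would be written to the same path from more than one archive, which is the situation the table warns about.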

Mapping data flow properties

In mapping data flows, you can read XML format in the following data stores: Azure Blob Storage and Azure Data Lake Storage Gen2. You can point to XML files either by using an XML dataset or by using an inline dataset.

Source properties

The table below lists the properties supported by an XML source. You can edit these properties in the Source options tab. Learn more from XML connector behavior. When using an inline dataset, you will see additional file settings, which are the same as the properties described in the dataset properties section.

| Name | Description | Required | Allowed values | Data flow script property |
| --- | --- | --- | --- | --- |
| Wild card paths | All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset. | No | String[] | wildcardPaths |
| Partition root path | For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns. | No | String | partitionRootPath |
| List of files | Whether your source is pointing to a text file that lists files to process. | No | true or false | fileList |
| Column to store file name | Create a new column with the source file name and path. | No | String | rowUrlColumn |
| After completion | Delete or move the files after processing. File path starts from the container root. | No | Delete: true or false; Move: ['<from>', '<to>'] | purgeFiles; moveFiles |
| Filter by last modified | Choose to filter files based upon when they were last altered. | No | Timestamp | modifiedAfter; modifiedBefore |
| Validation mode | Specifies whether to validate the XML schema. | No | None (default, no validation); xsd (validate using XSD); dtd (validate using DTD) | validationMode |
| Namespaces | Whether to enable namespace when parsing the XML files. | No | true (default) or false | namespaces |
| Namespace prefix pairs | Namespace URI to prefix mapping, which is used to name fields when parsing the XML file. If an XML file has a namespace and namespace is enabled, by default, the field name is the same as it is in the XML document. If there is an item defined for the namespace URI in this map, the field name is prefix:fieldName. | No | Array with pattern ['URI1'->'prefix1','URI2'->'prefix2'] | namespacePrefixes |
| Allow no files found | If true, an error is not thrown if no files are found. | No | true or false | ignoreNoFilesFound |

XML source script example

The script below is an example of an XML source configuration in mapping data flows using dataset mode.

source(allowSchemaDrift: true,
    validateSchema: false,
    validationMode: 'xsd',
    namespaces: true) ~> XMLSource

The script below is an example of an XML source configuration using inline dataset mode.

source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'xml',
    fileSystem: 'filesystem',
    folderPath: 'folder',
    validationMode: 'xsd',
    namespaces: true) ~> XMLSource
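
If source scripts like these are generated programmatically, a small template function is enough. The Python sketch below is a hypothetical helper (`render_xml_source` is not a Data Factory API) that reproduces the dataset-mode layout shown above and, as an assumption, appends namespacePrefixes using the ['URI'->'prefix'] array pattern from the source properties table.

```python
def render_xml_source(name, validation_mode, namespaces=True, namespace_prefixes=None):
    """Render a dataset-mode XML source as a data flow script string."""
    options = [
        "allowSchemaDrift: true",
        "validateSchema: false",
        f"validationMode: '{validation_mode}'",
        f"namespaces: {'true' if namespaces else 'false'}",
    ]
    if namespace_prefixes:
        # Assumed rendering of the ['URI1'->'prefix1','URI2'->'prefix2'] pattern.
        pairs = ",".join(f"'{uri}'->'{prefix}'"
                         for uri, prefix in namespace_prefixes.items())
        options.append(f"namespacePrefixes: [{pairs}]")
    return "source(" + ",\n    ".join(options) + ") ~> " + name


script = render_xml_source("XMLSource", "xsd")
print(script)
```

With these arguments the output matches the dataset-mode script above line for line.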

XML connector behavior

Note the following when using XML as source.

  • XML attributes:

    • Attributes of an element are parsed as the subfields of the element in the hierarchy.
    • The name of the attribute field follows the pattern @attributeName.
  • XML schema validation:

    • You can choose not to validate the schema, or to validate the schema using XSD or DTD.
    • When using XSD or DTD to validate XML files, the XSD/DTD must be referenced inside the XML files through a relative path.
  • Namespace handling:

    • Namespace can be disabled when using data flow, in which case the attributes that define the namespace are parsed as normal attributes.
    • When namespace is enabled, the names of the elements and attributes follow the pattern namespaceUri,elementName and namespaceUri,@attributeName by default. You can define a namespace prefix for each namespace URI in the source, in which case the names of the elements and attributes follow the pattern definedPrefix:elementName or definedPrefix:@attributeName instead.
  • Value column:

    • If an XML element has both a simple text value and attributes/child elements, the simple text value is parsed as the value of a "value column" with the built-in field name _value_. It also inherits the namespace of the element, if applicable.
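
The naming rules above can be illustrated with a short Python sketch built on the standard library's xml.etree.ElementTree. It is an approximation for reading purposes, not the connector's implementation: `field_name` and `to_hierarchy` are hypothetical helpers, the sample document and its namespace URI are made up, repeated child elements are collected into a list as an assumption, and the sketch does not model data type detection, disabled namespaces, or namespace inheritance on the _value_ field.

```python
import xml.etree.ElementTree as ET


def field_name(qname, prefixes, is_attribute=False):
    """Map an ElementTree '{uri}local' name to the field-name patterns above."""
    uri, local = "", qname
    if qname.startswith("{"):
        uri, local = qname[1:].split("}", 1)
    if is_attribute:
        local = "@" + local
    if not uri:
        return local
    if uri in prefixes:
        return f"{prefixes[uri]}:{local}"   # definedPrefix:name
    return f"{uri},{local}"                 # namespaceUri,name


def to_hierarchy(element, prefixes):
    """Convert an element to nested dicts; text-only elements become strings."""
    fields = {}
    for name, value in element.attrib.items():
        fields[field_name(name, prefixes, is_attribute=True)] = value
    for child in element:
        key = field_name(child.tag, prefixes)
        value = to_hierarchy(child, prefixes)
        if key in fields:
            # Assumption: repeated child elements are gathered into a list.
            if not isinstance(fields[key], list):
                fields[key] = [fields[key]]
            fields[key].append(value)
        else:
            fields[key] = value
    text = (element.text or "").strip()
    if not fields:
        return text
    if text:
        fields["_value_"] = text  # text alongside attributes/child elements
    return fields


doc = """<order xmlns="http://example.com/shop" id="42">
  <item sku="A1">Widget</item>
  <note>rush</note>
</order>"""
root = ET.fromstring(doc)
print(field_name(root.tag, {}))
print(to_hierarchy(root, {"http://example.com/shop": "shop"}))
```

In this sample, item carries both an attribute and text, so its text moves into _value_, while note has only text and stays a plain string.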

Next steps