XML format in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

Follow this article when you want to parse XML files.

XML format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP. It is supported as a source but not as a sink.

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section lists the properties supported by the XML dataset.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to Xml. | Yes |
| location | Location settings of the file(s). Each file-based connector has its own location type and supported properties under location. See details in the connector article -> Dataset properties section. | Yes |
| encodingName | The encoding type used to read/write text files. Allowed values are as follows: "UTF-8", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". | No |
| nullValue | Specifies the string representation of a null value. The default value is an empty string. | No |
| compression | Group of properties to configure file compression. Configure this section when you want to do compression/decompression during activity execution. | No |
| type (under compression) | The compression codec used to read/write XML files. Allowed values are bzip2, gzip, deflate, ZipDeflate, snappy, or lz4. The default is no compression. Note that Copy activity currently doesn't support "snappy" and "lz4". Note that when you use copy activity to decompress ZipDeflate file(s) and write to a file-based sink data store, files are extracted by default to the folder <path specified in dataset>/<folder named as source zip file>/. Use preserveZipFileNameAsFolder on the copy activity source to control whether the zip file name is preserved as a folder structure. | No |
| level (under compression) | The compression ratio. Allowed values are Optimal or Fastest. Fastest: the compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. Optimal: the compression operation should compress optimally, even if the operation takes longer to complete. For more information, see the Compression Level topic. | No |

Below is an example of an XML dataset on Azure Blob Storage:

{
    "name": "XMLDataset",
    "properties": {
        "type": "Xml",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}
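
The optional dataset properties from the table can be combined in the same way. The following is a minimal sketch rather than a definitive configuration: the dataset name XMLDatasetGzip and the chosen values ("UTF-8", "NULL", gzip, Optimal) are illustrative assumptions, picked only from the allowed values listed above.

```json
{
    "name": "XMLDatasetGzip",
    "properties": {
        "type": "Xml",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "encodingName": "UTF-8",
            "nullValue": "NULL",
            "compression": {
                "type": "gzip",
                "level": "Optimal"
            }
        }
    }
}
```

With this sketch, any field whose text is the literal string NULL would be treated as a null value, since nullValue defines the string representation of null.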

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section lists the properties supported by the XML source.

To learn how to map XML data to the sink data store/format, see schema mapping. When you preview XML files, the data is shown as a JSON hierarchy, and you use JSON paths to point to the fields.
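
As a rough sketch of what such a mapping can look like, the fragment below uses the copy activity's translator property with JSON paths. It assumes the TabularTranslator shape described in the schema mapping article; the XML field names (orders, order, @id, customer) and the sink column names are hypothetical and exist only for illustration.

```json
{
    "translator": {
        "type": "TabularTranslator",
        "collectionReference": "$['orders']['order']",
        "mappings": [
            {
                "source": { "path": "['@id']" },
                "sink": { "name": "OrderId" }
            },
            {
                "source": { "path": "['customer']" },
                "sink": { "name": "CustomerName" }
            }
        ]
    }
}
```

Here collectionReference points at the repeating element, and the paths under mappings are written relative to it. Check the schema mapping article for the exact path syntax before relying on this.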

XML as source

The following properties are supported in the copy activity *source* section. Learn more from the XML connector behavior section.

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to XmlSource. | Yes |
| formatSettings | A group of properties. Refer to the XML read settings table below. | No |
| storeSettings | A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under storeSettings. See details in the connector article -> Copy activity properties section. | No |

Supported XML read settings under formatSettings:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of formatSettings must be set to XmlReadSettings. | Yes |
| validationMode | Specifies whether to validate the XML schema. Allowed values are none (default, no validation), xsd (validate using XSD), and dtd (validate using DTD). | No |
| namespacePrefixes | Namespace URI to prefix mapping, which is used to name fields when parsing the XML file. If an XML file has a namespace and namespace is enabled, by default the field name is the same as it is in the XML document. If an item is defined for the namespace URI in this map, the field name is prefix:fieldName. | No |
| compressionProperties | A group of properties on how to decompress data for a given compression codec. | No |
| preserveZipFileNameAsFolder (under compressionProperties) | Applies when the input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as a folder structure during copy. When set to true (default), Data Factory writes unzipped files to <path specified in dataset>/<folder named as source zip file>/. When set to false, Data Factory writes unzipped files directly to <path specified in dataset>. Make sure you don't have duplicate file names in different source zip files, to avoid race conditions or unexpected behavior. | No |
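
To show how these settings fit together, here is a minimal sketch of the source portion of a copy activity. The property names under source and formatSettings come from the tables above. The storeSettings type AzureBlobStorageReadSettings, the recursive flag, and the compressionProperties type ZipDeflateReadSettings are assumptions based on the Azure Blob connector and are not defined in this article. The namespace URI and prefix are placeholders.

```json
{
    "source": {
        "type": "XmlSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true
        },
        "formatSettings": {
            "type": "XmlReadSettings",
            "validationMode": "xsd",
            "namespacePrefixes": {
                "http://www.example.com/xml": "prefix"
            },
            "compressionProperties": {
                "type": "ZipDeflateReadSettings",
                "preserveZipFileNameAsFolder": false
            }
        }
    }
}
```

In this sketch, fields from the namespace http://www.example.com/xml would be named prefix:fieldName, and unzipped files would be written directly to the path specified in the dataset instead of into a folder named after each source zip file.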

XML connector behavior

Note the following when using XML as source.

  • XML attributes:

    • Attributes of an element are parsed as subfields of the element in the hierarchy.
    • The name of the attribute field follows the pattern @attributeName.
  • XML schema validation:

    • You can choose not to validate the schema, or to validate it using XSD or DTD.
    • When using XSD or DTD to validate XML files, the XSD/DTD must be referenced inside the XML files through a relative path.
  • Namespace handling:

    • When namespace is enabled, the names of elements and attributes follow the pattern namespaceUri,elementName and namespaceUri,@attributeName by default. You can define a namespace prefix for each namespace URI in the source, in which case the names of elements and attributes follow the pattern definedPrefix:elementName or definedPrefix:@attributeName instead.
  • Value column:

    • If an XML element has both a simple text value and attributes/child elements, the simple text value is parsed as the value of a "value column" with the built-in field name _value_. The value column also inherits the element's namespace, if applicable.
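
As a rough illustration of the attribute and value-column rules above, consider a hypothetical XML fragment `<order><item id="7" currency="USD">19.99</item></order>` with no namespaces. The previewed hierarchy would look approximately like the following. The element names and values are invented for this sketch, and the exact shape and data types shown by Data Factory may differ.

```json
{
    "order": {
        "item": {
            "@id": "7",
            "@currency": "USD",
            "_value_": "19.99"
        }
    }
}
```

The item element has both text and attributes, so its text lands in _value_, while each attribute becomes a subfield whose name starts with @.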

Next steps