Azure 数据工厂中的 Excel 格式Excel format in Azure Data Factory

适用于: Azure 数据工厂 Azure Synapse Analytics

如果要 分析 Excel 文件,请按此文的要求操作。Follow this article when you want to parse the Excel files. Azure 数据工厂同时支持“.xls”和“.xlsx”。Azure Data Factory supports both ".xls" and ".xlsx".

以下连接器支持 Excel 格式:Amazon S3Azure BlobAzure Data Lake Storage Gen2Azure 文件存储文件系统FTPGoogle 云存储HDFSHTTPSFTPExcel format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP. 它可以作为源,但不可作为接收器。It is supported as source but not sink.

注意:使用 HTTP 时不支持“.xls”格式。Note: ".xls" format is not supported while using HTTP.

数据集属性Dataset properties

有关可用于定义数据集的各部分和属性的完整列表,请参阅数据集一文。For a full list of sections and properties available for defining datasets, see the Datasets article. 本部分提供 Excel 数据集支持的属性列表。This section provides a list of properties supported by the Excel dataset.

属性Property 说明Description 必需Required
typetype 数据集的 type 属性必须设置为“Excel”。The type property of the dataset must be set to Excel. Yes
locationlocation 文件的位置设置。Location settings of the file(s). 每个基于文件的连接器在 location 下都有其自己的位置类型和支持的属性。Each file-based connector has its own location type and supported properties under location. Yes
sheetNamesheetName 要读取数据的 Excel 工作表名称。The Excel worksheet name to read data. 指定 sheetNamesheetIndexSpecify sheetName or sheetIndex
sheetIndexsheetIndex 用于读取数据的 Excel 工作表索引,从 0 开始。The Excel worksheet index to read data, starting from 0. 指定 sheetNamesheetIndexSpecify sheetName or sheetIndex
rangerange 给定工作表中用于定位选择性数据的单元格范围,例如:The cell range in the given worksheet to locate the selective data, e.g.:
- 未指定:以表的形式从第一个非空行和列读取整个工作表- Not specified: reads the whole worksheet as a table from the first non-empty row and column
- A3:读取从给定单元格开始的表,动态检测下面的所有行和右侧的所有列- A3: reads a table starting from the given cell, dynamically detects all the rows below and all the columns to the right
- A3:H5:以表的形式读取此固定范围- A3:H5: reads this fixed range as a table
- A3:A3:读取此单个单元格- A3:A3: reads this single cell
No
firstRowAsHeaderfirstRowAsHeader 指定是否要将给定工作表/范围内的第一行视为带有列名的标题行。Specifies whether to treat the first row in the given worksheet/range as a header line with names of columns.
允许的值为 truetrue(默认值)。Allowed values are true and false (default).
No
nullValuenullValue 指定 null 值的字符串表示形式。Specifies the string representation of null value.
默认值为 空字符串The default value is empty string.
No
compressioncompression 用来配置文件压缩的属性组。Group of properties to configure file compression. 如果需要在活动执行期间进行压缩/解压缩,请配置此部分。Configure this section when you want to do compression/decompression during activity execution. No
typetype
(在 compression 下)(under compression)
用来读取/写入 JSON 文件的压缩编解码器。The compression codec used to read/write JSON files.
允许的值为 bzip2、gzip、deflate、ZipDeflate、TarGzip、Tar、snappy 或 lz4 。Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. 默认设置是不压缩。Default is not compressed.
注意:目前,复制活动不支持“snappy”和“lz4”,映射数据流不支持“ZipDeflate”、“TarGzip”和“Tar”。Note currently Copy activity doesn't support "snappy" & "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip" and "Tar".
注意,使用复制活动来解压缩 ZipDeflate 文件并写入到基于文件的接收器数据存储时,会将文件提取到以下文件夹:<path specified in dataset>/<folder named as source zip file>/Note when using copy activity to decompress ZipDeflate file(s) and write to file-based sink data store, files are extracted to the folder: <path specified in dataset>/<folder named as source zip file>/.
否。No.
levellevel
(在 compression 下)(under compression)
压缩率。The compression ratio.
允许的值为 OptimalFastestAllowed values are Optimal or Fastest.
- Fastest:尽快完成压缩操作,不过,无法以最佳方式压缩生成的文件。- Fastest: The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed.
- Optimal:以最佳方式完成压缩操作,不过,需要耗费更长的时间。- Optimal: The compression operation should be optimally compressed, even if the operation takes a longer time to complete. 有关详细信息,请参阅 Compression Level(压缩级别)主题。For more information, see Compression Level topic.
No

下面是 Azure Blob 存储上 Excel 数据集的示例:Below is an example of Excel dataset on Azure Blob Storage:

{
    "name": "ExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder",
            },
            "sheetName": "MyWorksheet",
            "range": "A3:H5",
            "firstRowAsHeader": true
        }
    }
}

复制活动属性Copy activity properties

有关可用于定义活动的各部分和属性的完整列表,请参阅管道一文。For a full list of sections and properties available for defining activities, see the Pipelines article. 本部分提供 Excel 源支持的属性列表。This section provides a list of properties supported by the Excel source.

Excel 作为源Excel as source

复制活动的 __source* * 节支持以下属性。The following properties are supported in the copy activity __source** section.

属性Property 说明Description 必需Required
typetype 复制活动源的 type 属性必须设置为“ExcelSource”。The type property of the copy activity source must be set to ExcelSource. Yes
storeSettingsstoreSettings 有关如何从数据存储读取数据的一组属性。A group of properties on how to read data from a data store. 每个基于文件的连接器在 storeSettings 下都有其自己支持的读取设置。Each file-based connector has its own supported read settings under storeSettings. No
"activities": [
    {
        "name": "CopyFromExcel",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "ExcelSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            ...
        }
        ...
    }
]

映射数据流属性Mapping data flow properties

在映射数据流中,可以在以下数据存储中读取 XML 格式:Azure Blob 存储Azure Data Lake Storage Gen2In mapping data flows, you can read Excel format in the following data stores: Azure Blob Storage, and Azure Data Lake Storage Gen2. 可以使用 Excel 数据集或内联数据集来指向 Excel 文件。You can point to Excel files either using Excel dataset or using an inline dataset.

源属性Source properties

下表列出了 Excel 源支持的属性。The below table lists the properties supported by an Excel source. 你可以在“源选项”选项卡中编辑这些属性。在使用内联数据集时,你会看到其他文件设置,这些设置与数据集属性部分描述的属性相同。You can edit these properties in the Source options tab. When using inline dataset, you will see additional file settings, which are the same as the properties described in dataset properties section.

名称Name 说明Description 必需Required 允许的值Allowed values 数据流脚本属性Data flow script property
通配符路径Wild card paths 将处理与通配符路径匹配的所有文件。All files matching the wildcard path will be processed. 替代在数据集中设置的文件夹和文件路径。Overrides the folder and file path set in the dataset. no String[]String[] wildcardPathswildcardPaths
分区根路径Partition root path 对于已分区的文件数据,你可以输入分区根路径,以便将已分区的文件夹作为列读取For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns no 字符串String partitionRootPathpartitionRootPath
文件列表List of files 源是否指向列出了要处理的文件的文本文件Whether your source is pointing to a text file that lists files to process no truefalsetrue or false fileListfileList
用于存储文件名的列Column to store file name 使用源文件名称和路径创建一个新列Create a new column with the source file name and path no 字符串String rowUrlColumnrowUrlColumn
完成后After completion 在处理后删除或移动文件。Delete or move the files after processing. 文件路径从容器根目录开始File path starts from the container root no Delete:truefalseDelete: true or false
Move:['<from>', '<to>']Move: ['<from>', '<to>']
purgeFilespurgeFiles
moveFilesmoveFiles
按上次修改时间筛选Filter by last modified 选择根据上次更改时间筛选文件Choose to filter files based upon when they were last altered no 时间戳Timestamp ModifiedAftermodifiedAfter
modifiedBeforemodifiedBefore
允许找不到文件Allow no files found 如果为 true,则找不到文件时不会引发错误If true, an error is not thrown if no files are found no truefalsetrue or false ignoreNoFilesFoundignoreNoFilesFound

源示例Source example

下图是使用数据集模式的映射数据流中 Excel 源配置的示例。The below image is an example of an Excel source configuration in mapping data flows using dataset mode.

Excel 源

关联的数据流脚本为:The associated data flow script is:

source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['*.xls']) ~> ExcelSource

如果你使用内联数据集,则会在映射数据流中看到以下源选项。If you use inline dataset, you see the following source options in mapping data flow.

Excel 源内联数据集

关联的数据流脚本为:The associated data flow script is:

source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'excel',
    fileSystem: 'container',
    folderPath: 'path',
    fileName: 'sample.xls',
    sheetName: 'worksheet',
    firstRowAsHeader: true) ~> ExcelSourceInlineDataset

后续步骤Next steps