Source transformation in mapping data flow

Applies to: Azure Data Factory and Azure Synapse Analytics

A source transformation configures your data source for the data flow. When you design data flows, your first step is always configuring a source transformation. To add a source, select the Add Source box in the data flow canvas.
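In data flow script (the code-behind representation of a mapping data flow), a source transformation is a `source(...)` step whose output stream is named with the `~>` operator. The following is a minimal sketch, assuming a delimited-text dataset with three string columns; the column names and the `MoviesSource` stream name are illustrative:

```
source(output(
        movieId as string,
        title as string,
        genres as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: false) ~> MoviesSource
```

The `output(...)` clause is the source's projection, and the named output stream (`MoviesSource`) is what downstream transformations reference.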

Every data flow requires at least one source transformation, but you can add as many sources as necessary to complete your data transformations. You can join those sources together with a join, lookup, or union transformation.

Each source transformation is associated with exactly one dataset or linked service. The dataset defines the shape and location of the data you want to write to or read from. If you use a file-based dataset, you can use wildcards and file lists in your source to work with more than one file at a time.
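For example, a wildcard path appears as an option on the source in data flow script. This is a hedged sketch, assuming `wildcardPaths` is the script property behind the Wildcard paths setting for file-based sources; the container, folder, and column names are placeholders:

```
source(output(
        orderId as string,
        amount as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['sales/2021/*.csv']) ~> SalesFiles
```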

Inline datasets

The first decision you make when you create a source transformation is whether your source information is defined inside a dataset object or within the source transformation. Most formats are available in only one or the other. To learn how to use a specific connector, see the appropriate connector document.

When a format is supported both inline and in a dataset object, each approach has benefits. Dataset objects are reusable entities that can be used in other data flows and in activities such as Copy. These reusable entities are especially useful when you use a hardened schema. Datasets aren't based in Spark. Occasionally, you might need to override certain settings or the schema projection in the source transformation.

Inline datasets are recommended when you use flexible schemas, one-off source instances, or parameterized sources. If your source is heavily parameterized, inline datasets let you avoid creating a "dummy" object. Inline datasets are based in Spark, and their properties are native to data flow.

To use an inline dataset, select the format you want in the Source type selector. Instead of selecting a source dataset, you select the linked service you want to connect to.
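In script form, an inline source carries its format and location properties directly instead of referencing a dataset object. The following is a sketch, assuming a Delta inline source on a storage container; the `format`, `fileSystem`, and `folderPath` property names follow the pattern used in connector script examples, and the container and folder names are placeholders:

```
source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'delta',
    fileSystem: 'mycontainer',
    folderPath: 'deltatable') ~> DeltaSource
```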

Screenshot that shows Inline selected.

Supported source types

Mapping data flow follows an extract, load, and transform (ELT) approach and works with staging datasets that are all in Azure. Currently, the following datasets can be used in a source transformation.

| Connector | Format | Dataset/inline |
| --- | --- | --- |
| Azure Blob Storage | Avro | ✓/- |
| | Delimited text | ✓/- |
| | Delta | -/✓ |
| | Excel | ✓/✓ |
| | JSON | ✓/- |
| | ORC | ✓/✓ |
| | Parquet | ✓/- |
| | XML | ✓/✓ |
| Azure Cosmos DB (SQL API) | | ✓/- |
| Azure Data Lake Storage Gen2 | Avro | ✓/- |
| | Common Data Model | -/✓ |
| | Delimited text | ✓/- |
| | Delta | -/✓ |
| | Excel | ✓/✓ |
| | JSON | ✓/- |
| | ORC | ✓/✓ |
| | Parquet | ✓/- |
| | XML | ✓/✓ |
| Azure Database for PostgreSQL | | ✓/✓ |
| Azure SQL Database | | ✓/- |
| Azure SQL Managed Instance (preview) | | ✓/- |
| Azure Synapse Analytics | | ✓/- |
| Hive | | -/✓ |
| Snowflake | | ✓/✓ |

Settings specific to these connectors are located on the Source options tab. Information about these settings, along with data flow script examples, is located in the connector documentation.

Azure Data Factory has access to more than 90 native connectors. To include data from those other sources in your data flow, use the Copy activity to load that data into one of the supported staging areas.

Source settings

After you've added a source, configure it via the Source settings tab. Here you can pick or create the dataset your source points at. You can also select schema and sampling options for your data.

Development values for dataset parameters can be configured in debug settings. (Debug mode must be turned on.)

Screenshot that shows the Source settings tab.

Output stream name: The name of the source transformation.

Source type: Choose whether you want to use an inline dataset or an existing dataset object.

Test connection: Test whether or not the data flow's Spark service can successfully connect to the linked service used in your source dataset. Debug mode must be on for this feature to be enabled.

Schema drift: Schema drift is the ability of Data Factory to natively handle flexible schemas in your data flows without needing to explicitly define column changes.

  • Select the Allow schema drift check box if the source columns will change often. This setting allows all incoming source fields to flow through the transformations to the sink.

  • Selecting Infer drifted column types instructs Data Factory to detect and define data types for each new column discovered. With this feature turned off, all drifted columns will be of type string. The script sketch after this list shows both options.
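In data flow script, these two check boxes correspond to boolean properties on the source. A minimal sketch, assuming `allowSchemaDrift` and `inferDriftedColumnTypes` are the script names behind the UI options, as they appear in connector script examples:

```
source(output(
        movieId as string,
        title as string
    ),
    allowSchemaDrift: true,
    inferDriftedColumnTypes: true,
    validateSchema: false) ~> DriftingSource
```

With this configuration, columns beyond `movieId` and `title` still flow downstream, and their types are inferred rather than defaulting to string.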

Validate schema: If Validate schema is selected, the data flow will fail to run if the incoming source data doesn't match the defined schema of the dataset.

Skip line count: The Skip line count field specifies how many lines to ignore at the beginning of the dataset.

Sampling: Enable Sampling to limit the number of rows from your source. Use this setting when you test or sample data from your source for debugging purposes.
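Both options also surface as source properties in the script. The following sketch rests on two assumptions: that `validateSchema: true` is what enforces the dataset schema at run time, and that a `limit` property carries the sampling row count (the value 1000 and the column names are illustrative):

```
source(output(
        orderId as string,
        amount as string
    ),
    allowSchemaDrift: false,
    validateSchema: true,
    limit: 1000) ~> SampledSource
```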

To validate that your source is configured correctly, turn on debug mode and fetch a data preview. For more information, see Debug mode.

Note

When debug mode is turned on, the row limit configuration in debug settings overrides the sampling setting in the source during data preview.

Source options

The Source options tab contains settings specific to the connector and format chosen. For more information and examples, see the relevant connector documentation.

Projection

Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the source data. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect the schema defined in the dataset. When your source files aren't strongly typed (for example, flat .csv files rather than Parquet files), you can define the data types for each field in the source transformation.

Screenshot that shows settings on the Projection tab.

If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the data types. Select Define default format to autodetect the default data formats.

Reset schema resets the projection to what is defined in the referenced dataset.

You can modify the column data types in a downstream derived-column transformation. Use a select transformation to modify the column names.
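In script terms, the projection is the `output(...)` clause of the source, so overriding a type on the Projection tab simply changes the type declared there. A sketch of a .csv source given explicit types (the column names are illustrative):

```
source(output(
        movieId as integer,
        title as string,
        releaseDate as date,
        rating as double
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> TypedCsvSource
```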

Import schema

Select the Import schema button on the Projection tab to use an active debug cluster to create a schema projection. It's available in every source type. Importing the schema here will override the projection defined in the dataset. The dataset object won't be changed.

Importing schemas is useful in datasets like Avro and Azure Cosmos DB that support complex data structures and don't require schema definitions to exist in the dataset. For inline datasets, importing the schema is the only way to reference column metadata without schema drift.

Optimize the source transformation

The Optimize tab allows for editing of partition information at each transformation step. In most cases, Use current partitioning will optimize for the ideal partitioning structure for a source.

If you're reading from an Azure SQL Database source, custom Source partitioning will likely read data the fastest. Data Factory will read large queries by making connections to your database in parallel. This source partitioning can be done on a column or by using a query.

Screenshot that shows the Source partition settings.
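In data flow script, partitioning choices from the Optimize tab are expressed as a `partitionBy(...)` setting inside the transformation. The following is a hedged sketch using the `partitionBy('hash', <partitions>, <column>)` form seen in optimization script examples; the SQL-specific Source partitioning option may serialize differently, and the column name and partition count here are illustrative:

```
source(output(
        id as integer,
        name as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    partitionBy('hash', 8, id)) ~> SqlSource
```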

For more information on optimization within mapping data flow, see the Optimize tab.

Next steps

Begin building your data flow with a derived-column transformation and a select transformation.