使用 Azure 数据工厂(旧版)从 MongoDB 复制数据Copy data from MongoDB using Azure Data Factory (legacy)

适用于:是 Azure 数据工厂是 Azure Synapse Analytics(预览版)APPLIES TO: yesAzure Data Factory yesAzure Synapse Analytics (Preview)

本文概述了如何使用 Azure 数据工厂中的复制活动从 MongoDB 数据库复制数据。This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a MongoDB database. 它是基于概述复制活动总体的复制活动概述一文。It builds on the copy activity overview article that presents a general overview of copy activity.

重要

ADF 发布了一款新的 MongoDB 连接器,与这个基于 ODBC 的实现相比,该连接器提供更好的本机 MongoDB 支持,详情请参阅 MongoDB 连接器一文。ADF release a new MongoDB connector which provides better native MongoDB support comparing to this ODBC-based implementation, refer to MongoDB connector article on details. 这一旧版 MongoDB 连接器仍支持向后兼容,但是对于任何新的工作负载,请使用新连接器。This legacy MongoDB connector is kept supported as-is for backward compability, while for any new workload, please use the new connector.

支持的功能Supported capabilities

可以将数据从 MongoDB 数据库复制到任何支持的接收器数据存储。You can copy data from MongoDB database to any supported sink data store. 有关复制活动支持作为源/接收器的数据存储列表,请参阅支持的数据存储表。For a list of data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.

具体而言,此 MongoDB 连接器支持:Specifically, this MongoDB connector supports:

  • MongoDB 版本 2.4、2.6、3.0、3.2、3.4 和 3.6MongoDB versions 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6.
  • 使用基本或匿名身份验证复制数据。Copying data using Basic or Anonymous authentication.

先决条件Prerequisites

如果数据存储位于本地网络、Azure 虚拟网络或 Amazon Virtual Private Cloud 内部,则需要设置自承载集成运行时才能连接到该数据存储。If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private Cloud, you need to set up a self-hosted integration runtime to connect to it.

如果数据存储是托管的云数据服务,则可以使用 Azure 集成运行时。If your data store is a managed cloud data service, you can use Azure integration runtime. 如果访问范围限制为防火墙规则中列入白名单的 IP,可以选择将 Azure 集成运行时 IP 添加到允许列表。If the access is restricted to IPs that are whitelisted in the firewall rules, you can choose to add Azure Integration Runtime IPs into the allow list.

要详细了解网络安全机制和数据工厂支持的选项,请参阅数据访问策略For more information about the network security mechanisms and options supported by Data Factory, see Data access strategies.

集成运行时提供内置 MongoDB 驱动程序,因此从 MongoDB 复制数据时,无需手动安装任何驱动程序。The Integration Runtime provides a built-in MongoDB driver, therefore you don't need to manually install any driver when copying data from MongoDB.

入门Getting started

若要使用管道执行复制活动,可以使用以下工具或 SDK 之一:To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

对于特定于 MongoDB 连接器的数据工厂实体,以下部分提供有关用于定义这些实体的属性的详细信息。The following sections provide details about properties that are used to define Data Factory entities specific to MongoDB connector.

链接服务属性Linked service properties

MongoDB 链接的服务支持以下属性:The following properties are supported for MongoDB linked service:

属性Property 说明Description 必须Required
typetype type 属性必须设置为:MongoDbThe type property must be set to: MongoDb Yes
serverserver MongoDB 服务器的 IP 地址或主机名。IP address or host name of the MongoDB server. Yes
portport MongoDB 服务器用于侦听客户端连接的 TCP 端口。TCP port that the MongoDB server uses to listen for client connections. 否(默认值为 27017)No (default is 27017)
databaseNamedatabaseName 要访问的 MongoDB 数据库名称。Name of the MongoDB database that you want to access. Yes
authenticationTypeauthenticationType 用于连接 MongoDB 数据库的身份验证类型。Type of authentication used to connect to the MongoDB database.
允许值包括:基本和匿名 。Allowed values are: Basic, and Anonymous.
Yes
usernameusername 用于访问 MongoDB 的用户帐户。User account to access MongoDB. 是(如果使用基本身份验证)。Yes (if basic authentication is used).
passwordpassword 用户密码。Password for the user. 将此字段标记为 SecureString 以安全地将其存储在数据工厂中或引用存储在 Azure Key Vault 中的机密Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. 是(如果使用基本身份验证)。Yes (if basic authentication is used).
authSourceauthSource 要用于检查身份验证凭据的 MongoDB 数据库名称。Name of the MongoDB database that you want to use to check your credentials for authentication. 否。No. 对于基本身份验证,默认使用管理员帐户和使用 databaseName 属性指定的数据库。For basic authentication, default is to use the admin account and the database specified using databaseName property.
enableSslenableSsl 指定是否使用 TLS 加密到服务器的连接。Specifies whether the connections to the server are encrypted using TLS. 默认值为 false。The default value is false. No
allowSelfSignedServerCertallowSelfSignedServerCert 指定是否允许来自服务器的自签名证书。Specifies whether to allow self-signed certificates from the server. 默认值为 false。The default value is false. No
connectViaconnectVia 用于连接到数据存储的集成运行时The Integration Runtime to be used to connect to the data store. 先决条件部分了解更多信息。Learn more from Prerequisites section. 如果未指定,则使用默认 Azure Integration Runtime。If not specified, it uses the default Azure Integration Runtime. No

示例:Example:

{
    "name": "MongoDBLinkedService",
    "properties": {
        "type": "MongoDb",
        "typeProperties": {
            "server": "<server name>",
            "databaseName": "<database name>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

数据集属性Dataset properties

有关可用于定义数据集的各部分和属性的完整列表,请参阅数据集和链接服务For a full list of sections and properties that are available for defining datasets, see Datasets and linked services. MongoDB 数据集支持以下属性:The following properties are supported for MongoDB dataset:

属性Property 说明Description 必须Required
typetype 数据集的 type 属性必须设置为:MongoDbCollectionThe type property of the dataset must be set to: MongoDbCollection Yes
collectionNamecollectionName MongoDB 数据库中集合的名称。Name of the collection in MongoDB database. Yes

示例:Example:

{
    "name": "MongoDbDataset",
    "properties": {
        "type": "MongoDbCollection",
        "linkedServiceName": {
            "referenceName": "<MongoDB linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "collectionName": "<Collection name>"
        }
    }
}

复制活动属性Copy activity properties

有关可用于定义活动的各部分和属性的完整列表,请参阅管道一文。For a full list of sections and properties available for defining activities, see the Pipelines article. 本部分提供 MongoDB 源支持的属性列表。This section provides a list of properties supported by MongoDB source.

以 MongoDB 作为源MongoDB as source

复制活动source部分支持以下属性:The following properties are supported in the copy activity source section:

属性Property 说明Description 必须Required
typetype 复制活动 source 的 type 属性必须设置为:MongoDbSourceThe type property of the copy activity source must be set to: MongoDbSource Yes
查询query 使用自定义 SQL-92 查询读取数据。Use the custom SQL-92 query to read data. 例如:select * from MyTable。For example: select * from MyTable. 否(如果指定了数据集中的“collectionName”)No (if "collectionName" in dataset is specified)

示例:Example:

"activities":[
    {
        "name": "CopyFromMongoDB",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<MongoDB input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "MongoDbSource",
                "query": "SELECT * FROM MyTable"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

提示

指定 SQL 查询时,请注意 DateTime 的格式。When specify the SQL query, pay attention to the DateTime format. 例如:SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02',或使用参数 SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'For example: SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02' or to use parameter SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'

数据工厂的构架Schema by Data Factory

Azure 数据工厂服务通过使用 MongoDB 集合中最新的 100 个文档来推断该集合的架构。Azure Data Factory service infers schema from a MongoDB collection by using the latest 100 documents in the collection. 如果这 100 个文档不包含完整架构,则在复制操作期间可能忽略某些列。If these 100 documents do not contain full schema, some columns may be ignored during the copy operation.

MongoDB 的数据类型映射Data type mapping for MongoDB

从 MongoDB 复制数据时,以下映射用于从 MongoDB 数据类型映射到 Azure 数据工厂临时数据类型。When copying data from MongoDB, the following mappings are used from MongoDB data types to Azure Data Factory interim data types. 若要了解复制活动如何将源架构和数据类型映射到接收器,请参阅架构和数据类型映射See Schema and data type mappings to learn about how copy activity maps the source schema and data type to the sink.

MongoDB 数据类型MongoDB data type 数据工厂临时数据类型Data factory interim data type
二进制Binary Byte[]Byte[]
布尔Boolean 布尔Boolean
DateDate DateTimeDateTime
NumberDoubleNumberDouble DoubleDouble
NumberIntNumberInt Int32Int32
NumberLongNumberLong Int64Int64
ObjectIDObjectID StringString
StringString StringString
UUIDUUID GuidGuid
ObjectObject 重新标准化为平展列,以“”作为嵌套分隔符Renormalized into flatten columns with “" as nested separator

备注

要了解对使用虚拟表的数组的支持,请参阅支持使用虚拟表的复杂类型一节。To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section.

目前不支持以下 MongoDB 数据类型:DBPointer、JavaScript、最大/最小键、正则表达式、符号、时间戳、未定义。Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression, Symbol, Timestamp, Undefined.

支持使用虚拟表的复杂类型Support for complex types using virtual tables

Azure 数据工厂使用内置的 ODBC 驱动程序连接到 MongoDB 数据库,并从中复制数据。Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your MongoDB database. 对于数组或文档间不同类型的对象等复杂类型,该驱动程序会将数据重新标准化到相应虚拟表中。For complex types such as arrays or objects with different types across the documents, the driver re-normalizes data into corresponding virtual tables. 具体而言,如果表中包含此类列,该驱动程序会生成以下虚拟表:Specifically, if a table contains such columns, the driver generates the following virtual tables:

  • 基表,其中包含与实际表相同的数据(复杂类型列除外)。A base table, which contains the same data as the real table except for the complex type columns. 基表使用与其所表示的实际表相同的名称。The base table uses the same name as the real table that it represents.
  • 对于每个复杂类型列生成一个虚拟表,这会扩展嵌套数据。A virtual table for each complex type column, which expands the nested data. 使用实际表名称、分隔符“”和数组或对象的名称,对虚拟表命名。The virtual tables are named using the name of the real table, a separator “" and the name of the array or object.

虚拟表引用实际表中的数据,以使驱动程序能访问非规范化的数据。Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. 通过查询和联接虚拟表,可访问 MongoDB 数组的内容。You can access the content of MongoDB arrays by querying and joining the virtual tables.

示例Example

例如,此处 ExampleTable 为 MongoDB 表,其中“发票”列的每个单元格包含对象数组,“评级”列包含标量类型数组。For example, ExampleTable here is a MongoDB table that has one column with an array of Objects in each cell – Invoices, and one column with an array of Scalar types – Ratings.

_id_id 客户名称Customer Name 发票Invoices 服务级别Service Level 评级Ratings
11111111 ABCABC [{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price:"1235", discount:"0.2"}][{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price: "1235", discount: "0.2"}] SilverSilver [5,6][5,6]
22222222 XYZXYZ [{invoice_id:"135", item:"fridge", price:"12543", discount:"0.0"}][{invoice_id:"135", item:"fridge", price: "12543", discount: "0.0"}] GoldGold [1,2][1,2]

该驱动程序会生成多个虚拟表来表示此单个表。The driver would generate multiple virtual tables to represent this single table. 第一个虚拟表是名为“ExampleTable”的基表,如下例所示。The first virtual table is the base table named “ExampleTable", shown in the example. 基表包含原始表中的所有数据,但已省略数组中的数据,这些数据会在虚拟表中展开。The base table contains all the data of the original table, but the data from the arrays has been omitted and is expanded in the virtual tables.

_id_id 客户名称Customer Name 服务级别Service Level
11111111 ABCABC SilverSilver
22222222 XYZXYZ GoldGold

下表显示在示例中表示原始数组的虚拟表。The following tables show the virtual tables that represent the original arrays in the example. 这些表包含以下项:These tables contain the following:

  • 通过 _id 列返回到原始主键列的引用,该列与原始数组的行对应A reference back to the original primary key column corresponding to the row of the original array (via the _id column)
  • 原始数组中数据位置的指示An indication of the position of the data within the original array
  • 该数组中每个元素展开的数据The expanded data for each element within the array

“ExampleTable_Invoices”表:Table “ExampleTable_Invoices":

_id_id ExampleTable_Invoices_dim1_idxExampleTable_Invoices_dim1_idx invoice_idinvoice_id itemitem priceprice 折扣Discount
11111111 00 123123 吐司炉toaster 456456 0.20.2
11111111 11 124124 烤箱oven 12351235 0.20.2
22222222 00 135135 冰箱fridge 1254312543 0.00.0

“ExampleTable_Ratings”表:Table “ExampleTable_Ratings":

_id_id ExampleTable_Ratings_dim1_idxExampleTable_Ratings_dim1_idx ExampleTable_RatingsExampleTable_Ratings
11111111 00 55
11111111 11 66
22222222 00 11
22222222 11 22

后续步骤Next steps

有关 Azure 数据工厂中复制活动支持作为源和接收器的数据存储的列表,请参阅支持的数据存储For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported data stores.