使用 Azure 数据工厂在 Azure Cosmos DB (SQL API) 中复制和转换数据Copy and transform data in Azure Cosmos DB (SQL API) by using Azure Data Factory

适用于:是 Azure 数据工厂是 Azure Synapse Analytics(预览版)APPLIES TO: yesAzure Data Factory yesAzure Synapse Analytics (Preview)

本文概述如何使用 Azure 数据工厂中的复制活动在 Azure Cosmos DB (SQL API) 中复制和粘贴数据。This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB (SQL API). 若要了解 Azure 数据工厂,请阅读介绍性文章To learn about Azure Data Factory, read the introductory article.

备注

此连接器仅支持 Cosmos DB SQL API。This connector only support Cosmos DB SQL API. 有关 MongoDB API,请参阅适用于 MongoDB 的 Auzre Cosmos DB API 的连接器For MongoDB API, refer to connector for Azure Cosmos DB's API for MongoDB. 目前不支持其他 API 类型。Other API types are not supported now.

支持的功能Supported capabilities

以下活动支持此 Azure Cosmos DB (SQL API) 连接器:This Azure Cosmos DB (SQL API) connector is supported for the following activities:

对于复制活动,此 Azure Cosmos DB(SQL API) 连接器支持:For Copy activity, this Azure Cosmos DB (SQL API) connector supports:

  • 在 Azure Cosmos DB SQL API 中复制和粘贴数据。Copy data from and to the Azure Cosmos DB SQL API.
  • insertupsert 的形式写入 Azure Cosmos DB。Write to Azure Cosmos DB as insert or upsert.
  • 按原样导入和导出 JSON 文档,或在表格数据集中复制或粘贴数据。Import and export JSON documents as-is, or copy data from or to a tabular dataset. 示例包括 SQL 数据库和 CSV 文件。Examples include a SQL database and a CSV file. 若要在 JSON 文件或另一个 Azure Cosmos DB 集合中按原样复制或粘贴文档,请参阅导入和导出 JSON 文档To copy documents as-is to or from JSON files or to or from another Azure Cosmos DB collection, see Import and export JSON documents.

数据工厂与 Azure Cosmos DB 批量执行程序库集成,以便在写入 Azure Cosmos DB 时提供最佳性能。Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when you write to Azure Cosmos DB.

入门Get started

若要使用管道执行复制活动,可以使用以下工具或 SDK 之一:To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

对于特定于 Azure Cosmos DB (SQL API) 的数据工厂实体,以下部分提供有关可用于定义这些实体的属性的详细信息。The following sections provide details about properties you can use to define Data Factory entities that are specific to Azure Cosmos DB (SQL API).

链接服务属性Linked service properties

Azure Cosmos DB (SQL API) 链接服务支持以下属性:The following properties are supported for the Azure Cosmos DB (SQL API) linked service:

属性Property 说明Description 必须Required
typetype type 属性必须设置为 CosmosDbThe type property must be set to CosmosDb. Yes
connectionStringconnectionString 指定连接 Azure Cosmos DB 数据库所需的信息。Specify information that's required to connect to the Azure Cosmos DB database.
注意:必须如以下示例所示,在连接字符串中指定数据库信息。Note: You must specify database information in the connection string as shown in the examples that follow.
还可以将帐户密钥放在 Azure 密钥保管库中,并从连接字符串中拉取 accountKey 配置。You can also put account key in Azure Key Vault and pull the accountKey configuration out of the connection string. 有关更多详细信息,请参阅以下示例和在 Azure 密钥保管库中存储凭据一文。Refer to the following samples and Store credentials in Azure Key Vault article with more details.
Yes
connectViaconnectVia 用于连接到数据存储的 Integration RuntimeThe Integration Runtime to use to connect to the data store. 可使用 Azure Integration Runtime 或自承载集成运行时(如果数据存储位于专用网络)。You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). 如果未指定此属性,则使用默认的 Azure Integration Runtime。If this property isn't specified, the default Azure Integration Runtime is used. No

示例Example

{
    "name": "CosmosDbSQLAPILinkedService",
    "properties": {
        "type": "CosmosDb",
        "typeProperties": {
            "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

示例:在 Azure 密钥保管库中存储帐户密钥Example: store account key in Azure Key Vault

{
    "name": "CosmosDbSQLAPILinkedService",
    "properties": {
        "type": "CosmosDb",
        "typeProperties": {
            "connectionString": "AccountEndpoint=<EndpointUrl>;Database=<Database>",
            "accountKey": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "<Azure Key Vault linked service name>", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "<secretName>" 
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

数据集属性Dataset properties

有关可用于定义数据集的各部分和属性的完整列表,请参阅数据集和链接服务For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.

Azure Cosmos DB (SQL API) 数据集支持以下属性:The following properties are supported for Azure Cosmos DB (SQL API) dataset:

属性Property 说明Description 必须Required
typetype 数据集的 type 属性必须设置为 CosmosDbSqlApiCollectionThe type property of the dataset must be set to CosmosDbSqlApiCollection. Yes
collectionNamecollectionName Azure Cosmos DB 文档集合的名称。The name of the Azure Cosmos DB document collection. Yes

如果使用“DocumentDbCollection”类型的数据集,则仍按原样提供支持,以实现复制和查找活动的后向兼容性。If you use "DocumentDbCollection" type dataset, it is still supported as-is for backward compatibility for Copy and Lookup activity. 建议你今后使用新模型。You are suggested to use the new model going forward.

示例Example

{
    "name": "CosmosDbSQLAPIDataset",
    "properties": {
        "type": "CosmosDbSqlApiCollection",
        "linkedServiceName":{
            "referenceName": "<Azure Cosmos DB linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "collectionName": "<collection name>"
        }
    }
}

复制活动属性Copy Activity properties

本部分列出了Azure Cosmos DB (SQL API) 源和接收器支持的属性。This section provides a list of properties that the Azure Cosmos DB (SQL API) source and sink support. 有关可用于定义活动的各个部分和属性的完整列表,请参阅管道For a full list of sections and properties that are available for defining activities, see Pipelines.

Azure Cosmos DB (SQL API) 用作源Azure Cosmos DB (SQL API) as source

若要从 Azure Cosmos DB (SQL API) 复制数据,请将复制活动中的 source 类型设置为 DocumentDbCollectionSourceTo copy data from Azure Cosmos DB (SQL API), set the source type in Copy Activity to DocumentDbCollectionSource.

复制活动 source 节支持以下属性:The following properties are supported in the Copy Activity source section:

属性Property 说明Description 必须Required
typetype 复制活动源的 type 属性必须设置为 CosmosDbSqlApiSourceThe type property of the copy activity source must be set to CosmosDbSqlApiSource. Yes
查询query 指定要读取数据的 Azure Cosmos DB 查询。Specify the Azure Cosmos DB query to read data.

示例:Example:
SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"
No

如果未指定,则执行此 SQL 语句:select <columns defined in structure> from mycollectionIf not specified, this SQL statement is executed: select <columns defined in structure> from mycollection
preferredRegionspreferredRegions 从 Cosmos DB 检索数据时要连接到的区域的首选列表。The preferred list of regions to connect to when retrieving data from Cosmos DB. No
pageSizepageSize 查询结果的每页文档数。The number of documents per page of the query result. 默认值为“-1”,表示使用服务端动态页大小,最大为 1000。Default is "-1" which means uses the service side dynamic page size up to 1000. No

如果使用“DocumentDbCollectionSource”类型的源,则仍按原样提供支持以实现后向兼容性。If you use "DocumentDbCollectionSource" type source, it is still supported as-is for backward compatibility. 建议今后使用新模型,新模型提供了更丰富的功能来从 Cosmos DB 复制数据。You are suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos DB.

示例Example

"activities":[
    {
        "name": "CopyFromCosmosDBSQLAPI",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Cosmos DB SQL API input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "CosmosDbSqlApiSource",
                "query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"",
                "preferredRegions": [
                    "China East 2"
                ]
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

从 Cosmos DB 复制数据时,除非你想原样导出 JSON 文档,否则,最佳做法是在复制活动中指定映射。When copy data from Cosmos DB, unless you want to export JSON documents as-is, the best practice is to specify the mapping in copy activity. 数据工厂遵循你在活动上指定的映射 - 如果某个行的某个列中未包含值,则会为列值提供 null 值。Data Factory honors the mapping you specified on the activity - if a row doesn't contain a value for a column, a null value is provided for the column value. 如果未指定映射,则数据工厂将使用数据中的第一行来推断架构。If you don't specify a mapping, Data Factory infers the schema by using the first row in the data. 如果第一行不包含完整架构,则活动操作的结果中将丢失部分列。If the first row doesn't contain the full schema, some columns will be missing in the result of the activity operation.

Azure Cosmos DB (SQL API) 用作接收器Azure Cosmos DB (SQL API) as sink

若要将数据复制到 Azure Cosmos DB (SQL API),请将复制活动中的 sink 类型设置为 DocumentDbCollectionSinkTo copy data to Azure Cosmos DB (SQL API), set the sink type in Copy Activity to DocumentDbCollectionSink.

复制活动 source 节支持以下属性:The following properties are supported in the Copy Activity source section:

属性Property 说明Description 必须Required
typetype 复制活动接收器的 type 属性必须设置为 CosmosDbSqlApiSinkThe type property of the Copy Activity sink must be set to CosmosDbSqlApiSink. Yes
writeBehaviorwriteBehavior 描述如何将数据写入 Azure Cosmos DB。Describes how to write data to Azure Cosmos DB. 允许的值为 insertupsertAllowed values: insert and upsert.

upsert 的行为是,如果已存在具有相同 ID 的文档,则替换该文档;否则将插入该文档。The behavior of upsert is to replace the document if a document with the same ID already exists; otherwise, insert the document.

注意:如果未在原始文档中指定 ID,或未通过列映射指定 ID,则数据工厂会自动为文档生成 ID。Note: Data Factory automatically generates an ID for a document if an ID isn't specified either in the original document or by column mapping. 这表示必须先确保文档有 ID,才能让 upsert 按预期工作。This means that you must ensure that, for upsert to work as expected, your document has an ID.
No
(默认值为 insert(the default is insert)
writeBatchSizewriteBatchSize 数据工厂使用 Azure Cosmos DB 批量执行程序库将数据写入 Azure Cosmos DB。Data Factory uses the Azure Cosmos DB bulk executor library to write data to Azure Cosmos DB. writeBatchSize 属性控制 ADF 提供给库的文档的大小。The writeBatchSize property controls the size of documents that ADF provides to the library. 可尝试增加 writeBatchSize 的值以提高性能,并在文档大小较大时降低该值 - 请参阅下面的提示。You can try increasing the value for writeBatchSize to improve performance and decreasing the value if your document size being large - see below tips. No
(默认值为 10,000(the default is 10,000)
disableMetricsCollectiondisableMetricsCollection 数据工厂收集指标(如 Cosmos DB RU),以获取复制性能优化和建议。Data Factory collects metrics such as Cosmos DB RUs for copy performance optimization and recommendations. 如果你担心此行为,请指定 true 将其关闭。If you are concerned with this behavior, specify true to turn it off. 否(默认值为 falseNo (default is false)

提示

若要按原样导入 JSON 文档,请参阅导入或导出 JSON 文档部分;若要从表格形数据复制,请参阅从关系数据库迁移到 Cosmos DBTo import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data, refer to Migrate from relational database to Cosmos DB.

提示

Cosmos DB 将单个请求的大小限制为 2MB。Cosmos DB limits single request's size to 2MB. 公式为请求大小 = 单个文档大小 * 写入批大小。The formula is Request Size = Single Document Size * Write Batch Size. 若出现“请求太大。”错误,请减少复制接收器配置中的 writeBatchSize 值 。If you hit error saying "Request size is too large.", reduce the writeBatchSize value in copy sink configuration.

如果使用“DocumentDbCollectionSink”类型的源,则仍按原样提供支持以实现后向兼容性。If you use "DocumentDbCollectionSink" type source, it is still supported as-is for backward compatibility. 建议今后使用新模型,新模型提供了更丰富的功能来从 Cosmos DB 复制数据。You are suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos DB.

示例Example

"activities":[
    {
        "name": "CopyToCosmosDBSQLAPI",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Document DB output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "CosmosDbSqlApiSink",
                "writeBehavior": "upsert"
            }
        }
    }
]

架构映射Schema mapping

要将数据从 Azure Cosmos DB 复制到表格接收器或进行反向复制,请参阅架构映射To copy data from Azure Cosmos DB to tabular sink or reversed, refer to schema mapping.

查找活动属性Lookup activity properties

若要了解有关属性的详细信息,请查看 Lookup 活动To learn details about the properties, check Lookup activity.

导入和导出 JSON 文档Import and export JSON documents

使用此 Azure Cosmos DB(SQL API) 连接器,可以轻松地:You can use this Azure Cosmos DB (SQL API) connector to easily:

  • 在两个 Azure Cosmos DB 集合之间按原样复制文档。Copy documents between two Azure Cosmos DB collections as-is.
  • 将各种源(包括 Azure Blob 存储和 Azure 数据工厂所支持的其他基于文件的存储)中的 JSON 文档导入 Azure Cosmos DB。Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage, and other file-based stores that Azure Data Factory supports.
  • 将 JSON 文档从 Azure Cosmos DB 集合导出到各种基于文件的存储。Export JSON documents from an Azure Cosmos DB collection to various file-based stores.

若要实现“架构不可知”复制,请执行以下操作:To achieve schema-agnostic copy:

  • 使用复制数据工具时,选择“原样导出到 JSON 文件或 Cosmos DB 集合”选项。When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
  • 使用活动创作时,请为源或接收器选择 JSON 格式以及相应的文件存储。When you use activity authoring, choose JSON format with the corresponding file store for source or sink.

从关系数据库迁移到 Cosmos DBMigrate from relational database to Cosmos DB

从关系数据库(例如 SQL Server)迁移到 Azure Cosmos DB 时,复制活动可以轻松地从源映射表格数据,以在 Cosmos DB 中平展 JSON 文档。When migrating from a relational database e.g. SQL Server to Azure Cosmos DB, copy activity can easily map tabular data from source to flatten JSON documents in Cosmos DB. 某些情况下,你可能希望根据 Azure Cosmos DB 中的数据建模重新设计数据模型,以便来针对 NoSQL 用例对其进行优化,例如,通过将所有相关子项嵌入到一个 JSON 文档中来使数据非规范化。In some cases, you may want to redesign the data model to optimize it for the NoSQL use-cases according to Data modeling in Azure Cosmos DB, for example, to denormalize the data by embedding all of the related sub-items within one JSON document. 对于这种情况,请参阅此文,其中演练了如何使用 Azure 数据工厂复制活动实现此目的。For such case, refer to this article with a walkthrough on how to achieve it using Azure Data Factory copy activity.

后续步骤Next steps

有关 Azure 数据工厂中复制活动支持用作源和接收器的数据存储的列表,请参阅支持的数据存储For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported data stores.