Copy data from an HTTP endpoint by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines how to use Copy Activity in Azure Data Factory to copy data from an HTTP endpoint. The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

The differences among this HTTP connector, the REST connector, and the Web table connector are:

  • The REST connector specifically supports copying data from RESTful APIs.
  • The HTTP connector is generic: it retrieves data from any HTTP endpoint, for example, to download a file. Before the REST connector became available, you might have used the HTTP connector to copy data from a RESTful API; that approach is supported but offers less functionality than the REST connector.
  • The Web table connector extracts table content from an HTML webpage.

Supported capabilities

This HTTP connector is supported for the following activities:

  • Copy activity with supported source/sink matrix
  • Lookup activity

You can copy data from an HTTP source to any supported sink data store. For a list of data stores that Copy Activity supports as sources and sinks, see Supported data stores and formats.

You can use this HTTP connector to:

  • Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
  • Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or ClientCertificate.
  • Copy the HTTP response as-is or parse it by using supported file formats and compression codecs (see the sketch that follows this list).
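
For the as-is scenario, the HTTP connector is typically paired with the Binary format so the response is copied without parsing. The following is a minimal sketch, assuming a Binary-format dataset; the dataset name, linked service name, and relative URL are placeholders:

{
    "name": "BinaryHttpDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": "<relative url>"
            }
        }
    }
}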

Tip

To test an HTTP request for data retrieval before you configure the HTTP connector in Data Factory, learn about the API specification for header and body requirements. You can use tools like Postman or a web browser to validate.

Prerequisites

If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private Cloud, you need to set up a self-hosted integration runtime to connect to it.

If your data store is a managed cloud data service, you can use the Azure integration runtime. If access is restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow list.

For more information about the network security mechanisms and options supported by Data Factory, see Data access strategies.

Get started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

  • The Copy Data tool
  • The Azure portal
  • The .NET SDK
  • The Python SDK
  • Azure PowerShell
  • The REST API
  • The Azure Resource Manager template

The following sections provide details about properties you can use to define Data Factory entities that are specific to the HTTP connector.

Linked service properties

The following properties are supported for the HTTP linked service:

Property | Description | Required
-------- | ----------- | --------
type | The type property must be set to HttpServer. | Yes
url | The base URL to the web server. | Yes
enableServerCertificateValidation | Specify whether to enable server TLS/SSL certificate validation when you connect to an HTTP endpoint. If your HTTPS server uses a self-signed certificate, set this property to false. (The default is true.) | No
authenticationType | Specifies the authentication type. Allowed values are Anonymous, Basic, Digest, Windows, and ClientCertificate. See the sections that follow this table for more properties and JSON samples for these authentication types. | Yes
connectVia | The Integration Runtime to use to connect to the data store. Learn more from the Prerequisites section. If not specified, the default Azure Integration Runtime is used. | No
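
For example, a minimal linked service that uses Anonymous authentication might look like the following sketch; the endpoint URL and Integration Runtime name are placeholders:

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "<HTTP endpoint>",
            "enableServerCertificateValidation": true
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}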

Using Basic, Digest, or Windows authentication

Set the authenticationType property to Basic, Digest, or Windows. In addition to the generic properties that are described in the preceding section, specify the following properties:

Property | Description | Required
-------- | ----------- | --------
userName | The user name to use to access the HTTP endpoint. | Yes
password | The password for the user (the userName value). Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. | Yes

Example

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "Basic",
            "url" : "<HTTP endpoint>",
            "userName": "<user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Using ClientCertificate authentication

To use ClientCertificate authentication, set the authenticationType property to ClientCertificate. In addition to the generic properties that are described in the preceding section, specify the following properties:

Property | Description | Required
-------- | ----------- | --------
embeddedCertData | Base64-encoded certificate data. | Specify either embeddedCertData or certThumbprint.
certThumbprint | The thumbprint of the certificate that's installed on your self-hosted Integration Runtime machine's cert store. Applies only when the self-hosted type of Integration Runtime is specified in the connectVia property. | Specify either embeddedCertData or certThumbprint.
password | The password that's associated with the certificate. Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault. | No

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local computer, grant read permissions to the self-hosted Integration Runtime:

  1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
  2. Expand Certificates > Personal, and then select Certificates.
  3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
  4. On the Security tab, add the user account under which the Integration Runtime Host Service (DIAHostService) is running, with read access to the certificate.

Example 1: Using certThumbprint

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "<HTTP endpoint>",
            "certThumbprint": "<thumbprint of certificate>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Example 2: Using embeddedCertData

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "<HTTP endpoint>",
            "embeddedCertData": "<Base64-encoded cert data>",
            "password": {
                "type": "SecureString",
                "value": "password of cert"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HTTP under location settings in a format-based dataset:

Property | Description | Required
-------- | ----------- | --------
type | The type property under location in the dataset must be set to HttpServerLocation. | Yes
relativeUrl | A relative URL to the resource that contains the data. The HTTP connector copies data from the combined URL: [URL specified in linked service][relative URL specified in dataset]. | No

Note

The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is larger than 500 KB, consider batching the payload in smaller chunks.

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": "<relative url>"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

Copy Activity properties

This section provides a list of properties that the HTTP source supports.

For a full list of sections and properties that are available for defining activities, see Pipelines.

HTTP as source

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HTTP under storeSettings settings in a format-based copy source:

Property | Description | Required
-------- | ----------- | --------
type | The type property under storeSettings must be set to HttpReadSettings. | Yes
requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No
additionalHeaders | Additional HTTP request headers. | No
requestBody | The body for the HTTP request. | No
httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No
maxConcurrentConnections | The number of connections that can connect to the data store concurrently. Specify only when you want to limit concurrent connections to the data store. | No

Example:

"activities":[
    {
        "name": "CopyFromHTTP",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "HttpReadSettings",
                    "requestMethod": "Post",
                    "additionalHeaders": "<header key: header value>\n<header key: header value>\n",
                    "requestBody": "<body for POST HTTP request>"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Lookup activity properties

To learn details about the properties, check Lookup activity.
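
As an illustration, a Lookup activity that reads a small delimited-text response over HTTP might be sketched as follows. This is a minimal sketch, assuming the delimited text dataset shown earlier; the activity name, dataset name, and firstRowOnly setting are illustrative, not prescriptive:

{
    "name": "LookupFromHTTP",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "HttpReadSettings",
                "requestMethod": "Get"
            },
            "formatSettings": {
                "type": "DelimitedTextReadSettings"
            }
        },
        "dataset": {
            "referenceName": "<Delimited text dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}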

Legacy models

Note

The following models are still supported as-is for backward compatibility. We recommend that you use the new model described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.

Legacy dataset model

Property | Description | Required
-------- | ----------- | --------
type | The type property of the dataset must be set to HttpFile. | Yes
relativeUrl | A relative URL to the resource that contains the data. When this property isn't specified, only the URL that's specified in the linked service definition is used. | No
requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No
additionalHeaders | Additional HTTP request headers. | No
requestBody | The body for the HTTP request. | No
format | If you want to retrieve data from the HTTP endpoint as-is without parsing it, and then copy the data to a file-based store, skip the format section in both the input and output dataset definitions. If you want to parse the HTTP response content during copy, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Under format, set the type property to one of these values. For more information, see JSON format, Text format, Avro format, Orc format, and Parquet format. | No
compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types: GZip, Deflate, BZip2, and ZipDeflate. Supported levels: Optimal and Fastest. | No

Note

The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is larger than 500 KB, consider batching the payload in smaller chunks.

Example 1: Using the Get method (default)

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        }
    }
}

Example 2: Using the Post method

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "requestMethod": "Post",
            "requestBody": "<body for POST HTTP request>"
        }
    }
}

Legacy copy activity source model

Property | Description | Required
-------- | ----------- | --------
type | The type property of the copy activity source must be set to HttpSource. | Yes
httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No

Example

"activities":[
    {
        "name": "CopyFromHTTP",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<HTTP input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "HttpSource",
                "httpRequestTimeout": "00:01:00"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Next steps

For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported data stores and formats.