Transform data using Spark activity in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities. When you use an on-demand Spark linked service, Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then deletes the cluster once the processing is complete.
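When you use an on-demand cluster, a separate HDInsight on-demand linked service describes the cluster that Data Factory should create. As a hedged reference, a minimal on-demand HDInsight Spark linked service can look roughly like the sketch below; the values shown (cluster size, time-to-live, storage reference) are illustrative assumptions, and the full schema is covered in the Compute linked services article.

{
    "name": "MyOnDemandHDInsightLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "spark",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "hostSubscriptionId": "<subscription id>",
            "clusterResourceGroup": "<resource group>",
            "linkedServiceName": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}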

Spark activity properties

Here is the sample JSON definition of a Spark Activity:

{
    "name": "Spark Activity",
    "description": "Description",
    "type": "HDInsightSpark",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "sparkJobLinkedService": {
            "referenceName": "MyAzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "rootPath": "adfspark",
        "entryFilePath": "test.py",
        "sparkConfig": {
            "ConfigItem1": "Value"
        },
        "getDebugInfo": "Failure",
        "arguments": [
            "SampleHadoopJobArgument1"
        ]
    }
}

The following table describes the JSON properties used in the JSON definition:

| Property | Description | Required |
| --- | --- | --- |
| name | Name of the activity in the pipeline. | Yes |
| description | Text describing what the activity does. | No |
| type | For Spark Activity, the activity type is HDInsightSpark. | Yes |
| linkedServiceName | Name of the HDInsight Spark linked service on which the Spark program runs. To learn about this linked service, see the Compute linked services article. | Yes |
| sparkJobLinkedService | The Azure Storage linked service that holds the Spark job file, dependencies, and logs. If you do not specify a value for this property, the storage associated with the HDInsight cluster is used. The value of this property can only be an Azure Storage linked service. | No |
| rootPath | The Azure Blob container and folder that contain the Spark file. The file name is case-sensitive. Refer to the Folder structure section (next section) for details about the structure of this folder. | Yes |
| entryFilePath | Relative path to the root folder of the Spark code/package. The entry file must be either a Python file or a .jar file. | Yes |
| className | Application's Java/Spark main class. | No |
| arguments | A list of command-line arguments to the Spark program. | No |
| proxyUser | The user account to impersonate to execute the Spark program. | No |
| sparkConfig | Specify values for the Spark configuration properties listed in the topic Spark Configuration - Application properties (see the example after this table). | No |
| getDebugInfo | Specifies when the Spark log files are copied to the Azure storage used by the HDInsight cluster or specified by sparkJobLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No |
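To illustrate sparkConfig and getDebugInfo together, here is a hedged typeProperties fragment. The Spark application properties shown (executor memory and cores) are standard Spark settings chosen purely as examples, not values the activity requires:

"typeProperties": {
    "rootPath": "adfspark",
    "entryFilePath": "test.py",
    "sparkConfig": {
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
    },
    "getDebugInfo": "Always"
}

With getDebugInfo set to Always, the logs are copied to the storage referenced by sparkJobLinkedService (or the cluster's default storage) regardless of whether the job succeeds or fails.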

Folder structure

Spark jobs are more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies such as jar packages (placed in the java CLASSPATH), python files (placed on the PYTHONPATH), and any other files.

Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service. Then, upload dependent files to the appropriate subfolders in the root folder represented by entryFilePath. For example, upload python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At runtime, the Data Factory service expects the following folder structure in the Azure Blob storage:

| Path | Description | Required | Type |
| --- | --- | --- | --- |
| . (root) | The root path of the Spark job in the storage linked service | Yes | Folder |
| <user defined> | The path pointing to the entry file of the Spark job | Yes | File |
| ./jars | All files under this folder are uploaded and placed on the java classpath of the cluster | No | Folder |
| ./pyFiles | All files under this folder are uploaded and placed on the PYTHONPATH of the cluster | No | Folder |
| ./files | All files under this folder are uploaded and placed on the executor working directory | No | Folder |
| ./archives | All files under this folder are uncompressed | No | Folder |
| ./logs | The folder that contains logs from the Spark cluster | No | Folder |

Here is an example of a storage containing two Spark job files in the Azure Blob storage referenced by the HDInsight linked service (a sketch of the corresponding activity properties follows the listing):

SparkJob1
    main.jar
    files
        input1.txt
        input2.txt
    jars
        package1.jar
        package2.jar
    logs

SparkJob2
    main.py
    pyFiles
        script1.py
        script2.py
    logs
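As a sketch of how this layout maps back to the activity properties: assuming the two job folders sit in a Blob container named adfspark (an illustrative name, as is the referenced storage linked service), the SparkJob2 run could be configured roughly as follows:

"typeProperties": {
    "sparkJobLinkedService": {
        "referenceName": "MyAzureStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "rootPath": "adfspark/SparkJob2",
    "entryFilePath": "main.py",
    "getDebugInfo": "Failure"
}

Here the files under pyFiles would be placed on the cluster's PYTHONPATH, and any debug logs would be written under ./logs.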

Next steps

See the following articles that explain how to transform data in other ways: