Use Spark & Hive Tools for Visual Studio Code

Learn how to use Apache Spark & Hive Tools for Visual Studio Code. Use the tools to create and submit Apache Hive batch jobs, interactive Hive queries, and PySpark scripts for Apache Spark. First we'll describe how to install Spark & Hive Tools in Visual Studio Code. Then we'll walk through how to submit jobs to Spark & Hive Tools.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code. Note the following prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

Install Spark & Hive Tools

After you meet the prerequisites, you can install Spark & Hive Tools for Visual Studio Code by following these steps:

  1. Open Visual Studio Code.

  2. From the menu bar, navigate to View > Extensions.

  3. In the search box, enter Spark & Hive.

  4. Select Spark & Hive Tools from the search results, and then select Install:

    Spark & Hive for Visual Studio Code Python install

  5. Select Reload when necessary.

Open a work folder

To open a work folder and to create a file in Visual Studio Code, follow these steps:

  1. From the menu bar, navigate to File > Open Folder... > C:\HD\HDexample, and then select the Select Folder button. The folder appears in the Explorer view on the left.

  2. In Explorer view, select the HDexample folder, and then select the New File icon next to the work folder:

    Visual Studio Code New File icon

  3. Name the new file by using either the .hql (Hive queries) or the .py (Spark script) file extension. This example uses HelloWorld.hql.

Set the Azure environment

For a national cloud user, follow these steps to set the Azure environment first, and then use the Azure: Sign In command to sign in to Azure:

  1. Navigate to File > Preferences > Settings.

  2. Search on the following string: Azure: Cloud. (A settings.json sketch follows these steps.)

  3. Select the national cloud from the list:

    Set default sign-in entry configuration
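
If you prefer to edit the setting directly, the snippet below is a minimal sketch of what the national-cloud selection might look like in settings.json. It assumes the Azure Account extension exposes the choice as an azure.cloud setting and that AzureChinaCloud is one of its allowed values; verify both against the Settings UI from step 2 before relying on them.

    {
        // Assumption: the Azure Account extension stores the national-cloud choice
        // under "azure.cloud"; confirm the exact key and allowed values in the
        // Settings UI (File > Preferences > Settings > "Azure: Cloud").
        "azure.cloud": "AzureChinaCloud"
    }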

Connect to an Azure account

Before you can submit scripts to your clusters from Visual Studio Code, you can either sign in to your Azure subscription or link an HDInsight cluster. Use the Ambari username/password, or domain-joined credentials for an ESP cluster, to connect to your HDInsight cluster. Follow these steps to connect to Azure:

  1. From the menu bar, navigate to View > Command Palette..., and enter Azure: Sign In:

    Spark & Hive Tools for Visual Studio Code sign-in

  2. Follow the sign-in instructions to sign in to Azure. After you're connected, your Azure account name shows on the status bar at the bottom of the Visual Studio Code window.

You can link a normal cluster by using an Apache Ambari-managed username, or you can link an Enterprise Security Pack secure Hadoop cluster by using a domain username (such as user1@contoso.com).

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.

    Command Palette link cluster command

  2. Select the linked cluster type Azure HDInsight.

  3. Enter the HDInsight cluster URL.

  4. Enter your Ambari username; the default is admin.

  5. Enter your Ambari password.

  6. Select the cluster type.

  7. Set the display name of the cluster (optional).

  8. Review the OUTPUT view for verification.

    Note

    If the cluster is both signed in to through your Azure subscription and linked, the linked username and password are used.

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.

  2. Select the linked cluster type Generic Livy Endpoint.

  3. Enter the generic Livy endpoint. For example: http://10.172.41.42:18080.

  4. Select authorization type Basic or None. If you select Basic:

    1. Enter your Ambari username; the default is admin.

    2. Enter your Ambari password.

  5. Review the OUTPUT view for verification.

List clusters

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: List Cluster.

  2. Select the subscription that you want.

  3. Review the OUTPUT view. This view shows your linked cluster (or clusters) and all the clusters under your Azure subscription:

    Set default cluster configuration

Set the default cluster

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

  3. Right-click the script editor, and then select Spark / Hive: Set Default Cluster.

  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Select a cluster as the default cluster for the current script file. The tools automatically update the .VSCode\settings.json configuration file:

    Set default cluster configuration

Submit interactive Hive queries and Hive batch scripts

With Spark & Hive Tools for Visual Studio Code, you can submit interactive Hive queries and Hive batch scripts to your clusters.

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

  3. Copy and paste the following code into your Hive file, and then save it:

    SELECT * FROM hivesampletable;
    
  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Right-click the script editor and select Hive: Interactive to submit the query, or use the Ctrl+Alt+I keyboard shortcut. Select Hive: Batch to submit the script, or use the Ctrl+Alt+H keyboard shortcut.

  6. If you haven't specified a default cluster, select a cluster. The tools also let you submit a block of code instead of the whole script file by using the context menu. After a few moments, the query results appear in a new tab:

    Interactive Apache Hive query result

    • RESULTS panel: You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.

    • MESSAGES panel: When you select a line number, it jumps to the first line of the running script.

Submit interactive PySpark queries

You can run interactive PySpark queries in the following ways:

Using the PySpark interactive command in a PY file

To submit queries with the PySpark interactive command, follow these steps:

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Create a new HelloWorld.py file, following the earlier steps.

  3. Copy and paste the following code into the script file:

    # Word count over a sample file that ships with HDInsight clusters.
    from operator import add
    lines = spark.read.text("/HdiSamples/HdiSamples/FoodInspectionData/README").rdd.map(lambda r: r[0])
    # Split each line into words, pair each word with 1, and sum the pairs per word.
    counters = lines.flatMap(lambda x: x.split(' ')) \
                 .map(lambda x: (x, 1)) \
                 .reduceByKey(add)

    coll = counters.collect()
    sortedCollection = sorted(coll, key=lambda r: r[1], reverse=True)

    # Print the five most frequent words.
    for i in range(0, 5):
        print(sortedCollection[i])
    
  4. A prompt to install the PySpark/Synapse Pyspark kernel is displayed in the lower-right corner of the window. You can click the Install button to proceed with the PySpark/Synapse Pyspark installation, or click the Skip button to skip this step.

    Install the PySpark kernel

  5. If you need to install it later, you can navigate to File > Preferences > Settings, and then uncheck Hdinsight: Enable Skip Pyspark Installation in the settings.

    Install the PySpark kernel

  6. If the installation is successful in step 4, the "PySpark installed successfully" message box is displayed in the lower-right corner of the window. Click the Reload button to reload the window.

    PySpark installed successfully

  7. Use the command prompt to run pip install numpy==1.19.3, and then reload the VS Code window again.

  8. From the menu bar, navigate to View > Command Palette... or use the Shift + Ctrl + P keyboard shortcut, and enter Python: Select Interpreter to start Jupyter Server.

    Select interpreter to start Jupyter server

  9. Select the Python option below.

    Select the option below

  10. From the menu bar, navigate to View > Command Palette... or use the Shift + Ctrl + P keyboard shortcut, and enter Developer: Reload Window.

    Reload window

  11. Connect to your Azure account, or link a cluster if you haven't yet done so.

  12. Select all the code, right-click the script editor, and select Spark: PySpark Interactive / Synapse: Pyspark Interactive to submit the query.

    PySpark Interactive context menu

  13. Select the cluster, if you haven't specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. Click PySpark to switch the kernel to PySpark / Synapse Pyspark, and the code will run successfully. If you want to switch to the Synapse Pyspark kernel, disabling auto-settings in the Azure portal is encouraged; otherwise it may take a long while to wake up the cluster and set up the Synapse kernel on first use. The tools also let you submit a block of code instead of the whole script file by using the context menu:

    PySpark Interactive - Python Interactive window

  14. Enter %%info, and then press Shift+Enter to view the job information (optional):

    View job information

  15. The tool also supports Spark SQL queries (a short sketch follows this list):

    PySpark Interactive - view the results
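
    For example, a minimal Spark SQL snippet you could select and submit the same way. It is only a sketch: it assumes the interactive session exposes the spark session object (as in the word-count example above) and that the cluster has the standard hivesampletable sample table with a devicemake column.

    # A minimal Spark SQL sketch. Assumes the interactive session provides the
    # `spark` session object and that the cluster has the standard
    # hivesampletable sample table (which includes a devicemake column).
    df = spark.sql("SELECT devicemake, COUNT(*) AS cnt FROM hivesampletable GROUP BY devicemake")
    df.show(10)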

Perform an interactive query in a PY file using a #%% comment

  1. Add #%% before the Py code to get the notebook experience.

    Add #%%

  2. Click Run Cell. After a few moments, the Python Interactive results appear in a new tab. Click PySpark to switch the kernel to PySpark/Synapse PySpark, then click Run Cell again, and the code will run successfully. A minimal cell sketch follows this list.

    Result of the Run Cell command
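
As a rough sketch of what such a cell can look like, the snippet below reuses the read from the earlier HelloWorld.py example; the #%% marker is the only addition, and the spark session object is assumed to be provided by the PySpark / Synapse Pyspark kernel.

    #%%
    # A cell sketch: the #%% marker turns the lines below into a runnable cell.
    # Assumes the PySpark / Synapse Pyspark kernel provides the `spark` session.
    lines = spark.read.text("/HdiSamples/HdiSamples/FoodInspectionData/README").rdd.map(lambda r: r[0])
    print(lines.count())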

Leverage IPYNB support from the Python extension

  1. You can create a Jupyter Notebook by command from the Command Palette or by creating a new .ipynb file in your workspace. For more information, see Working with Jupyter Notebooks in Visual Studio Code.

  2. Click the Run Cell button, follow the prompts to set the default Spark pool (setting a default cluster/pool every time before opening a notebook is strongly encouraged), and then reload the window.

    Set default Spark pool and reload

  3. Click PySpark to switch the kernel to PySpark / Synapse Pyspark, and then click Run Cell; after a while, the result will be displayed.

    Result of running the ipynb

Note

The issue "ms-python >=2020.5.78807 version is not supported on this extension" has been resolved. The latest ms-python version can now be used.

Submit PySpark batch job

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Create a new BatchFile.py file by following the earlier steps.

  3. Copy and paste the following code into the script file:

    # Word count over a sample CSV file that ships with HDInsight clusters.
    from __future__ import print_function
    import sys
    from operator import add
    from pyspark.sql import SparkSession
    if __name__ == "__main__":
        # Create (or reuse) the Spark session for this batch job.
        spark = SparkSession\
            .builder\
            .appName("PythonWordCount")\
            .getOrCreate()

        # Split each line into words, pair each word with 1, and sum the pairs per word.
        lines = spark.read.text('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv').rdd.map(lambda r: r[0])
        counts = lines.flatMap(lambda x: x.split(' '))\
                    .map(lambda x: (x, 1))\
                    .reduceByKey(add)
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))
        spark.stop()
    
  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Right-click the script editor, and then select Spark: PySpark Batch or Synapse: PySpark Batch.

  6. Select a cluster/Spark pool to submit your PySpark job to:

    Submit Python job result output

After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. The Spark UI URL and Yarn UI URL are also shown. If you submit the batch job to an Apache Spark pool, the Spark History UI URL and the Spark Job Application UI URL are also shown. You can open the URL in a web browser to track the job status.

Integrate with HDInsight Identity Broker (HIB)

Connect to your HDInsight ESP cluster with ID Broker (HIB)

You can follow the normal steps to sign in to your Azure subscription to connect to your HDInsight ESP cluster with ID Broker (HIB). After sign-in, you'll see the cluster list in Azure Explorer. For more instructions, see Connect to your HDInsight cluster.

Run a Hive/PySpark job on an HDInsight ESP cluster with ID Broker (HIB)

To run a Hive job, you can follow the normal steps to submit the job to the HDInsight ESP cluster with ID Broker (HIB). For more instructions, see Submit interactive Hive queries and Hive batch scripts.

To run an interactive PySpark job, you can follow the normal steps to submit the job to the HDInsight ESP cluster with ID Broker (HIB). For more instructions, see Submit interactive PySpark queries.

To run a PySpark batch job, you can follow the normal steps to submit the job to the HDInsight ESP cluster with ID Broker (HIB). For more instructions, see Submit PySpark batch job.

Apache Livy configuration

Apache Livy configuration is supported. You can configure it in the .VSCode\settings.json file in the workspace folder. Currently, Livy configuration only supports Python scripts. For more information, see the Livy README.

How to trigger Livy configuration

Method 1

  1. From the menu bar, navigate to File > Preferences > Settings.
  2. In the Search settings box, enter HDInsight Job Submission: Livy Conf.
  3. Select Edit in settings.json for the relevant search result.

Method 2

Submit a file, and notice that the .vscode folder is automatically added to the work folder. You can see the Livy configuration by selecting .vscode\settings.json.

  • The project settings:

    HDInsight Apache Livy configuration

    Note

    For the driverMemory and executorMemory settings, set the value with a unit. For example: 1g or 1024m.

  • Supported Livy configurations:

    POST /batches

    Request body

    | name           | description                                    | type            |
    |----------------|------------------------------------------------|-----------------|
    | file           | File containing the application to execute     | Path (required) |
    | proxyUser      | User to impersonate when running the job       | String          |
    | className      | Application Java/Spark main class              | String          |
    | args           | Command-line arguments for the application     | List of strings |
    | jars           | Jars to be used in this session                | List of strings |
    | pyFiles        | Python files to be used in this session        | List of strings |
    | files          | Files to be used in this session               | List of strings |
    | driverMemory   | Amount of memory to use for the driver process | String          |
    | driverCores    | Number of cores to use for the driver process  | Int             |
    | executorMemory | Amount of memory to use per executor process   | String          |
    | executorCores  | Number of cores to use for each executor       | Int             |
    | numExecutors   | Number of executors to launch for this session | Int             |
    | archives       | Archives to be used in this session            | List of strings |
    | queue          | Name of the YARN queue to be submitted to      | String          |
    | name           | Name of this session                           | String          |
    | conf           | Spark configuration properties                 | Map of key=val  |

    Response body: the created Batch object.

    | name    | description                    | type            |
    |---------|--------------------------------|-----------------|
    | id      | Session ID                     | Int             |
    | appId   | Application ID of this session | String          |
    | appInfo | Detailed application info      | Map of key=val  |
    | log     | Log lines                      | List of strings |
    | state   | Batch state                    | String          |

    Note

    The assigned Livy configuration is displayed in the output pane when you submit the script. A settings.json sketch follows this section.
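
As a rough illustration of the kind of Livy configuration you can set, the sketch below uses only fields from the request-body table above, with values that follow the note about memory units. The surrounding setting key is whatever Edit in settings.json opens for you in Method 1; the key shown here is a placeholder, not the exact identifier.

    {
        // Placeholder key: open it via "Edit in settings.json" (Method 1) to get
        // the exact setting identifier used by the extension.
        "<HDInsight Job Submission: Livy Conf>": {
            // Fields below come from the POST /batches request body table above.
            "driverMemory": "2g",
            "executorMemory": "2g",
            "driverCores": 2,
            "executorCores": 2,
            "numExecutors": 2,
            "conf": {
                "spark.sql.shuffle.partitions": "200"
            }
        }
    }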

Integrate with Azure HDInsight from Explorer

You can preview a Hive table in your clusters directly through the Azure HDInsight explorer:

  1. Connect to your Azure account if you haven't yet done so.

  2. Select the Azure icon from the leftmost column.

  3. From the left pane, expand AZURE: HDINSIGHT. The available subscriptions and clusters are listed.

  4. Expand the cluster to view the Hive metadata database and table schema.

  5. Right-click the Hive table. For example: hivesampletable. Select Preview.

    Spark & Hive for Visual Studio Code - preview Hive table

  6. The Preview Results window opens:

    Spark & Hive for Visual Studio Code - preview results window

  • RESULTS panel

    You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.

  • MESSAGES panel

    1. When the number of rows in the table is greater than 100, you see the following message: "The first 100 rows are displayed for Hive table."

    2. When the number of rows in the table is less than or equal to 100, the message shows the actual row count. For example: "60 rows are displayed for Hive table."

    3. When there's no content in the table, you see the following message: "0 rows are displayed for Hive table."

      Note

      In Linux, install xclip to enable copying table data.

      Spark & Hive for Visual Studio Code in Linux

Additional features

Spark & Hive for Visual Studio Code also supports the following features:

  • IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and other programming elements. Different icons represent different types of objects:

    Spark & Hive Tools for Visual Studio Code - IntelliSense object types

  • IntelliSense error marker. The language service underlines editing errors in the Hive script.

  • Syntax highlights. The language service uses different colors to differentiate variables, keywords, data types, functions, and other programming elements:

    Spark & Hive Tools for Visual Studio Code - syntax highlights

Reader-only role

Users who are assigned the reader-only role for the cluster can't submit jobs to the HDInsight cluster, nor view the Hive database. Contact the cluster administrator to upgrade your role to HDInsight Cluster Operator in the Azure portal. If you have valid Ambari credentials, you can manually link the cluster by using the following guidance.

Browse the HDInsight cluster

When you select the Azure HDInsight explorer to expand an HDInsight cluster, you're prompted to link the cluster if you have the reader-only role for the cluster. Use the following method to link to the cluster by using your Ambari credentials.

Submit the job to the HDInsight cluster

When you submit a job to an HDInsight cluster, you're prompted to link the cluster if you're in the reader-only role for the cluster. Use the following steps to link to the cluster by using Ambari credentials.

  1. Enter a valid Ambari username.
  2. Enter a valid password.

Spark & Hive Tools for Visual Studio Code - username

Spark & Hive Tools for Visual Studio Code - password

Note

You can use Spark / Hive: List Cluster to check the linked cluster:

Spark & Hive Tools for Visual Studio Code - linked reader

Azure Data Lake Storage Gen2

Browse a Data Lake Storage Gen2 account

Select the Azure HDInsight explorer to expand a Data Lake Storage Gen2 account. You're prompted to enter the storage access key if your Azure account has no access to the Gen2 storage. After the access key is validated, the Data Lake Storage Gen2 account is auto-expanded.

Submit jobs to an HDInsight cluster with Data Lake Storage Gen2

Submit a job to an HDInsight cluster that uses Data Lake Storage Gen2. You're prompted to enter the storage access key if your Azure account has no write access to the Gen2 storage. After the access key is validated, the job will be submitted successfully.

Spark & Hive Tools for Visual Studio Code - access key

Note

You can get the access key for the storage account from the Azure portal. For more information, see Manage storage account access keys.

Unlink a cluster

  1. From the menu bar, go to View > Command Palette, and then enter Spark / Hive: Unlink a Cluster.

  2. Select a cluster to unlink.

  3. See the OUTPUT view for verification.

Sign out

From the menu bar, go to View > Command Palette, and then enter Azure: Sign Out.

Issues changed

The issue "ms-python >=2020.5.78807 version is not supported on this extension" has been resolved; update ms-python to the latest version now.

Next steps

For a video that demonstrates using Spark & Hive for Visual Studio Code, see Spark & Hive for Visual Studio Code.