Use Spark & Hive Tools for Visual Studio Code

Learn how to use Apache Spark & Hive Tools for Visual Studio Code. Use the tools to create and submit Apache Hive batch jobs, interactive Hive queries, and PySpark scripts for Apache Spark. First we'll describe how to install Spark & Hive Tools in Visual Studio Code. Then we'll walk through how to submit jobs to Spark & Hive Tools.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code. Note the following prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

Install Spark & Hive Tools

After you meet the prerequisites, you can install Spark & Hive Tools for Visual Studio Code by following these steps:

  1. Open Visual Studio Code.

  2. From the menu bar, navigate to View > Extensions.

  3. In the search box, enter Spark & Hive.

  4. Select Spark & Hive Tools from the search results, and then select Install:

    Spark & Hive for Visual Studio Code Python install

  5. Select Reload when necessary.

Open a work folder

To open a work folder and to create a file in Visual Studio Code, follow these steps:

  1. From the menu bar, navigate to File > Open Folder... > C:\HD\HDexample, and then select the Select Folder button. The folder appears in the Explorer view on the left.

  2. In Explorer view, select the HDexample folder, and then select the New File icon next to the work folder:

    Visual Studio Code New File icon

  3. Name the new file by using either the .hql (Hive queries) or the .py (Spark script) file extension. This example uses HelloWorld.hql.

Connect to an Azure account

Before you can submit scripts to your clusters from Visual Studio Code, you can either sign in to your Azure subscription or link an HDInsight cluster. Use the Ambari username and password to connect to your HDInsight cluster. Follow these steps to connect to Azure:

  1. From the menu bar, navigate to View > Command Palette..., and enter Azure: Sign In:

    Spark & Hive Tools for Visual Studio Code sign-in

  2. Follow the sign-in instructions to sign in to Azure. After you're connected, your Azure account name is shown on the status bar at the bottom of the Visual Studio Code window.

You can link a normal cluster by using an Apache Ambari managed username.

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.

    Command Palette link-a-cluster command

  2. Select the linked cluster type Azure HDInsight.

  3. Enter the HDInsight cluster URL.

  4. Enter your Ambari username; the default is admin.

  5. Enter your Ambari password.

  6. Select the cluster type.

  7. Set the display name of the cluster (optional).

  8. Review the OUTPUT view for verification.

    Note

    If the cluster is both signed in to the Azure subscription and linked, the linked username and password are used.

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.

  2. Select the linked cluster type Generic Livy Endpoint.

  3. Enter the generic Livy endpoint. For example: http://10.172.41.42:18080. (A quick reachability check is sketched after these steps.)

  4. Select the authorization type Basic or None. If you select Basic:
     a. Enter your Ambari username; the default is admin.
     b. Enter your Ambari password.

  5. Review the OUTPUT view for verification.
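
If you want to confirm that a generic Livy endpoint is reachable before linking it, you can probe its REST API directly. The snippet below is only a minimal sketch, not part of the tools; the endpoint URL and credentials are placeholders, and Livy's GET /sessions call is used purely as a health check.

    import requests
    from requests.auth import HTTPBasicAuth

    livy_url = "http://10.172.41.42:18080"          # the generic Livy endpoint you plan to link
    auth = HTTPBasicAuth("admin", "<password>")     # drop auth= below if you chose None

    # Livy lists its active sessions at GET /sessions; an HTTP 200 response means the
    # endpoint is up and the credentials are accepted.
    response = requests.get(f"{livy_url}/sessions", auth=auth, timeout=10)
    print(response.status_code, response.json())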

List clusters

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: List Cluster.

  2. Select the subscription that you want.

  3. Review the OUTPUT view. This view shows your linked cluster (or clusters) and all the clusters under your Azure subscription:

    Set default cluster configuration

Set the default cluster

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

  3. Right-click the script editor, and then select Spark / Hive: Set Default Cluster.

  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Select a cluster as the default cluster for the current script file. The tools automatically update the .VSCode\settings.json configuration file:

    Set default cluster configuration

Submit interactive Hive queries and Hive batch scripts

With Spark & Hive Tools for Visual Studio Code, you can submit interactive Hive queries and Hive batch scripts to your clusters.

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

  3. Copy and paste the following code into your Hive file, and then save it:

    SELECT * FROM hivesampletable;
    
  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Right-click the script editor and select Hive: Interactive to submit the query, or use the Ctrl+Alt+I keyboard shortcut. Select Hive: Batch to submit the script, or use the Ctrl+Alt+H keyboard shortcut.

  6. If you haven't specified a default cluster, select a cluster. The tools also let you submit a block of code instead of the whole script file by using the context menu. After a few moments, the query results appear in a new tab:

    Interactive Apache Hive query results

    • RESULTS panel: You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.

    • MESSAGES panel: When you select a line number, it jumps to the first line of the running script.

Submit interactive PySpark queries

You can run PySpark interactively in the following ways:

Use the PySpark interactive command in a PY file

To submit queries with the PySpark interactive command, follow these steps:

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Create a new HelloWorld.py file by following the earlier steps.

  3. Copy and paste the following code into the script file:

    from operator import add

    # 'spark' is the SparkSession provided by the PySpark interactive session.
    # Count the words in a sample file on the cluster's default storage.
    lines = spark.read.text("/HdiSamples/HdiSamples/FoodInspectionData/README").rdd.map(lambda r: r[0])
    counters = lines.flatMap(lambda x: x.split(' ')) \
                    .map(lambda x: (x, 1)) \
                    .reduceByKey(add)

    coll = counters.collect()
    sortedCollection = sorted(coll, key=lambda r: r[1], reverse=True)

    # Print the five most frequent words.
    for i in range(0, 5):
        print(sortedCollection[i])
    
  4. A prompt to install the PySpark kernel is displayed in the lower-right corner of the window. You can click the Install button to proceed with the PySpark installation, or click the Skip button to skip this step.

    Install PySpark kernel

  5. If you need to install it later, you can navigate to File > Preferences > Settings, and then uncheck Hdinsight: Enable Skip Pyspark Installation in the settings.

    Install PySpark kernel

  6. If the installation is successful in step 4, the "PySpark installed successfully" message box is displayed in the lower-right corner of the window. Click the Reload button to reload the window.

    PySpark installed successfully

  7. Connect to your Azure account, or link a cluster if you haven't yet done so.

  8. Select all the code, right-click the script editor, and select Spark: PySpark Interactive to submit the query. Or, use the Ctrl+Alt+I shortcut.

    PySpark Interactive context menu

  9. If you haven't specified a default cluster, select a cluster. After a few moments, the Python Interactive results appear in a new tab. Click PySpark to switch the kernel to PySpark, and the code runs successfully. The tools also let you submit a block of code instead of the whole script file by using the context menu:

    PySpark Interactive - Python Interactive window

  10. Enter %%info, and then press Shift+Enter to view the job information (optional):

    View job information

  11. The tool also supports Spark SQL queries, as sketched after this list:

    PySpark Interactive - view results
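
As an illustration only (not part of the original walkthrough), a Spark SQL query like the one below could be selected in the .py file and submitted with Spark: PySpark Interactive. It assumes the HDInsight sample table hivesampletable, with its devicemodel column, is present on the cluster; adjust the table and column names for your own data.

    # Query a Hive table through Spark SQL inside the PySpark interactive session.
    # 'spark' is the SparkSession provided by the session.
    devices = spark.sql(
        "SELECT devicemodel, COUNT(*) AS cnt "
        "FROM hivesampletable "
        "GROUP BY devicemodel "
        "ORDER BY cnt DESC")
    devices.show(10)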

Perform interactive queries in a PY file using a #%% comment

  1. Add #%% before the Py code to get the notebook experience (see the sketch after this list).

    Add #%%

  2. Click Run Cell. After a few moments, the Python Interactive results appear in a new tab.

    Run Cell command results

    Note

    If the kernel or settings get into a bad state, run the Python: Select Interpreter to start Jupyter server command and Restart IPython kernel, and then reload Visual Studio Code to resolve the issue.
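
For illustration, here is a minimal sketch of what such a cell-annotated .py file might look like. The contents are hypothetical; it assumes the HDInsight sample CSV at /HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv (also used in the batch example later) has a header row:

    #%% Load the sample data; 'spark' is provided by the PySpark session
    hvac = spark.read.csv('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv',
                          header=True, inferSchema=True)

    #%% Show the first rows in the Python Interactive window
    hvac.show(5)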

Leverage IPYNB support from the Python extension

  1. You can create a Jupyter Notebook by command from the Command Palette or by creating a new .ipynb file in your workspace. For more information, see Working with Jupyter Notebooks in Visual Studio Code.

  2. Click PySpark to switch the kernel to PySpark, and then click Run Cell. After a while, the result is displayed.

    Result of running an ipynb file

Note

ms-python versions 2020.5.78807 and later are not supported by this extension; this is a known issue.

Submit a PySpark batch job

  1. Reopen the HDexample folder that was discussed earlier, if closed.

  2. Create a new BatchFile.py file by following the earlier steps.

  3. Copy and paste the following code into the script file:

    from __future__ import print_function
    import sys
    from operator import add
    from pyspark.sql import SparkSession
    if __name__ == "__main__":
        spark = SparkSession\
            .builder\
            .appName("PythonWordCount")\
            .getOrCreate()
    
        lines = spark.read.text('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv').rdd.map(lambda r: r[0])
        counts = lines.flatMap(lambda x: x.split(' '))\
                    .map(lambda x: (x, 1))\
                    .reduceByKey(add)
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))
        spark.stop()
    
  4. Connect to your Azure account, or link a cluster if you haven't yet done so.

  5. Right-click the script editor, and then select Spark: PySpark Batch, or use the Ctrl+Alt+H keyboard shortcut.

  6. Select a cluster to submit your PySpark job to:

    Submit Python job result

After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. The Spark UI URL and Yarn UI URL are also shown. You can open the URL in a web browser to track the job status.

Apache Livy configuration

Apache Livy configuration is supported. You can configure it in the .VSCode\settings.json file in the workspace folder. Currently, Livy configuration only supports Python scripts. For more information, see the Livy README.

How to trigger Livy configuration

Method 1

  1. From the menu bar, navigate to File > Preferences > Settings.
  2. In the Search settings box, enter HDInsight Job Submission: Livy Conf.
  3. Select Edit in settings.json for the relevant search result.

Method 2
Submit a file, and notice that the .vscode folder is automatically added to the work folder. You can see the Livy configuration by selecting .vscode\settings.json.

  • The project settings:

    Livy configuration

    Note

    For the driverMemory and executorMemory settings, set the value and unit. For example: 1g or 1024m.

  • Supported Livy configurations:

    POST /batches
    Request body

    name            description                                      type
    file            File containing the application to execute      Path (required)
    proxyUser       User to impersonate when running the job        String
    className       Application Java/Spark main class               String
    args            Command-line arguments for the application      List of strings
    jars            Jars to be used in this session                 List of strings
    pyFiles         Python files to be used in this session         List of strings
    files           Files to be used in this session                List of strings
    driverMemory    Amount of memory to use for the driver process  String
    driverCores     Number of cores to use for the driver process   Int
    executorMemory  Amount of memory to use per executor process    String
    executorCores   Number of cores to use for each executor        Int
    numExecutors    Number of executors to launch for this session  Int
    archives        Archives to be used in this session             List of strings
    queue           Name of the YARN queue to be submitted to       String
    name            Name of this session                            String
    conf            Spark configuration properties                  Map of key=val

    Response body
    The created Batch object.

    name       description                       type
    id         Session ID                        Int
    appId      Application ID of this session    String
    appInfo    Detailed application info         Map of key=val
    log        Log lines                         List of strings
    state      Batch state                       String

    Note

    The assigned Livy configuration is displayed in the output pane when you submit the script.
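
The Livy settings above correspond to the fields of Livy's POST /batches request body. Purely as a hedged illustration (not something the tools require), the sketch below submits a batch directly to a cluster's Livy endpoint and polls its state; the endpoint URL, credentials, and file path are placeholders to replace with your own.

    import time
    import requests
    from requests.auth import HTTPBasicAuth

    # An HDInsight cluster typically exposes Livy at https://<cluster>.azurehdinsight.net/livy.
    livy = "https://mycluster.azurehdinsight.net/livy"
    auth = HTTPBasicAuth("admin", "<ambari-password>")
    headers = {"X-Requested-By": "admin"}             # required by some Livy deployments for POST

    # The body uses the same field names as the Livy configuration table above.
    body = {
        "file": "wasbs:///example/app/BatchFile.py",  # application to execute (Path, required)
        "driverMemory": "1g",
        "executorMemory": "1g",
        "numExecutors": 2,
        "name": "pyspark-batch-example",
    }

    batch = requests.post(f"{livy}/batches", json=body, auth=auth, headers=headers).json()
    print("Created batch", batch["id"], "state:", batch["state"])

    # Poll until Livy reports a terminal state for the batch.
    while True:
        state = requests.get(f"{livy}/batches/{batch['id']}/state", auth=auth).json()["state"]
        print("state:", state)
        if state in ("success", "dead", "killed"):
            break
        time.sleep(10)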

Integrate with Azure HDInsight from Explorer

You can preview Hive tables in your clusters directly through the Azure HDInsight explorer:

  1. Connect to your Azure account if you haven't yet done so.

  2. Select the Azure icon from the leftmost column.

  3. From the left pane, expand AZURE: HDINSIGHT. The available subscriptions and clusters are listed.

  4. Expand the cluster to view the Hive metadata database and table schema.

  5. Right-click the Hive table. For example: hivesampletable. Select Preview.

    Spark & Hive for Visual Studio Code - preview Hive table

  6. The Preview Results window opens:

    Spark & Hive for Visual Studio Code - Preview Results window

  • RESULTS panel

    You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.

  • MESSAGES panel

    1. When the number of rows in the table is greater than 100, you see the following message: "The first 100 rows are displayed for Hive table."

    2. When the number of rows in the table is less than or equal to 100, the actual row count is reported, for example: "60 rows are displayed for Hive table."

    3. When there's no content in the table, you see the following message: "0 rows are displayed for Hive table."

      Note

      In Linux, install xclip to enable copying table data.

      Spark & Hive for Visual Studio Code in Linux

Additional features

Spark & Hive for Visual Studio Code also supports the following features:

  • IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and other programming elements. Different icons represent different types of objects:

    Spark & Hive Tools for Visual Studio Code IntelliSense object types

  • IntelliSense error marker. The language service underlines editing errors in the Hive script.

  • Syntax highlights. The language service uses different colors to differentiate variables, keywords, data types, functions, and other programming elements:

    Spark & Hive Tools for Visual Studio Code syntax highlights

Reader-only role

Users who are assigned the reader-only role for the cluster can't submit jobs to the HDInsight cluster or view the Hive database. Contact the cluster administrator to upgrade your role to HDInsight Cluster Operator in the Azure portal. If you have valid Ambari credentials, you can manually link the cluster by using the following guidance.

Browse the HDInsight cluster

When you select the Azure HDInsight explorer to expand an HDInsight cluster, you're prompted to link the cluster if you have the reader-only role for the cluster. Use the following method to link to the cluster by using your Ambari credentials.

Submit the job to the HDInsight cluster

When you submit a job to an HDInsight cluster, you're prompted to link the cluster if you have the reader-only role for the cluster. Use the following steps to link to the cluster by using your Ambari credentials.

  1. Enter a valid Ambari username.
  2. Enter a valid password.

Spark & Hive Tools for Visual Studio Code username

Spark & Hive Tools for Visual Studio Code password

Note

You can use Spark / Hive: List Cluster to check the linked cluster:

Spark & Hive Tools for Visual Studio Code linked reader

Azure Data Lake Storage Gen2

Browse a Data Lake Storage Gen2 account

Select the Azure HDInsight explorer to expand a Data Lake Storage Gen2 account. You're prompted to enter the storage access key if your Azure account has no access to the Gen2 storage. After the access key is validated, the Data Lake Storage Gen2 account is auto-expanded.

Submit jobs to an HDInsight cluster with Data Lake Storage Gen2

Submit a job to an HDInsight cluster that uses Data Lake Storage Gen2. You're prompted to enter the storage access key if your Azure account has no write access to the Gen2 storage. After the access key is validated, the job will be submitted successfully.

Spark & Hive Tools for Visual Studio Code access key

Note

You can get the access key for the storage account from the Azure portal. For more information, see Manage storage account access keys.

Unlink a cluster

  1. From the menu bar, go to View > Command Palette, and then enter Spark / Hive: Unlink a Cluster.

  2. Select a cluster to unlink.

  3. See the OUTPUT view for verification.

Sign out

From the menu bar, go to View > Command Palette, and then enter Azure: Sign Out.

Known issues

ms-python versions 2020.5.78807 and later are not supported by this extension.

"Failed to connect to Jupyter notebook" is a known issue for ms-python versions 2020.5.78807 and later. To avoid this issue, use ms-python version 2020.4.76186.

Known issue