Use the Azure Data Lake tools for Visual Studio with the Hortonworks Sandbox

Azure Data Lake includes tools for working with generic Apache Hadoop clusters. This document provides the steps needed to use the Data Lake tools with the Hortonworks Sandbox running in a local virtual machine.

Using the Hortonworks Sandbox allows you to work with Hadoop locally in your development environment. After you have developed a solution and want to deploy it at scale, you can then move to an HDInsight cluster.

Prerequisites

Configure passwords for the sandbox

Make sure that the Hortonworks Sandbox is running. Then follow the steps in the Get started in the Hortonworks Sandbox document. These steps configure the password for the SSH root account and the Apache Ambari admin account. These passwords are used when you connect to the sandbox from Visual Studio.

Connect the tools to the sandbox

  1. Open Visual Studio, select View, and then select Server Explorer.

  2. From Server Explorer, right-click the HDInsight entry, and then select Connect to HDInsight Emulator.

    Screenshot of Server Explorer, with Connect to HDInsight Emulator highlighted

  3. From the Connect to HDInsight Emulator dialog box, enter the password that you configured for Ambari.

    Screenshot of dialog box, with the Ambari password text box highlighted

    Select Next to continue.

  4. Use the Password field to enter the password you configured for the root account. Leave the other fields at their default values.

    Screenshot of dialog box, with the root password text box highlighted

    Select Next to continue.

  5. Wait for validation of the services to finish. In some cases, validation may fail and prompt you to update the configuration. If validation fails, select Update, and wait for the configuration and verification of the services to finish.

    Screenshot of dialog box, with the Update button highlighted

    Note

    The update process uses Ambari to modify the Hortonworks Sandbox configuration to what is expected by the Data Lake tools for Visual Studio.

  6. After validation has finished, select Finish to complete configuration.

    Screenshot of dialog box, with the Finish button highlighted

    Note

    Depending on the speed of your development environment, and the amount of memory allocated to the virtual machine, it can take several minutes to configure and validate the services.

After following these steps, you have an HDInsight local cluster entry in Server Explorer, under the HDInsight section.

Write an Apache Hive query

Hive provides a SQL-like query language, HiveQL, for working with structured data. Use the following steps to run on-demand queries against the local cluster.

  1. In Server Explorer, right-click the entry for the local cluster that you added previously, and then select Write a Hive Query.

    Screenshot of Server Explorer, with Write a Hive Query highlighted

    A new query window appears. Here you can quickly write and submit a query to the local cluster.

  2. In the new query window, enter the following command:

     select count(*) from sample_08;
    

    To run the query, select Submit at the top of the window. Leave the other values (Batch and server name) at their defaults.

    Screenshot of the query window, with the Submit button highlighted

    You can also use the drop-down menu next to Submit to select Advanced. Advanced options allow you to provide additional options when you submit the job.

    Screenshot of the Submit Script dialog box for Hive

  3. After you submit the query, the job status appears. The job status displays information about the job as it is processed by Hadoop. Job State provides the status of the job. The state is updated periodically, or you can use the refresh icon to refresh it manually.

    Screenshot of the Job View dialog box, with Job State highlighted

    After the Job State changes to Finished, a directed acyclic graph (DAG) is displayed. This diagram describes the execution path that Tez determined while processing the Hive query. Tez is the default execution engine for Hive on the local cluster.

    Note

    Apache Tez is also the default when you are using Linux-based HDInsight clusters. It is not the default on Windows-based HDInsight; to use it there, you must add the line set hive.execution.engine = tez; to the beginning of your Hive query.

    Use the Job Output link to view the output. In this case, it is 823, the number of rows in the sample_08 table. You can view diagnostic information about the job by using the Job Log and Download YARN Log links.

  4. You can also run Hive jobs interactively by changing the Batch field to Interactive, and then selecting Execute.

    Screenshot with the Interactive option and Execute button highlighted

    An interactive query streams the output log generated during processing to the HiveServer2 Output window.

    Note

    This information is the same as what is available from the Job Log link after a job has finished.

    Screenshot of the output log
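If you want to go a step beyond the row count, the same query window accepts any HiveQL. As a sketch only: the column names below (description, salary) are assumptions based on the sample tables that typically ship with the Hortonworks Sandbox, so adjust them to match what Server Explorer shows for sample_08:

```sql
-- List the ten highest-paying job categories in the sample_08 table.
-- Column names are assumptions; verify them against the table schema
-- shown under Hive Databases in Server Explorer before running.
SELECT description, salary
FROM sample_08
ORDER BY salary DESC
LIMIT 10;
```

Submit this the same way as the count query; the output appears under the Job Output link once the Job State is Finished.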

Create a Hive project

You can also create a project that contains multiple Hive scripts. Use a project when you have related scripts or want to store scripts in a version control system.

  1. In Visual Studio, select File, New, and then Project.

  2. From the list of projects, expand Templates, expand Azure Data Lake, and then select HIVE (HDInsight). From the list of templates, select Hive Sample. Enter a name and location, and then select OK.

    Screenshot of the New Project window, with Azure Data Lake, HIVE, Hive Sample, and OK highlighted

The Hive Sample project contains two scripts, WebLogAnalysis.hql and SensorDataAnalysis.hql. You can submit these scripts by using the same Submit button at the top of the window.

Create an Apache Pig project

While Hive provides a SQL-like language for working with structured data, Pig works by performing transformations on data. Pig provides a language, Pig Latin, for developing a pipeline of transformations. To use Pig with the local cluster, follow these steps:

  1. Open Visual Studio, and select File, New, and then Project. From the list of projects, expand Templates, expand Azure Data Lake, and then select Pig (HDInsight). From the list of templates, select Pig Application. Enter a name and location, and then select OK.

    Screenshot of the New Project window, with Azure Data Lake, Pig, Pig Application, and OK highlighted

  2. Enter the following text as the contents of the script.pig file that was created with this project.

     -- Load the website log sample data, declaring a schema for each field
     a = LOAD '/demo/data/Website/Website-Logs' AS (
         log_id:int,
         ip_address:chararray,
         date:chararray,
         time:chararray,
         landing_page:chararray,
         source:chararray);
     -- Keep only entries with a log ID greater than 100
     b = FILTER a BY (log_id > 100);
     -- Group the remaining entries by client IP address
     c = GROUP b BY ip_address;
     -- Write the grouped results to the console
     DUMP c;
    

    While Pig uses a different language than Hive, you run jobs the same way for both: through the Submit button. Selecting the drop-down beside Submit displays an advanced submit dialog box for Pig.

    Screenshot of the Submit Script dialog box for Pig

  3. The job status and output are also displayed, the same as for a Hive query.

    Screenshot of a completed Pig job
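DUMP is convenient for interactive inspection, but a production pipeline usually writes its results to storage. As a sketch of the same pipeline with that change, using the same input path and schema assumed by the sample script above (the output path here is illustrative):

```pig
-- Count requests per IP address and store the result, instead of
-- dumping raw groups to the console. Input path and schema match the
-- sample script; the output path is a hypothetical example.
a = LOAD '/demo/data/Website/Website-Logs' AS (
    log_id:int,
    ip_address:chararray,
    date:chararray,
    time:chararray,
    landing_page:chararray,
    source:chararray);
b = GROUP a BY ip_address;
c = FOREACH b GENERATE group AS ip_address, COUNT(a) AS hits;
STORE c INTO '/demo/data/Website/hits-per-ip';
```

Submitting this script works exactly like the DUMP version; the difference is that results land in the cluster's file system rather than in the job output.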

View jobs

Data Lake tools also allow you to easily view information about jobs that have been run on Hadoop. Use the following steps to see the jobs that have been run on the local cluster.

  1. From Server Explorer, right-click the local cluster, and then select View Jobs. A list of jobs that have been submitted to the cluster is displayed.

    Screenshot of Server Explorer, with View Jobs highlighted

  2. From the list of jobs, select one to view the job details.

    Screenshot of the job browser, with a job highlighted

    The information displayed is similar to what you see after running a Hive or Pig query, including links to view the output and log information.

  3. You can also modify and resubmit the job from here.

View Hive databases

  1. In Server Explorer, expand the HDInsight local cluster entry, and then expand Hive Databases. The default and xademo databases on the local cluster are displayed. Expanding a database shows the tables within it.

    Screenshot of Server Explorer, with the databases expanded

  2. Expanding a table displays the columns for that table. To quickly view the data, right-click a table, and select View Top 100 Rows.

    Screenshot of Server Explorer, with a table expanded and View Top 100 Rows selected
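View Top 100 Rows is a convenience for what amounts to a simple LIMIT query. If you prefer, you can run the equivalent yourself from a Hive query window; the table name here is just an example from the sandbox's sample data:

```sql
-- Equivalent of the View Top 100 Rows context-menu action.
-- Note that without an ORDER BY clause, which rows are returned
-- is not guaranteed.
SELECT * FROM sample_08 LIMIT 100;
```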

Database and table properties

You can view the properties of a database or table. Selecting Properties displays details for the selected item in the Properties window, as shown in the following screenshot:

Screenshot of the Properties window

Create a table

To create a table, right-click a database, and then select Create Table.

Screenshot of Server Explorer, with Create Table highlighted

You can then create the table by using a form. At the bottom of the following screenshot, you can see the raw HiveQL that is used to create the table.

Screenshot of the form used to create a table
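The HiveQL emitted by the form follows standard CREATE TABLE syntax, so you can also create tables directly from a query window. A minimal hand-written sketch, with illustrative table and column names that are not taken from the sandbox:

```sql
-- Hypothetical example: a delimited-text table for web log entries.
-- Table name, columns, and delimiter are all illustrative choices.
CREATE TABLE IF NOT EXISTS web_log (
    log_id INT,
    ip_address STRING,
    landing_page STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

After the statement runs, the new table appears under the database in Server Explorer once you refresh it.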

Next steps