Quickstart: Create an Apache Hadoop cluster in Azure HDInsight using the Azure portal

In this article, you learn how to create Apache Hadoop clusters in HDInsight using the Azure portal, and then run Apache Hive jobs in HDInsight. Most Hadoop jobs are batch jobs: you create a cluster, run some jobs, and then delete the cluster. In this article, you perform all three tasks. For in-depth explanations of available configurations, see Set up clusters in HDInsight. For more information about using the portal to create clusters, see Create clusters in the portal.

In this quickstart, you use the Azure portal to create an HDInsight Hadoop cluster. You can also create a cluster using an Azure Resource Manager template.

Currently, HDInsight comes with six different cluster types. Each cluster type supports a different set of components. All cluster types support Hive. For a list of supported components in HDInsight, see What's new in the Apache Hadoop cluster versions provided by HDInsight?

If you don't have an Azure subscription, create a trial account before you begin.

Create an Apache Hadoop cluster

In this section, you create a Hadoop cluster in HDInsight using the Azure portal.

  1. Sign in to the Azure portal.

  2. From the top menu, select + Create a resource.

    Screenshot: Create a resource HDInsight cluster

  3. Select Analytics > Azure HDInsight to go to the Create HDInsight cluster page.

  4. From the Basics tab, provide the following information:

    Property: Description
    Subscription: From the drop-down list, select the Azure subscription to use for the cluster.
    Resource group: From the drop-down list, select an existing resource group, or select Create new.
    Cluster name: Enter a globally unique name. The name can be up to 59 characters long and can contain letters, numbers, and hyphens. The first and last characters of the name can't be hyphens.
    Region: From the drop-down list, select the region where the cluster is created. Choose a location close to you for better performance.
    Cluster type: Select Select cluster type. Then select Hadoop as the cluster type.
    Version: From the drop-down list, select a version. Use the default version if you don't know which one to choose.
    Cluster login username and password: The default login name is admin. The password must be at least 10 characters long and must contain at least one digit, one uppercase letter, one lowercase letter, and one non-alphanumeric character (excluding the characters ' " `). Don't use common passwords such as "Pass@word1".
    Secure Shell (SSH) username: The default username is sshuser. You can provide another name for the SSH username.
    Use cluster login password for SSH: Select this check box to use the same password for the SSH user as the one you provided for the cluster login user.

    Screenshot: HDInsight Linux get started, provide cluster basic values

    Select Next: Storage >> at the bottom of the page to advance to the storage settings.

  5. From the Storage tab, provide the following values:

    Property: Description
    Primary storage type: Use the default value Azure Storage.
    Selection method: Use the default value Select from list.
    Primary storage account: Use the drop-down list to select an existing storage account, or select Create new. If you create a new account, the name must be between 3 and 24 characters long and can contain only numbers and lowercase letters.
    Container: Use the autopopulated value.

    Screenshot: HDInsight Linux get started, provide cluster storage values

    Select the Review + create tab.

  6. From the Review + create tab, verify the values you selected in the earlier steps.

    Screenshot: HDInsight Linux get started, cluster summary

  7. Select Create. It takes about 20 minutes to create a cluster.
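The naming and password rules listed in the Basics and Storage tables can be checked locally before you fill in the portal form. The following is a minimal Python sketch, not part of the portal or any Azure SDK. The function names are hypothetical, "letters" is assumed to mean ASCII letters, and the excluded characters ' " ` are assumed simply not to count toward the non-alphanumeric requirement. Global uniqueness of the cluster name can't be checked offline, and the portal remains the authority on what it accepts.

```python
import re

# Cluster name: 1 to 59 characters, letters, digits, and hyphens,
# with no hyphen as the first or last character (ASCII letters assumed).
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,57}[A-Za-z0-9])?")

# Storage account name: 3 to 24 characters, digits and lowercase letters only.
STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")

# Assumption: these characters do not satisfy the non-alphanumeric requirement.
EXCLUDED_SPECIALS = set("'\"`")

# The one common password the article names; extend this set as needed.
COMMON_PASSWORDS = {"Pass@word1"}


def is_valid_cluster_name(name: str) -> bool:
    """Check the documented length and character rules for a cluster name."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def is_valid_storage_account_name(name: str) -> bool:
    """Check the documented rules for a new storage account name."""
    return STORAGE_NAME_RE.fullmatch(name) is not None


def is_valid_password(password: str) -> bool:
    """Check the documented cluster login password rules."""
    if len(password) < 10 or password in COMMON_PASSWORDS:
        return False
    has_digit = any(c.isdigit() for c in password)
    has_upper = any(c.isupper() for c in password)
    has_lower = any(c.islower() for c in password)
    has_special = any(
        not c.isalnum() and c not in EXCLUDED_SPECIALS for c in password
    )
    return has_digit and has_upper and has_lower and has_special
```

For example, `is_valid_cluster_name("my-hadoop-01")` passes, while `is_valid_cluster_name("-bad")` fails because the name starts with a hyphen.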

Once the cluster is created, you see the cluster overview page in the Azure portal.

Screenshot: HDInsight Linux get started, cluster settings

Run Apache Hive queries

Apache Hive is the most popular component used in HDInsight. There are many ways to run Hive jobs in HDInsight. In this quickstart, you use the Ambari Hive view from the portal. For other methods of submitting Hive jobs, see Use Hive in HDInsight.

Note

Apache Hive View is not available in HDInsight 4.0.

  1. To open Ambari, select Cluster Dashboard on the cluster overview page shown in the previous screenshot. You can also browse to https://ClusterName.azurehdinsight.cn, where ClusterName is the name of the cluster you created in the previous section.

    Screenshot: HDInsight Linux get started, cluster dashboard

  2. Enter the Hadoop username and password that you specified while creating the cluster. The default username is admin.

  3. Open Hive View as shown in the following screenshot:

    Screenshot: Selecting Hive View from Ambari

  4. In the QUERY tab, paste the following HiveQL statement into the worksheet:

    SHOW TABLES;
    

    Screenshot: HDInsight Hive views

  5. Select Execute. A RESULTS tab appears beneath the QUERY tab and displays information about the job.

    Once the query has finished, the QUERY tab displays the results of the operation. You should see one table called hivesampletable. This sample Hive table comes with all HDInsight clusters.

    Screenshot: HDInsight Hive view results

  6. Repeat step 4 and step 5 to run the following query:

    SELECT * FROM hivesampletable;
    
  7. You can also save the results of the query. Select the menu button on the right, and specify whether you want to download the results as a CSV file or store them in the storage account associated with the cluster.

    Screenshot: Save result of Hive query
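The cluster dashboard address mentioned in step 1 follows the pattern https://ClusterName.azurehdinsight.cn. As a small illustration, the Python sketch below builds that address from a cluster name. The helper name `ambari_url` and the `domain` parameter are hypothetical; the default domain is the one this article uses, and other Azure environments may use a different one.

```python
def ambari_url(cluster_name: str, domain: str = "azurehdinsight.cn") -> str:
    """Build the cluster dashboard (Ambari) URL for a given cluster name.

    The default domain is taken from this article; treat any other
    value as an assumption to verify against your own environment.
    """
    return f"https://{cluster_name}.{domain}"


if __name__ == "__main__":
    # "mycluster" is a placeholder, not a real cluster.
    print(ambari_url("mycluster"))
```

Opening the resulting address in a browser prompts for the cluster login username and password, as described in step 2.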

After you have completed a Hive job, you can export the results to an Azure SQL database or a SQL Server database. You can also visualize the results using Excel. For more information about using Hive in HDInsight, see Use Apache Hive and HiveQL with Apache Hadoop in HDInsight to analyze a sample Apache log4j file.

Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster even when it is not in use. Because the charges for the cluster are many times higher than the charges for storage, it makes economic sense to delete clusters when they are not in use.

Note

If you are immediately proceeding to the next article to learn how to run ETL operations using Hadoop on HDInsight, you may want to keep the cluster running. Otherwise, you have to create a Hadoop cluster again for that tutorial. However, if you are not going through the next article right away, delete the cluster now.

To delete the cluster and/or the default storage account

  1. Go back to the browser tab where you have the Azure portal open. You should be on the cluster overview page. If you only want to delete the cluster but retain the default storage account, select Delete.

    Screenshot: HDInsight delete cluster

  2. If you want to delete the cluster as well as the default storage account, select the resource group name (highlighted in the previous screenshot) to open the resource group page.

  3. Select Delete resource group to delete the resource group, which contains the cluster and the default storage account. Note that deleting the resource group also deletes the storage account. If you want to keep the storage account, delete only the cluster.

Next steps

In this quickstart, you learned how to create a Linux-based HDInsight cluster using the Azure portal, and how to perform basic Hive queries. In the next article, you learn how to perform an extract, transform, and load (ETL) operation using Hadoop on HDInsight.

Extract, transform, and load data using Interactive Query on HDInsight

[1]: ../HDInsight/apache-hadoop-visual-studio-tools-get-started.md