Tutorial: Use Apache HBase in Azure HDInsight

This tutorial demonstrates how to create an Apache HBase cluster in Azure HDInsight, create HBase tables, and query tables by using Apache Hive. For general HBase information, see HDInsight HBase overview.

In this tutorial, you learn how to:

  • Create an Apache HBase cluster
  • Create HBase tables and insert data
  • Use Apache Hive to query Apache HBase
  • Use HBase REST APIs with curl
  • Check cluster status

Prerequisites

Create an Apache HBase cluster

The following procedure uses an Azure Resource Manager template to create an HBase cluster. The template also creates the dependent default Azure Storage account. To understand the parameters used in the procedure and other cluster creation methods, see Create Linux-based Hadoop clusters in HDInsight. (An Azure CLI alternative is sketched after the steps below.)

  1. Select the following image to open the template in the Azure portal. The template is located in Azure Quickstart Templates.

    Deploy to Azure

  2. From the Custom deployment dialog, enter the following values:

    Property | Description
    Subscription | Select the Azure subscription that is used to create the cluster.
    Resource group | Create a resource group, or use an existing one.
    Location | Specify the location of the resource group.
    ClusterName | Enter a name for the HBase cluster.
    Cluster login name and password | The default login name is admin.
    SSH username and password | The default username is sshuser.

    Other parameters are optional.

    Each cluster has an Azure Storage account dependency. After you delete a cluster, the data stays in the storage account. The cluster's default storage account name is the cluster name with "store" appended. It's hardcoded in the template variables section.

  3. Select I agree to the terms and conditions stated above, and then select Purchase. It takes about 20 minutes to create a cluster.
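
If you prefer the command line, the same Resource Manager template can be deployed with the Azure CLI. The following is a minimal sketch; the template URI and parameter names are placeholders, so check the quickstart template you deploy for the exact parameter names it expects.

    # If you work against Azure China (the .cn endpoints used in this article),
    # switch clouds first: az cloud set --name AzureChinaCloud
    az group create --name MyHBaseRG --location chinanorth2
    az deployment group create \
        --resource-group MyHBaseRG \
        --template-uri <template-uri> \
        --parameters clusterName=<cluster-name> \
                     clusterLoginPassword='<password>' \
                     sshPassword='<password>'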

Note

After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob container. The new cluster picks up the HBase tables you created in the original cluster. To avoid inconsistencies, we recommend that you disable the HBase tables before you delete the cluster.
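
A minimal way to disable a table is from the HBase shell over an SSH connection, using the Contacts table created later in this tutorial as an example:

    hbase shell
    disable 'Contacts'
    exit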

Create tables and insert data

You can use SSH to connect to HBase clusters and then use Apache HBase Shell to create HBase tables, insert data, and query data.

For most people, data appears in a tabular format:

HDInsight HBase tabular data

In HBase (an implementation of Cloud BigTable), the same data looks like:

HDInsight HBase BigTable data
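
In case the figures above don't render, the following is a rough, illustrative sketch of both shapes for a single contact record (the values match the data you insert later in this article):

    Tabular view:
      ID    Name       Personal phone   Office phone     Office address
      1000  John Dole  1-425-000-0001   1-425-000-0002   1111 San Gabriel Dr.

    HBase view (one cell per row key and column family:qualifier):
      1000  Personal:Name   = John Dole
      1000  Personal:Phone  = 1-425-000-0001
      1000  Office:Phone    = 1-425-000-0002
      1000  Office:Address  = 1111 San Gabriel Dr.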

To use the HBase shell

  1. Use the ssh command to connect to your HBase cluster. Edit the following command by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. Use the hbase shell command to start the HBase interactive shell. Enter the following command in your SSH connection:

    hbase shell
    
  3. Use the create command to create an HBase table with two column families. The table and column names are case-sensitive. Enter the following command:

    create 'Contacts', 'Personal', 'Office'
    
  4. Use the list command to list all tables in HBase. Enter the following command:

    list
    
  5. Use the put command to insert values at a specified column in a specified row in a particular table. Enter the following commands:

    put 'Contacts', '1000', 'Personal:Name', 'John Dole'
    put 'Contacts', '1000', 'Personal:Phone', '1-425-000-0001'
    put 'Contacts', '1000', 'Office:Phone', '1-425-000-0002'
    put 'Contacts', '1000', 'Office:Address', '1111 San Gabriel Dr.'
    
  6. Use the scan command to scan and return the Contacts table data. Enter the following command:

    scan 'Contacts'
    

    HDInsight Hadoop HBase shell

  7. Use the get command to fetch the contents of a row. Enter the following command:

    get 'Contacts', '1000'
    

    You see results similar to those from the scan command because there is only one row.

    For more information about the HBase table schema, see Introduction to Apache HBase Schema Design. For more HBase commands, see Apache HBase reference guide.

  8. Use the exit command to stop the HBase interactive shell. Enter the following command:

    exit
    

To bulk load data into the Contacts HBase table

HBase includes several methods of loading data into tables. For more information, see Bulk loading.

A sample data file can be found in a public blob container, wasb://hbasecontacts@hditutorialdata.blob.core.chinacloudapi.cn/contacts.txt. The content of the data file is:

8396    Calvin Raji      230-555-0191    230-555-0191    5415 San Gabriel Dr.
16600   Karen Wu         646-555-0113    230-555-0192    9265 La Paz
4324    Karl Xie         508-555-0163    230-555-0193    4912 La Vuelta
16891   Jonn Jackson     674-555-0110    230-555-0194    40 Ellis St.
3273    Miguel Miller    397-555-0155    230-555-0195    6696 Anchor Drive
3588    Osa Agbonile     592-555-0152    230-555-0196    1873 Lion Circle
10272   Julia Lee        870-555-0110    230-555-0197    3148 Rose Street
4868    Jose Hayes       599-555-0171    230-555-0198    793 Crawford Street
4761    Caleb Alexander  670-555-0141    230-555-0199    4775 Kentucky Dr.
16443   Terry Chander    998-555-0171    230-555-0200    771 Northridge Drive
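
Because the container is public, you can also view the file directly from the cluster over an SSH connection; WASB paths work with the standard HDFS CLI. For example:

    hdfs dfs -cat wasb://hbasecontacts@hditutorialdata.blob.core.chinacloudapi.cn/contacts.txt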

You can optionally create a text file and upload the file to your own storage account. For instructions, see Upload data for Apache Hadoop jobs in HDInsight.
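
As a rough sketch, an upload with the Azure CLI might look like the following; the storage account name, container name, and key are placeholders for your own values:

    az storage blob upload \
        --account-name <your-storage-account> \
        --container-name <your-container> \
        --name contacts.txt \
        --file ./contacts.txt \
        --account-key <your-storage-account-key>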

This procedure uses the Contacts HBase table you created in the last procedure.

  1. From your open SSH connection, run the following command to transform the data file into StoreFiles and store them at a relative path specified by -Dimporttsv.bulk.output:

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,Personal:Name,Personal:Phone,Office:Phone,Office:Address" -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" Contacts wasb://hbasecontacts@hditutorialdata.blob.core.chinacloudapi.cn/contacts.txt
    
  2. Run the following command to upload the data from /example/data/storeDataFileOutput to the HBase table:

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /example/data/storeDataFileOutput Contacts
    
  3. You can open the HBase shell and use the scan command to list the table contents. A quick way to verify the load is sketched below.
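
    For example, you can pipe the scan into a non-interactive HBase shell session instead of typing it at the hbase> prompt; the rows loaded from contacts.txt should appear in the output:

    echo "scan 'Contacts'" | hbase shell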

Use Apache Hive to query Apache HBase

You can query data in HBase tables by using Apache Hive. In this section, you create a Hive table that maps to the HBase table and use it to query the data in your HBase table.

  1. From your open SSH connection, use the following command to start Beeline:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -n admin
    

    For more information about Beeline, see Use Hive with Hadoop in HDInsight with Beeline.

  2. Run the following HiveQL script to create a Hive table that maps to the HBase table. Make sure that you have created the sample table referenced earlier in this article by using the HBase shell before you run this statement.

    CREATE EXTERNAL TABLE hbasecontacts(rowkey STRING, name STRING, homephone STRING, officephone STRING, officeaddress STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,Personal:Name,Personal:Phone,Office:Phone,Office:Address')
    TBLPROPERTIES ('hbase.table.name' = 'Contacts');
    
  3. Run the following HiveQL script to query the data in the HBase table:

    SELECT count(rowkey) AS rk_count FROM hbasecontacts;
    
  4. To exit Beeline, use !exit.

  5. To exit your SSH connection, use exit.

Use HBase REST APIs with curl

The REST API is secured via basic authentication. You should always make requests by using Secure HTTP (HTTPS) to help ensure that your credentials are securely sent to the server.

  1. Set environment variables for ease of use. Edit the following commands by replacing MYPASSWORD with the cluster login password and MYCLUSTERNAME with the name of your HBase cluster. Then enter the commands.

    export password='MYPASSWORD'
    export clustername=MYCLUSTERNAME
    
  2. Use the following command to list the existing HBase tables:

    curl -u admin:$password \
    -G https://$clustername.azurehdinsight.cn/hbaserest/
    
  3. Use the following command to create a new HBase table with two column families:

    curl -u admin:$password \
    -X PUT "https://$clustername.azurehdinsight.cn/hbaserest/Contacts1/schema" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"@name\":\"Contact1\",\"ColumnSchema\":[{\"name\":\"Personal\"},{\"name\":\"Office\"}]}" \
    -v
    

    The schema is provided in JSON format.

  4. Use the following command to insert some data:

    curl -u admin:$password \
    -X PUT "https://$clustername.azurehdinsight.cn/hbaserest/Contacts1/false-row-key" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"Row\":[{\"key\":\"MTAwMA==\",\"Cell\": [{\"column\":\"UGVyc29uYWw6TmFtZQ==\", \"$\":\"Sm9obiBEb2xl\"}]}]}" \
    -v
    

    Base64 encode the values specified in the -d switch. In this example:

    • MTAwMA==: 1000

    • UGVyc29uYWw6TmFtZQ==: Personal:Name

    • Sm9obiBEb2xl: John Dole

      false-row-key allows you to insert multiple (batched) values. (A batched example is sketched after these steps.)

  5. Use the following command to get a row:

    curl -u admin:$password \
    GET "https://$clustername.azurehdinsight.cn/hbaserest/Contacts1/1000" \
    -H "Accept: application/json" \
    -v
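
As noted in step 4, the false-row-key endpoint accepts multiple rows in a single request. The following is a minimal sketch of a batched insert; the second row (key 2000 with Personal:Name set to Jane Dole, Base64-encoded as MjAwMA== and SmFuZSBEb2xl) is a made-up example:

    curl -u admin:$password \
    -X PUT "https://$clustername.azurehdinsight.cn/hbaserest/Contacts1/false-row-key" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"Row\":[{\"key\":\"MTAwMA==\",\"Cell\":[{\"column\":\"UGVyc29uYWw6TmFtZQ==\",\"$\":\"Sm9obiBEb2xl\"}]},{\"key\":\"MjAwMA==\",\"Cell\":[{\"column\":\"UGVyc29uYWw6TmFtZQ==\",\"$\":\"SmFuZSBEb2xl\"}]}]}" \
    -v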
    

For more information about HBase REST, see the Apache HBase Reference Guide.

Note

Thrift is not supported by HBase in HDInsight.

When you use curl or any other REST communication with WebHCat, you must authenticate the requests by providing the username and password for the HDInsight cluster administrator. You must also use the cluster name as part of the Uniform Resource Identifier (URI) used to send the requests to the server:

   curl -u <UserName>:<Password> \
   -G https://<ClusterName>.azurehdinsight.cn/templeton/v1/status

You should receive a response similar to the following:

   {"status":"ok","version":"v1"}

Check cluster status

HBase in HDInsight ships with a Web UI for monitoring clusters. Using the Web UI, you can request statistics or information about regions.

To access the HBase Master UI

  1. Sign in to the Ambari Web UI at https://CLUSTERNAME.azurehdinsight.cn, where CLUSTERNAME is the name of your HBase cluster.

  2. Select HBase from the left menu.

  3. Select Quick links at the top of the page, point to the active ZooKeeper node link, and then select HBase Master UI. The UI opens in another browser tab:

    HDInsight HBase HMaster UI

    The HBase Master UI contains the following sections:

    • region servers
    • backup masters
    • tables
    • tasks
    • software attributes
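
If you prefer the command line, Ambari also exposes a REST API through the same cluster endpoint. A minimal sketch, assuming the standard Ambari REST path and reusing the environment variables set in the curl section (the HBase service state appears under ServiceInfo in the JSON response):

    curl -u admin:$password \
    -G "https://$clustername.azurehdinsight.cn/api/v1/clusters/$clustername/services/HBASE"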

Clean up resources

To avoid inconsistencies, we recommend that you disable the HBase tables before you delete the cluster. You can use the HBase command disable 'Contacts'. If you're not going to continue to use this application, delete the HBase cluster that you created with the following steps:

  1. Sign in to the Azure portal.
  2. In the Search box at the top, type HDInsight.
  3. Select HDInsight clusters under Services.
  4. In the list of HDInsight clusters that appears, click the ... next to the cluster that you created for this tutorial.
  5. Click Delete. Click Yes.

Next steps

In this tutorial, you learned how to create an Apache HBase cluster and how to create tables and view the data in those tables from the HBase shell. You also learned how to use a Hive query on data in HBase tables, and how to use the HBase REST APIs with curl to create an HBase table and retrieve data from the table. To learn more, see: