Get started with an Apache HBase example in HDInsight

Learn how to create an Apache HBase cluster in HDInsight, create HBase tables, and query the tables by using Apache Hive. For general HBase information, see HDInsight HBase overview.

Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.

Prerequisites

Before you begin this HBase example, you need an Azure subscription, an SSH client to connect to the cluster, and Curl for the REST API steps.

Create Apache HBase cluster

The following procedure uses an Azure Resource Manager template to create an HBase cluster and its dependent default Azure Storage account. To understand the parameters used in this procedure and in other cluster creation methods, see Create Linux-based Hadoop clusters in HDInsight. For more information on using Data Lake Storage Gen2, see Quickstart: Set up clusters in HDInsight.

  1. Click the following image to open the template in the Azure portal. The template is located in Azure QuickStart templates.

    Deploy to Azure

    Note

    Templates downloaded from the GitHub repository "azure-quickstart-templates" must be modified to fit the Azure China cloud environment. For example, replace some endpoints ("blob.core.windows.net" with "blob.core.chinacloudapi.cn", and "cloudapp.azure.com" with "chinacloudapp.cn"); change the allowed locations to "China North" and "China East"; and change the HDInsight Linux version to one supported by Azure China, 3.5.

  2. From the Custom deployment blade, enter the following values:

    • Subscription: Select the Azure subscription used to create the cluster.

    • Resource group: Create an Azure resource group or use an existing one.

    • Location: Specify the location of the resource group.

    • ClusterName: Enter a name for the HBase cluster.

    • Cluster login name and password: The default login name is admin.

    • SSH username and password: The default username is sshuser. You can rename it.

    Other parameters are optional.

    Each cluster has an Azure Storage account dependency. After you delete a cluster, the data is retained in the storage account. The default storage account name for the cluster is the cluster name with "store" appended. It is hardcoded in the variables section of the template.

  3. Select I agree to the terms and conditions stated above, and then click Purchase. It takes about 20 minutes to create a cluster.

Note

After an HBase cluster is deleted, you can create another HBase cluster by using the same default blob container. The new cluster picks up the HBase tables you created in the original cluster. To avoid inconsistencies, we recommend that you disable the HBase tables before you delete the cluster.

Create tables and insert data

You can use SSH to connect to HBase clusters and then use the Apache HBase shell to create HBase tables, insert data, and query data. For more information, see Use SSH with HDInsight.

For most people, data appears in a tabular format:

HDInsight HBase tabular data

In HBase, which is modeled on Google's Bigtable design, the same data looks like:

HDInsight HBase BigTable data
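
As a rough illustration of the difference between the two views, the sketch below models one HBase row as nested Python dictionaries (row key, then column family, then column qualifier). The `contacts` structure and the `flatten` helper are hypothetical names used only for this illustration; the sketch ignores cell timestamps and versions, which real HBase also stores.

```python
# Logical view of one HBase row: row key -> column family -> qualifier -> value.
# The values mirror the contact that is inserted with the HBase shell in this article.
contacts = {
    "1000": {
        "Personal": {"Name": "John Dole", "Phone": "1-425-000-0001"},
        "Office": {"Phone": "1-425-000-0002", "Address": "1111 San Gabriel Dr."},
    }
}


def flatten(table):
    """Return (row key, 'family:qualifier', value) tuples, sorted by key and column."""
    cells = []
    for row_key in sorted(table):
        for family in sorted(table[row_key]):
            for qualifier in sorted(table[row_key][family]):
                value = table[row_key][family][qualifier]
                cells.append((row_key, f"{family}:{qualifier}", value))
    return cells


for cell in flatten(contacts):
    print(cell)
```

Each tuple corresponds to one cell, which is roughly how a scan lists the row: one line per column rather than one line per record.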

To use the HBase shell

  1. From SSH, run the following HBase command:

    hbase shell
    
  2. Create an HBase table with two column families:

    create 'Contacts', 'Personal', 'Office'
    list
    
  3. Insert some data:

    put 'Contacts', '1000', 'Personal:Name', 'John Dole'
    put 'Contacts', '1000', 'Personal:Phone', '1-425-000-0001'
    put 'Contacts', '1000', 'Office:Phone', '1-425-000-0002'
    put 'Contacts', '1000', 'Office:Address', '1111 San Gabriel Dr.'
    scan 'Contacts'
    

    HDInsight Hadoop HBase shell

  4. Get a single row:

    get 'Contacts', '1000'
    

    You see the same results as with the scan command because the table has only one row.

    For more information about the HBase table schema, see Introduction to Apache HBase Schema Design. For more HBase commands, see Apache HBase reference guide.

  5. Exit the shell:

    exit
    

To bulk load data into the Contacts HBase table

HBase includes several methods of loading data into tables. For more information, see Bulk loading.

A sample data file can be found in a public blob container, wasb://hbasecontacts@hditutorialdata.blob.core.chinacloudapi.cn/contacts.txt. The content of the data file is:

8396    Calvin Raji      230-555-0191    230-555-0191    5415 San Gabriel Dr.
16600   Karen Wu         646-555-0113    230-555-0192    9265 La Paz
4324    Karl Xie         508-555-0163    230-555-0193    4912 La Vuelta
16891   Jonn Jackson     674-555-0110    230-555-0194    40 Ellis St.
3273    Miguel Miller    397-555-0155    230-555-0195    6696 Anchor Drive
3588    Osa Agbonile     592-555-0152    230-555-0196    1873 Lion Circle
10272   Julia Lee        870-555-0110    230-555-0197    3148 Rose Street
4868    Jose Hayes       599-555-0171    230-555-0198    793 Crawford Street
4761    Caleb Alexander  670-555-0141    230-555-0199    4775 Kentucky Dr.
16443   Terry Chander    998-555-0171    230-555-0200    771 Northridge Drive

You can optionally create a text file and upload the file to your own storage account. For instructions, see Upload data for Apache Hadoop jobs in HDInsight.

Note

This procedure uses the Contacts HBase table that you created in the previous procedure.

  1. From SSH, run the following command to transform the data file into StoreFiles and store them at the relative path specified by -Dimporttsv.bulk.output. If you are in the HBase shell, use the exit command to exit first.

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,Personal:Name,Personal:Phone,Office:Phone,Office:Address" -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" Contacts wasb://hbasecontacts@hditutorialdata.blob.core.chinacloudapi.cn/contacts.txt
    
  2. Run the following command to upload the data from /example/data/storeDataFileOutput to the HBase table:

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /example/data/storeDataFileOutput Contacts
    
  3. Open the HBase shell, and use the scan command to list the table content.
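
To make the role of -Dimporttsv.columns more concrete, the following sketch shows how one line of the data file lines up with the column list passed to ImportTsv. It assumes the fields are separated by tabs (the ImportTsv default separator), and the sample line is rebuilt by hand from the first record shown above. The `line_to_cells` helper is a hypothetical name for illustration only; it is not part of HBase.

```python
# Column list copied from the -Dimporttsv.columns option in step 1.
COLUMNS = "HBASE_ROW_KEY,Personal:Name,Personal:Phone,Office:Phone,Office:Address".split(",")


def line_to_cells(line, separator="\t"):
    """Split one data line and pair each field with its target column."""
    fields = line.rstrip("\n").split(separator)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row_key = fields[COLUMNS.index("HBASE_ROW_KEY")]
    cells = {col: val for col, val in zip(COLUMNS, fields) if col != "HBASE_ROW_KEY"}
    return row_key, cells


# First record of the sample file, assuming tab-separated fields.
sample = "8396\tCalvin Raji\t230-555-0191\t230-555-0191\t5415 San Gabriel Dr."
row_key, cells = line_to_cells(sample)
print(row_key)
for column, value in cells.items():
    print(column, "=", value)
```

The field mapped to HBASE_ROW_KEY becomes the row key, and every other field becomes one cell in the named column family and qualifier.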

Use Apache Hive to query Apache HBase

You can query data in HBase tables by using Apache Hive. In this section, you create a Hive table that maps to the HBase table and use it to query the data in your HBase table.

  1. Open PuTTY, or another SSH client, and connect to the cluster. See the instructions in the previous procedure.

  2. From the SSH session, use the following command to start Beeline:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -n admin
    

    For more information about Beeline, see Use Hive with Hadoop in HDInsight with Beeline.

  3. Run the following HiveQL script to create a Hive table that maps to the HBase table. Before you run this statement, make sure that you have created the Contacts sample table by using the HBase shell, as described earlier in this tutorial.

    CREATE EXTERNAL TABLE hbasecontacts(rowkey STRING, name STRING, homephone STRING, officephone STRING, officeaddress STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,Personal:Name,Personal:Phone,Office:Phone,Office:Address')
    TBLPROPERTIES ('hbase.table.name' = 'Contacts');
    
  4. Run the following HiveQL script to query the data in the HBase table:

    SELECT count(rowkey) FROM hbasecontacts;
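
The entries in hbase.columns.mapping are matched to the Hive columns by position, with :key standing for the HBase row key. The short sketch below pairs the two lists from the CREATE EXTERNAL TABLE statement in step 3; the `pair_columns` helper is a hypothetical name used only to illustrate the correspondence.

```python
# Hive column names and the mapping string, copied from the CREATE EXTERNAL TABLE statement.
hive_columns = ["rowkey", "name", "homephone", "officephone", "officeaddress"]
columns_mapping = ":key,Personal:Name,Personal:Phone,Office:Phone,Office:Address"


def pair_columns(hive_cols, mapping):
    """Pair each Hive column with the HBase column at the same position."""
    hbase_cols = mapping.split(",")
    if len(hbase_cols) != len(hive_cols):
        raise ValueError("the mapping needs exactly one entry per Hive column")
    return dict(zip(hive_cols, hbase_cols))


for hive_col, hbase_col in pair_columns(hive_columns, columns_mapping).items():
    print(f"{hive_col} -> {hbase_col}")
```

If the two lists differ in length or order, the Hive columns no longer read the intended HBase cells, so it is worth checking the pairing before creating the table.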
    

Use HBase REST APIs using Curl

The REST API is secured via basic authentication. Always make requests over Secure HTTP (HTTPS) to help ensure that your credentials are sent securely to the server.

  1. Use the following command to list the existing HBase tables:

    curl -u <UserName>:<Password> \
    -G https://<ClusterName>.azurehdinsight.cn/hbaserest/
    
  2. Use the following command to create a new HBase table with two column families:

    curl -u <UserName>:<Password> \
    -X PUT "https://<ClusterName>.azurehdinsight.cn/hbaserest/Contacts1/schema" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"@name\":\"Contacts1\",\"ColumnSchema\":[{\"name\":\"Personal\"},{\"name\":\"Office\"}]}" \
    -v
    

    The schema is provided in JSON format.

  3. Use the following command to insert some data:

    curl -u <UserName>:<Password> \
    -X PUT "https://<ClusterName>.azurehdinsight.cn/hbaserest/Contacts1/false-row-key" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"Row\":[{\"key\":\"MTAwMA==\",\"Cell\": [{\"column\":\"UGVyc29uYWw6TmFtZQ==\", \"$\":\"Sm9obiBEb2xl\"}]}]}" \
    -v
    

    You must base64-encode the values specified in the -d switch. In this example:

    • MTAwMA==: 1000

    • UGVyc29uYWw6TmFtZQ==: Personal:Name

    • Sm9obiBEb2xl: John Dole

    false-row-key allows you to insert multiple (batched) values.

  4. Use the following command to get a row:

    curl -u <UserName>:<Password> \
    -X GET "https://<ClusterName>.azurehdinsight.cn/hbaserest/Contacts1/1000" \
    -H "Accept: application/json" \
    -v
    

For more information about HBase REST, see Apache HBase Reference Guide.
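
The base64 values used in the insert command can be produced with a few lines of Python. This is a minimal sketch that assumes UTF-8 text; the `b64` and `build_row_payload` helpers are hypothetical names introduced for illustration, and the resulting JSON has the same shape as the -d argument shown in step 3.

```python
import base64
import json


def b64(text):
    """Base64-encode a UTF-8 string and return the result as ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_row_payload(row_key, cells):
    """Build the JSON body for a row insert; cells maps 'family:qualifier' to a value."""
    row = {
        "key": b64(row_key),
        "Cell": [{"column": b64(col), "$": b64(val)} for col, val in cells.items()],
    }
    return json.dumps({"Row": [row]})


payload = build_row_payload("1000", {"Personal:Name": "John Dole"})
print(payload)
```

Adding more entries to the `cells` dictionary yields more elements in the Cell array, which is one way to send several values for a row in a single request.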

Note

Thrift is not supported by HBase in HDInsight.

When you use Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. You must also use the cluster name as part of the Uniform Resource Identifier (URI) used to send the requests to the server:

   curl -u <UserName>:<Password> \
   -G https://<ClusterName>.azurehdinsight.cn/templeton/v1/status

You should receive a response similar to the following:

   {"status":"ok","version":"v1"}
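
The same request can be prepared programmatically. The sketch below only builds the URI and the basic authentication header and then checks a response body; it does not send anything over the network. The cluster name `mycluster`, the password, and the helper names `status_request` and `is_healthy` are placeholders assumed for illustration.

```python
import base64
import json


def status_request(cluster_name, user, password):
    """Return the WebHCat status URI and a basic authentication header."""
    url = f"https://{cluster_name}.azurehdinsight.cn/templeton/v1/status"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return url, {"Authorization": f"Basic {token}"}


def is_healthy(body):
    """Return True when the JSON response body reports status 'ok'."""
    return json.loads(body).get("status") == "ok"


# Placeholder values; substitute your own cluster name and credentials.
url, headers = status_request("mycluster", "admin", "example-password")
print(url)
print(is_healthy('{"status":"ok","version":"v1"}'))
```

Because basic authentication only encodes the credentials rather than encrypting them, the URI uses HTTPS so that the header is protected in transit.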

Check cluster status

HBase in HDInsight ships with a Web UI for monitoring clusters. Using the Web UI, you can request statistics or information about regions.

To access the HBase Master UI

  1. Sign in to the Ambari Web UI at https://<Clustername>.azurehdinsight.cn.

  2. Click HBase from the left menu.

  3. Click Quick links at the top of the page, point to the active ZooKeeper node link, and then click HBase Master UI. The UI opens in another browser tab:

    HDInsight HBase HMaster UI

    The HBase Master UI contains the following sections:

    • region servers
    • backup masters
    • tables
    • tasks
    • software attributes

Delete the cluster

To avoid inconsistencies, we recommend that you disable the HBase tables before you delete the cluster.

Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.

Troubleshoot

If you run into issues with creating HDInsight clusters, see access control requirements.

Next steps

In this article, you learned how to create an Apache HBase cluster and how to create tables and view the data in those tables from the HBase shell. You also learned how to use a Hive query on data in HBase tables and how to use the HBase REST API with Curl to create an HBase table and retrieve data from the table.

To learn more, see:

  • HDInsight HBase overview: Apache HBase is an open-source Apache NoSQL database built on Apache Hadoop that provides random access and strong consistency for large amounts of unstructured and semistructured data.