What is Apache HBase in Azure HDInsight

Apache HBase is an open-source NoSQL database that is built on Apache Hadoop and modeled after Google BigTable. HBase provides random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families.

From the user's perspective, HBase is similar to a database. Data is stored in the rows and columns of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before they are used. The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
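The row, column family, and column layout described above can be pictured as a nested map. The following is a minimal in-memory sketch in Python, not the HBase API: `ToyTable`, its methods, and the sample row keys and values are all hypothetical names chosen for illustration. It assumes, as in HBase, that column families are declared when the table is created, while individual columns and value types are not.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (illustration only, not the HBase API)."""

    def __init__(self, column_families):
        # Assumption: column families are declared up front, columns are not.
        self.column_families = set(column_families)
        # row key -> column family -> column qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value):
        """Store a value under 'family:qualifier'; the qualifier needs no prior declaration."""
        family, qualifier = column.split(":", 1)
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key):
        """Return the row as {family: {qualifier: value}}, or {} if the row is absent."""
        if row_key not in self.rows:
            return {}
        return {family: dict(cols) for family, cols in self.rows[row_key].items()}


table = ToyTable(column_families=["Personal", "Office"])
table.put("1000", "Personal:Name", "John Dole")
table.put("1000", "Office:Phone", "1-425-000-0001")
# A new column, holding a different value type, appears on write with no schema change.
table.put("1001", "Personal:Nickname", b"\x01\x02")
print(table.get("1000"))
```

Row "1001" has a column that row "1000" lacks, and its value is raw bytes rather than a string. That is the sense in which the table is schemaless.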

How is Apache HBase implemented in Azure HDInsight?

HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Storage, which provides low latency and increased elasticity in performance and cost choices. This enables customers to build interactive websites that work with large datasets, to build services that store sensor and telemetry data from millions of endpoints, and to analyze this data with Hadoop jobs. HBase and Hadoop are good starting points for big data projects in Azure; in particular, they can enable real-time applications to work with large datasets.

The HDInsight implementation leverages the scale-out architecture of HBase to provide automatic sharding of tables, strong consistency for reads and writes, and automatic failover. Performance is enhanced by in-memory caching for reads and high-throughput streaming for writes. HBase clusters can be created inside a virtual network. For details, see Create HDInsight clusters on Azure Virtual Network.
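To make "automatic sharding" concrete: HBase splits a table into regions, each holding a contiguous sorted range of row keys, and routes every read or write to the region whose range contains the key. The Python sketch below illustrates only that lookup idea. The split keys, the server names, and the functions `region_index` and `server_for` are hypothetical; a real cluster chooses split points and region placement on its own.

```python
from bisect import bisect_right

# Hypothetical split points: each region holds a contiguous, sorted range of row keys.
SPLIT_KEYS = ["g", "n", "t"]
# Hypothetical placement: one server name per region (len(SPLIT_KEYS) + 1 regions).
REGION_SERVERS = ["rs0", "rs1", "rs2", "rs3"]


def region_index(row_key, split_keys=SPLIT_KEYS):
    """Return the index of the region whose key range contains row_key.

    Region i covers [split_keys[i-1], split_keys[i]): the start key is
    inclusive and the end key is exclusive. The first region has no lower
    bound and the last region has no upper bound.
    """
    return bisect_right(split_keys, row_key)


def server_for(row_key):
    """Return the (hypothetical) server that hosts the region for row_key."""
    return REGION_SERVERS[region_index(row_key)]


for key in ["apple", "g", "melon", "zebra"]:
    print(key, "->", server_for(key))
```

Because a key belongs to exactly one region, each row has a single owner at any time, which is one way to see how sharding and strongly consistent reads and writes can coexist.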

How is data managed in HDInsight HBase?

Data can be managed in HBase by using the create, get, put, and scan commands from the HBase shell. Data is written to the database by using put and read by using get. The scan command is used to obtain data from multiple rows in a table. Data can also be managed by using the HBase C# API, which provides a client library on top of the HBase REST API. An HBase database can also be queried by using Apache Hive. For an introduction to these programming models, see Get started using Apache HBase with Apache Hadoop in HDInsight. Coprocessors are also available, which allow data to be processed on the nodes that host the database.
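Client libraries built on the HBase REST API ultimately send row data as an encoded document. The Python sketch below builds what such a request body might look like for a single-row write. It assumes the JSON representation in which row keys, column names, and values are Base64-encoded and the cell value is carried under the `"$"` field; check that shape against the HBase REST documentation for your version before relying on it. The helpers `b64` and `build_put_body` and the sample data are hypothetical, and no network request is made.

```python
import base64
import json


def b64(text):
    """Base64-encode a UTF-8 string and return the result as ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_put_body(row_key, cells):
    """Build a JSON body for writing one row.

    `cells` maps 'family:qualifier' column names to string values.
    Assumed shape: {"Row": [{"key": ..., "Cell": [{"column": ..., "$": ...}]}]}
    """
    return json.dumps({
        "Row": [{
            "key": b64(row_key),
            "Cell": [
                {"column": b64(column), "$": b64(value)}
                for column, value in cells.items()
            ],
        }]
    })


body = build_put_body("1000", {"Personal:Name": "John Dole"})
print(body)
```

Keeping payload construction separate from the HTTP call makes the encoding step easy to inspect and test without a running cluster.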


Thrift is not supported by HBase in HDInsight.

Scenarios: Use cases for Apache HBase

The canonical use case for which BigTable (and by extension, HBase) was created is web search. Search engines build indexes that map terms to the web pages that contain them. But HBase is suitable for many other use cases, several of which are itemized in this section.

  • Key-value store

    HBase can be used as a key-value store, and it is suitable for managing message systems. Facebook uses HBase for its messaging system, and HBase is well suited to storing and managing Internet communications. WebTable uses HBase to search for and manage tables that are extracted from webpages.

  • Sensor data

    HBase is useful for capturing data that is collected incrementally from various sources. This includes social analytics, time series, keeping interactive dashboards up to date with trends and counters, and managing audit log systems. Examples include the Bloomberg trader terminal and the Open Time Series Database (OpenTSDB), which stores and provides access to metrics collected about the health of server systems.

  • Real-time query

    Apache Phoenix is a SQL query engine for Apache HBase. It is accessed as a JDBC driver, and it enables querying and managing HBase tables by using SQL.

  • HBase as a platform

    Applications can run on top of HBase by using it as a datastore. Examples include Phoenix, OpenTSDB, Kiji, and Titan. Applications can also integrate with HBase. Examples include Apache Hive, Apache Pig, Solr, Apache Storm, Apache Flume, Apache Impala, Apache Spark, Ganglia, and Apache Drill.

Next steps