Apache Phoenix in Azure HDInsight

Apache Phoenix is an open source, massively parallel relational database layer built on Apache HBase. Phoenix allows you to use SQL-like queries over HBase. Phoenix uses JDBC drivers underneath to enable users to create, delete, and alter SQL tables, indexes, views, and sequences, and to upsert rows individually and in bulk. Phoenix uses NoSQL-native compilation rather than MapReduce to compile queries, enabling the creation of low-latency applications on top of HBase. Phoenix adds coprocessors to support running client-supplied code in the address space of the server, executing the code colocated with the data. This approach minimizes client/server data transfer.
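
As a minimal sketch of this SQL-like syntax (the table, columns, and values here are hypothetical), the same statements cover table creation, single-row upserts, and reads:

CREATE TABLE IF NOT EXISTS book_orders (
    orderid VARCHAR PRIMARY KEY,   -- maps to the HBase row key
    bookname VARCHAR,
    quantity INTEGER);
UPSERT INTO book_orders VALUES ('o-100', 'Sample Book', 2);   -- inserts or updates a single row
SELECT bookname, quantity FROM book_orders WHERE orderid = 'o-100';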

Apache Phoenix opens up big data queries to non-developers, who can use SQL-like syntax rather than programming. Phoenix is highly optimized for HBase, unlike other tools such as Apache Hive and Apache Spark SQL. The benefit to developers is writing highly performant queries with much less code.

When you submit a SQL query, Phoenix compiles the query into HBase native calls and runs the scan (or plan) in parallel for optimization. This layer of abstraction frees the developer from writing MapReduce jobs, so they can focus instead on the business logic and workflow of their application around Phoenix's big data storage.
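
To inspect the plan Phoenix produces, you can prefix a query with EXPLAIN; the output describes the parallel scans and server-side filters that would run, without executing the query. A sketch using the hypothetical book_orders table above:

EXPLAIN SELECT bookname, quantity FROM book_orders WHERE orderid = 'o-100';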

Query performance optimization and other features

Apache Phoenix adds several performance enhancements and features to HBase queries.

Secondary indexes

HBase has a single index that is lexicographically sorted on the primary row key. Records can only be accessed through the row key. Accessing records through any column other than the row key requires scanning all of the data while applying the required filter. In a secondary index, the indexed columns or expressions form an alternate row key, allowing lookups and range scans on that index.

Create a secondary index with the CREATE INDEX command:

CREATE INDEX ix_purchasetype on SALTEDWEBLOGS (purchasetype, transactiondate) INCLUDE (bookname, quantity);

This approach can yield a significant performance increase over single-indexed queries. This type of secondary index is a covering index: it contains all of the columns included in the query, so no table lookup is required and the index alone satisfies the entire query.
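
For example, a query that touches only the indexed and included columns can be served entirely by ix_purchasetype (the literal value here is illustrative):

-- purchasetype is the leading indexed column; bookname and quantity are INCLUDE
-- columns, so no lookup against the base SALTEDWEBLOGS table is needed.
SELECT bookname, quantity
FROM SALTEDWEBLOGS
WHERE purchasetype = 'online';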

Views

Phoenix views provide a way to overcome an HBase limitation where performance begins to degrade when you create more than about 100 physical tables. Phoenix views enable multiple virtual tables to share one underlying physical HBase table.

Creating a Phoenix view is similar to using standard SQL view syntax. One difference is that you can define columns for your view in addition to the columns inherited from its base table. You can also add new KeyValue columns.

For example, here is a physical table named product_metrics with the following definition:

CREATE TABLE product_metrics (
    metric_type CHAR(1),
    created_by VARCHAR, 
    created_date DATE, 
    metric_id INTEGER
    CONSTRAINT pk PRIMARY KEY (metric_type, created_by, created_date, metric_id));

Define a view over this table, with additional columns:

CREATE VIEW mobile_product_metrics (carrier VARCHAR, dropped_calls BIGINT) AS
SELECT * FROM product_metrics
WHERE metric_type = 'm';

To add more columns later, use the ALTER VIEW statement.
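
For example, the following sketch adds a hypothetical column to the view defined above:

ALTER VIEW mobile_product_metrics ADD supported_apps INTEGER;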

Skip scan

Skip scan uses one or more columns of a composite index to find distinct values. Unlike a range scan, skip scan implements intra-row scanning, yielding improved performance. While scanning, the first matched value is skipped along with the index until the next value is found.

A skip scan uses the SEEK_NEXT_USING_HINT enumeration of the HBase filter. Using SEEK_NEXT_USING_HINT, the skip scan keeps track of which set of keys, or ranges of keys, is being searched for in each column. The skip scan then takes a key that was passed to it during filter evaluation and determines whether it is one of those combinations. If not, the skip scan evaluates the next highest key to jump to.
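
As a sketch using the product_metrics table from the views example (the literal values are illustrative), constraining the leading primary key columns to a small set of discrete values lets Phoenix jump between matching key ranges instead of scanning the whole range; running EXPLAIN on the query shows whether a skip scan was chosen:

-- Discrete values on the leading primary key columns allow Phoenix to seek
-- directly between matching key ranges rather than scanning the full range.
SELECT created_date, metric_id
FROM product_metrics
WHERE metric_type IN ('m', 'w')
  AND created_by = 'system';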

Transactions

While HBase provides row-level transactions, Phoenix integrates with Tephra to add cross-row and cross-table transaction support with full ACID semantics.

As with traditional SQL transactions, transactions provided through the Phoenix transaction manager ensure that an atomic unit of data is successfully upserted, rolling back the transaction if the upsert operation fails on any transaction-enabled table.

To enable Phoenix transactions, see the Apache Phoenix transaction documentation.

To create a new table with transactions enabled, set the TRANSACTIONAL property to true in the CREATE statement:

CREATE TABLE my_table (k BIGINT PRIMARY KEY, v VARCHAR) TRANSACTIONAL=true;

To alter an existing table to be transactional, use the same property in an ALTER statement:

ALTER TABLE my_other_table SET TRANSACTIONAL=true;
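
As a minimal sketch (assuming my_table was created with TRANSACTIONAL=true and the transaction manager is running), several upserts can be committed or rolled back as one atomic unit; the commit itself is controlled by the client connection, for example by disabling JDBC auto-commit and calling commit() or rollback():

-- With auto-commit disabled on the client connection, both upserts belong to
-- the same transaction until the connection commits.
UPSERT INTO my_table VALUES (1, 'first value');
UPSERT INTO my_table VALUES (2, 'second value');
-- If either statement fails, rolling back the connection discards both changes.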

Note

You cannot switch a transactional table back to being non-transactional.

Salted tables

Region server hotspotting can occur when writing records with sequential keys to HBase. Though you may have multiple region servers in your cluster, your writes all land on just one. This concentration creates a hotspotting issue: instead of your write workload being distributed across all of the available region servers, a single server handles the load. Since each region has a predefined maximum size, when a region reaches that size limit, it is split into two smaller regions. When that happens, one of the new regions takes all new records, becoming the new hotspot.

To mitigate this problem and achieve better performance, pre-split tables so that all of the region servers are used equally. Phoenix provides salted tables, transparently adding a salting byte to the row key for a particular table. The table is pre-split on the salt byte boundaries to ensure equal load distribution among region servers during the initial phase of the table. This approach distributes the write workload across all of the available region servers, improving both write and read performance. To salt a table, specify the SALT_BUCKETS table property when the table is created:

CREATE TABLE Saltedweblogs (
    transactionid VARCHAR(500) PRIMARY KEY,
    transactiondate DATE NULL,
    customerid VARCHAR(50) NULL,
    bookid VARCHAR(50) NULL,
    purchasetype VARCHAR(50) NULL,
    orderid VARCHAR(50) NULL,
    bookname VARCHAR(50) NULL,
    categoryname VARCHAR(50) NULL,
    invoicenumber VARCHAR(50) NULL,
    invoicestatus VARCHAR(50) NULL,
    city VARCHAR(50) NULL,
    state VARCHAR(50) NULL,
    paymentamount DOUBLE NULL,
    quantity INTEGER NULL,
    shippingamount DOUBLE NULL) SALT_BUCKETS=4;
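
Salting is transparent to reads and writes: queries use the logical row key, and Phoenix adds and strips the salt byte internally. A sketch with illustrative values:

UPSERT INTO Saltedweblogs (transactionid, purchasetype, bookname, quantity)
    VALUES ('t-1001', 'online', 'Sample Book', 2);
SELECT bookname, quantity FROM Saltedweblogs WHERE transactionid = 't-1001';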

Enable and tune Phoenix with Apache Ambari

An HDInsight HBase cluster includes the Ambari UI for making configuration changes.

  1. To enable or disable Phoenix, and to control Phoenix's query timeout settings, log in to the Ambari Web UI (https://YOUR_CLUSTER_NAME.azurehdinsight.cn) using your Hadoop user credentials.

  2. Select HBase from the list of services in the left-hand menu, then select the Configs tab.

    Ambari HBase configuration

  3. Find the Phoenix SQL configuration section to enable or disable Phoenix, and set the query timeout.

    Ambari Phoenix SQL configuration section

See also