Integrate Apache Spark and Apache Hive with Hive Warehouse Connector in Azure HDInsight

The Apache Hive Warehouse Connector (HWC) is a library that allows you to work more easily with Apache Spark and Apache Hive. It supports tasks such as moving data between Spark DataFrames and Hive tables, and directing Spark streaming data into Hive tables. The Hive Warehouse Connector works like a bridge between Spark and Hive. It also supports Scala, Java, and Python as development languages.

The Hive Warehouse Connector lets you take advantage of the unique features of Hive and Spark to build powerful big-data applications.

Apache Hive offers support for database transactions that are atomic, consistent, isolated, and durable (ACID). For more information on ACID and transactions in Hive, see Hive Transactions. Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing (LLAP), which are not available in Apache Spark.

Apache Spark has a Structured Streaming API that provides streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 and Apache Hive 3.1.0 have separate metastores, which can make interoperability difficult. The Hive Warehouse Connector makes it easier to use Spark and Hive together. The HWC library loads data from LLAP daemons to Spark executors in parallel, making it more efficient and adaptable than a standard JDBC connection from Spark to Hive.

Hive Warehouse Connector architecture

Some of the operations supported by the Hive Warehouse Connector are listed below; a short Scala sketch of a few of them follows the list:

  • Describing a table
  • Creating a table for ORC-formatted data
  • Selecting Hive data and retrieving a DataFrame
  • Writing a DataFrame to Hive in batch
  • Executing a Hive update statement
  • Reading table data from Hive, transforming it in Spark, and writing it to a new Hive table
  • Writing a DataFrame or Spark stream to Hive using HiveStreaming
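
As an illustration only, the following sketch shows a few of these operations through the HWC API from a Scala Spark session. The table names (hotel_reviews, reviews_copy) are placeholders, and the session setup is covered in detail in the Spark-shell section later in this article:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from an existing SparkSession (see the Spark-shell section below)
val hive = HiveWarehouseSession.session(spark).build()

// Describe an existing table and create a new table (managed Hive tables default to ORC in Hive 3)
hive.describeTable("hotel_reviews").show()
hive.createTable("reviews_copy").ifNotExists()
  .column("id", "bigint")
  .column("review", "string")
  .create()

// Select Hive data into a Spark DataFrame, then write a DataFrame back to Hive in batch
val df = hive.executeQuery("SELECT id, review FROM hotel_reviews")
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "reviews_copy")
  .save()
```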

Hive Warehouse Connector setup

Important

  • The HiveServer2 Interactive instance installed on Spark 2.4 Enterprise Security Package clusters isn't supported for use with the Hive Warehouse Connector. Instead, you must configure a separate HiveServer2 Interactive cluster to host your HiveServer2 Interactive workloads. A Hive Warehouse Connector configuration that uses a single Spark 2.4 cluster isn't supported.
  • The Hive Warehouse Connector (HWC) library isn't supported for use with Interactive Query clusters where the Workload Management (WLM) feature is enabled.
    In a scenario where you only have Spark workloads and want to use the HWC library, ensure that the Interactive Query cluster doesn't have the Workload Management feature enabled (the hive.server2.tez.interactive.queue configuration isn't set in Hive configs).
    For a scenario where both Spark workloads (HWC) and native LLAP workloads exist, you need to create two separate Interactive Query clusters that share a metastore database: one cluster for native LLAP workloads, where the WLM feature can be enabled as needed, and another for HWC-only workloads, where the WLM feature shouldn't be configured. You can view the WLM resource plans from both clusters even if the feature is enabled in only one of them. Don't make any changes to resource plans in the cluster where the WLM feature is disabled, because doing so might affect WLM functionality in the other cluster.

Hive Warehouse Connector needs separate clusters for Spark and Interactive Query workloads. Follow these steps to set up these clusters in Azure HDInsight.

Create clusters

  1. Create an HDInsight Spark 4.0 cluster with a storage account and a custom Azure virtual network. For information on creating a cluster in an Azure virtual network, see Add HDInsight to an existing virtual network.

  2. Create an HDInsight Interactive Query (LLAP) 4.0 cluster with the same storage account and Azure virtual network as the Spark cluster.

Configure HWC settings

Gather preliminary information

  1. From a web browser, navigate to https://LLAPCLUSTERNAME.azurehdinsight.cn/#/main/services/HIVE, where LLAPCLUSTERNAME is the name of your Interactive Query cluster.

  2. Navigate to Summary > HiveServer2 Interactive JDBC URL and note the value. The value may be similar to: jdbc:hive2://zk0-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181,zk1-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181,zk4-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive.

  3. Navigate to Configs > Advanced > Advanced hive-site > hive.zookeeper.quorum and note the value. The value may be similar to: zk0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181,zk1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181,zk4-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181.

  4. Navigate to Configs > Advanced > General > hive.metastore.uris and note the value. The value may be similar to: thrift://iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:9083,thrift://hn1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:9083.

  5. Navigate to Configs > Advanced > Advanced hive-interactive-site > hive.llap.daemon.service.hosts and note the value. The value may be similar to: @llap0.

Configure Spark cluster settings

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/#/main/services/SPARK2/configs, where CLUSTERNAME is the name of your Apache Spark cluster.

  2. Expand Custom spark2-defaults.

Apache Ambari Spark2 configuration

  1. Select Add Property... to add the following configurations:

       • spark.datasource.hive.warehouse.load.staging.dir: wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp. Set this to a suitable HDFS-compatible staging directory. If you have two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace STORAGE_ACCOUNT_NAME with the name of the storage account used by the cluster, and STORAGE_CONTAINER_NAME with the name of the storage container.
       • spark.sql.hive.hiveserver2.jdbc.url: the value you obtained earlier from HiveServer2 Interactive JDBC URL.
       • spark.datasource.hive.warehouse.metastoreUri: the value you obtained earlier from hive.metastore.uris.
       • spark.security.credentials.hiveserver2.enabled: true for YARN cluster mode, false for YARN client mode.
       • spark.hadoop.hive.zookeeper.quorum: the value you obtained earlier from hive.zookeeper.quorum.
       • spark.hadoop.hive.llap.daemon.service.hosts: the value you obtained earlier from hive.llap.daemon.service.hosts.
  2. Save changes and restart all affected components. After the restart, you can spot-check the new settings as shown in the sketch below.
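
As an optional sanity check (a minimal sketch, assuming a spark-shell session on the Spark cluster after the restart), the properties added above can be read back from the Spark configuration:

```scala
// Read back a few of the HWC-related properties defined in Custom spark2-defaults
// (the keys are the ones added above; the values reflect your own cluster)
println(spark.sparkContext.getConf.get("spark.sql.hive.hiveserver2.jdbc.url"))
println(spark.sparkContext.getConf.get("spark.datasource.hive.warehouse.metastoreUri"))
println(spark.sparkContext.getConf.get("spark.hadoop.hive.llap.daemon.service.hosts"))
```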

Configure HWC for Enterprise Security Package (ESP) clusters

The Enterprise Security Package (ESP) provides enterprise-grade capabilities like Active Directory-based authentication, multi-user support, and role-based access control for Apache Hadoop clusters in Azure HDInsight. For more information on ESP, see Use Enterprise Security Package in HDInsight.

Apart from the configurations mentioned in the previous section, add the following configuration to use HWC on ESP clusters.

  1. From the Ambari web UI of the Spark cluster, navigate to Spark2 > CONFIGS > Custom spark2-defaults.

  2. Update the following property.

     spark.sql.hive.hiveserver2.jdbc.url.principal: hive/<llap-headnode>@<AAD-Domain>
    • From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.net/#/main/services/HIVE/summary, where CLUSTERNAME is the name of your Interactive Query cluster. Click HiveServer2 Interactive. You'll see the fully qualified domain name (FQDN) of the head node on which LLAP is running, as shown in the screenshot. Replace <llap-headnode> with this value.

      Hive Warehouse Connector head node

    • Use the ssh command to connect to your Interactive Query cluster. Look for the default_realm parameter in the /etc/krb5.conf file. Replace <AAD-DOMAIN> with this value as an uppercase string; otherwise the credential won't be found.

      Hive Warehouse Connector AAD domain

    • For example: hive/hn0-ng36ll.mjry42ikpruuxgs2qy2kpg4q5e.cx.internal.cloudapp.net@PKRSRVUQVMAE6J85.D2.INTERNAL.CLOUDAPP.NET.

  3. Save changes and restart components as needed.

Hive Warehouse Connector usage

You can choose from a few different methods to connect to your Interactive Query cluster and execute queries using the Hive Warehouse Connector. Below are some examples of connecting to HWC from Spark.

Spark-shell

  1. Use the ssh command to connect to your Apache Spark cluster. Edit the following command by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. From your ssh session, execute the following command to note the hive-warehouse-connector-assembly version:

    ls /usr/hdp/current/hive_warehouse_connector
    
  3. Edit the code below with the hive-warehouse-connector-assembly version identified above. Then execute the command to start the Spark shell:

    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    
  4. After starting the Spark shell, a Hive Warehouse Connector instance can be started using the following commands; a short usage sketch follows:

    import com.hortonworks.hwc.HiveWarehouseSession
    val hive = HiveWarehouseSession.session(spark).build()
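
    Once the session is built, you can exercise it directly from the shell. This is a minimal sketch only; the table name is a placeholder for one of your own Hive tables:

    hive.showDatabases().show()
    hive.setDatabase("default")
    hive.showTables().show()
    hive.executeQuery("SELECT * FROM mytable").show()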
    

Spark-submit

Once you build your Scala/Java code along with its dependencies into an assembly jar, use the command below to launch a Spark application. Replace <VERSION> and <APP_JAR_PATH> with the actual values. A minimal skeleton of such an application is sketched at the end of this section.

  • YARN client mode

    spark-submit \
    --class myHwcApp \
    --master yarn \
    --deploy-mode client \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    /<APP_JAR_PATH>/myHwcAppProject.jar
    
  • YARN cluster mode

    spark-submit \
    --class myHwcApp \
    --master yarn \
    --deploy-mode cluster \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=true
    /<APP_JAR_PATH>/myHwcAppProject.jar
    

For Python, add the following configuration as well.

```python
--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<VERSION>.zip
```
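
As an illustration only, here is a minimal skeleton of the kind of application the commands above launch. The class name myHwcApp is the placeholder used in the spark-submit commands, and the demo table is the sample table created later in this article; adapt the query to your own data:

```scala
import org.apache.spark.sql.SparkSession
import com.hortonworks.hwc.HiveWarehouseSession

object myHwcApp {
  def main(args: Array[String]): Unit = {
    // Reuses the cluster's Spark configuration, including the HWC settings added earlier
    val spark = SparkSession.builder().appName("HWC example").getOrCreate()
    val hive = HiveWarehouseSession.session(spark).build()

    // Read from Hive through the connector and show a few rows
    hive.executeQuery("SELECT * FROM demo").show(10)

    spark.stop()
  }
}
```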

Run queries on Enterprise Security Package (ESP) clusters

Use kinit before starting spark-shell or spark-submit. Replace USERNAME with the name of a domain account that has permissions to access the cluster, and then execute the following command:

kinit USERNAME

Securing data on Spark ESP clusters

  1. Create a table demo with some sample data by entering the following commands (a sketch of running the same statements through the HWC session appears after this procedure):

    create table demo (name string);
    INSERT INTO demo VALUES ('HDinsight');
    INSERT INTO demo VALUES ('Microsoft');
    INSERT INTO demo VALUES ('InteractiveQuery');
    
  2. View the table's contents with the following command. Before applying the policy, the demo table shows the full column.

    hive.executeQuery("SELECT * FROM demo").show()
    

    Demo table before applying the Ranger policy

  3. Apply a column-masking policy that only shows the last four characters of the column.

    1. Go to the Ranger Admin UI at https://LLAPCLUSTERNAME.azurehdinsight.cn/ranger/.

    2. Under Hive, click the Hive service for your cluster.

      Ranger Service Manager

    3. Click the Masking tab and then Add New Policy.

      Hive Warehouse Connector Ranger Hive policy list

    4. Provide a desired policy name. Select database: Default, Hive table: demo, Hive column: name, User: rsadmin2, Access Types: select, and Partial mask: show last 4 from the Select Masking Option menu. Click Add.

      Create policy

  4. View the table's contents again. After applying the Ranger policy, we can see only the last four characters of the column.

    Demo table after applying the Ranger policy
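
If you're working from the spark-shell HWC session shown earlier, the same statements can be issued through the connector instead. This is a minimal sketch under that assumption:

```scala
// Create and populate the demo table through the HWC session, then read it back.
// The Ranger masking policy governs what executeQuery returns for the masked user.
hive.executeUpdate("CREATE TABLE demo (name string)")
hive.executeUpdate("INSERT INTO demo VALUES ('HDinsight')")
hive.executeUpdate("INSERT INTO demo VALUES ('Microsoft')")
hive.executeUpdate("INSERT INTO demo VALUES ('InteractiveQuery')")
hive.executeQuery("SELECT * FROM demo").show()
```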

Next steps