Integrate Apache Spark and Apache Hive with Hive Warehouse Connector in Azure HDInsight

The Apache Hive Warehouse Connector (HWC) is a library that allows you to work more easily with Apache Spark and Apache Hive. It supports tasks such as moving data between Spark DataFrames and Hive tables, and directing Spark streaming data into Hive tables. The Hive Warehouse Connector works like a bridge between Spark and Hive. It also supports Scala, Java, and Python as development languages.

The Hive Warehouse Connector lets you take advantage of the unique features of Hive and Spark to build powerful big-data applications.

Apache Hive offers support for database transactions that are atomic, consistent, isolated, and durable (ACID). For more information on ACID and transactions in Hive, see Hive Transactions. Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing (LLAP), which aren't available in Apache Spark.

Apache Spark has a Structured Streaming API that provides streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 and Apache Hive 3.1.0 have separate metastores, which can make interoperability difficult. The Hive Warehouse Connector makes it easier to use Spark and Hive together. The HWC library loads data from LLAP daemons to Spark executors in parallel, which makes it more efficient and adaptable than a standard JDBC connection from Spark to Hive.

Hive Warehouse Connector architecture

Some of the operations supported by the Hive Warehouse Connector are listed below; the Scala sketch after the list illustrates several of them:

  • Describing a table
  • Creating a table for ORC-formatted data
  • Selecting Hive data and retrieving a DataFrame
  • Writing a DataFrame to Hive in batch
  • Executing a Hive update statement
  • Reading table data from Hive, transforming it in Spark, and writing it to a new Hive table
  • Writing a DataFrame or Spark stream to Hive using HiveStreaming
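
For orientation, here's a minimal Scala sketch showing what several of these operations look like through the HWC API, as they might be run from a configured spark-shell session (setup and launch steps are covered later in this article). The table, column, and query names such as hwc_demo are hypothetical placeholders, not part of the product documentation.

    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._

    // Build an HWC session from the SparkSession (available as `spark` in spark-shell).
    val hive = HiveWarehouseSession.session(spark).build()

    // Describe a table.
    hive.describeTable("hwc_demo").show()

    // Create a table for ORC-formatted data (hypothetical schema).
    hive.createTable("hwc_demo_copy").ifNotExists().column("id", "bigint").column("name", "string").create()

    // Select Hive data and retrieve a DataFrame.
    val df = hive.executeQuery("SELECT id, name FROM hwc_demo")

    // Write the DataFrame to Hive in batch.
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "hwc_demo_copy").save()

    // Execute a Hive update statement (assumes an ACID table).
    hive.executeUpdate("UPDATE hwc_demo_copy SET name = 'unknown' WHERE name IS NULL")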

Hive Warehouse Connector setup

Hive Warehouse Connector needs separate clusters for Spark and Interactive Query workloads. Follow these steps to set up these clusters in Azure HDInsight.

Create clusters

  1. Create an HDInsight Spark 4.0 cluster with a storage account and a custom Azure virtual network. For information on creating a cluster in an Azure virtual network, see Add HDInsight to an existing virtual network.

  2. Create an HDInsight Interactive Query (LLAP) 4.0 cluster with the same storage account and Azure virtual network as the Spark cluster.

Configure HWC settings

Gather preliminary information

  1. From a web browser, navigate to https://LLAPCLUSTERNAME.azurehdinsight.cn/#/main/services/HIVE, where LLAPCLUSTERNAME is the name of your Interactive Query cluster.

  2. Navigate to Summary > HiveServer2 Interactive JDBC URL and note the value. The value may be similar to: jdbc:hive2://zk0-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181,zk1-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181,zk4-iqgiro.rekufuk2y2ce.bx.internal.chinacloudapp.cn:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive.

  3. Navigate to Configs > Advanced > Advanced hive-site > hive.zookeeper.quorum and note the value. The value may be similar to: zk0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181,zk1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181,zk4-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:2181.

  4. Navigate to Configs > Advanced > General > hive.metastore.uris and note the value. The value may be similar to: thrift://iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:9083,thrift://hn1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.chinacloudapp.cn:9083.

  5. Navigate to Configs > Advanced > Advanced hive-interactive-site > hive.llap.daemon.service.hosts and note the value. The value may be similar to: @llap0.

Configure Spark cluster settings

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/#/main/services/SPARK2/configs, where CLUSTERNAME is the name of your Apache Spark cluster.

  2. Expand Custom spark2-defaults.

Apache Ambari Spark2 configuration

  1. Select Add Property... to add the following configurations:

    Configuration                                       Value
    spark.datasource.hive.warehouse.load.staging.dir    wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.chinacloudapi.cn/tmp. Set to a suitable HDFS-compatible staging directory. If you have two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace STORAGE_ACCOUNT_NAME with the name of the storage account being used by the cluster, and STORAGE_CONTAINER_NAME with the name of the storage container.
    spark.sql.hive.hiveserver2.jdbc.url                 The value you obtained earlier from HiveServer2 Interactive JDBC URL.
    spark.datasource.hive.warehouse.metastoreUri        The value you obtained earlier from hive.metastore.uris.
    spark.security.credentials.hiveserver2.enabled      true for YARN cluster mode and false for YARN client mode.
    spark.hadoop.hive.zookeeper.quorum                  The value you obtained earlier from hive.zookeeper.quorum.
    spark.hadoop.hive.llap.daemon.service.hosts         The value you obtained earlier from hive.llap.daemon.service.hosts.
  2. Save your changes and restart all affected components.
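
After the restart, you can optionally confirm from a Spark shell on the Spark cluster that the settings were picked up. This is only a sanity-check sketch, not a required step; the key names are the ones configured above.

    // Print the HWC-related settings visible to the Spark session.
    Seq(
      "spark.sql.hive.hiveserver2.jdbc.url",
      "spark.datasource.hive.warehouse.metastoreUri",
      "spark.datasource.hive.warehouse.load.staging.dir",
      "spark.security.credentials.hiveserver2.enabled",
      "spark.hadoop.hive.zookeeper.quorum",
      "spark.hadoop.hive.llap.daemon.service.hosts"
    ).foreach { key =>
      println(s"$key = ${spark.conf.getOption(key).getOrElse("<not set>")}")
    }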

Hive Warehouse Connector usage

You can choose between a few different methods to connect to your Interactive Query cluster and execute queries using the Hive Warehouse Connector. Supported methods include tools such as spark-shell and spark-submit.

Below are some examples of connecting to HWC from Spark.

Spark-shell

  1. Use the ssh command to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. From your ssh session, execute the following command to note the hive-warehouse-connector-assembly version:

    ls /usr/hdp/current/hive_warehouse_connector
    
  3. Edit the code below with the hive-warehouse-connector-assembly version identified above, and then execute the command to start the Spark shell:

    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    
  4. After starting the Spark shell, a Hive Warehouse Connector instance can be started using the following commands:

    import com.hortonworks.hwc.HiveWarehouseSession
    val hive = HiveWarehouseSession.session(spark).build()
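    // Optional smoke test (a sketch; assumes the connector settings above are in place):
    // listing the databases visible through HWC confirms connectivity to the LLAP cluster.
    hive.showDatabases().show()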
    

Spark-submit

Once you build the Scala/Java code along with the dependencies into an assembly jar, use the following command to launch a Spark application. Replace <VERSION> and <APP_JAR_PATH> with the actual values.
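
As an illustration, here's a minimal, hypothetical Scala sketch of what such an application might look like. The object name matches the myHwcApp class referenced in the launch commands below; the table names and the transformation are placeholders, not part of the product documentation.

    import org.apache.spark.sql.SparkSession
    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._

    // Minimal, hypothetical HWC application. Build it into an assembly jar and launch it
    // with one of the spark-submit commands shown below.
    object myHwcApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("myHwcApp").getOrCreate()
        val hive = HiveWarehouseSession.session(spark).build()

        // Read table data from Hive, transform it in Spark, and write it to a new Hive table.
        val df = hive.executeQuery("SELECT * FROM source_table")   // placeholder source table
        val transformed = df.filter("id IS NOT NULL")              // placeholder transformation
        transformed.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "target_table").save()

        spark.stop()
      }
    }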

  • YARN client mode

    spark-submit \
    --class myHwcApp \
    --master yarn \
    --deploy-mode client \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false \
    /<APP_JAR_PATH>/myHwcAppProject.jar
    
  • YARN cluster mode

    spark-submit \
    --class myHwcApp \
    --master yarn \
    --deploy-mode cluster \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=true \
    /<APP_JAR_PATH>/myHwcAppProject.jar
    

For Python, add the following configuration as well.

--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<VERSION>.zip

Next steps