Azure HDInsight 4.0 overview

Azure HDInsight is one of the most popular services among enterprise customers for Apache Hadoop and Apache Spark. HDInsight 4.0 is a cloud distribution of Apache Hadoop components. This article provides information about the most recent Azure HDInsight release and how to upgrade.

What's new in HDInsight 4.0?

Apache Hive 3.0 and low-latency analytical processing

Apache Hive low-latency analytical processing (LLAP) uses persistent query servers and in-memory caching. This approach delivers fast SQL query results on data in remote cloud storage. Hive LLAP uses a set of persistent daemons that execute fragments of Hive queries. Query execution on LLAP is similar to Hive without LLAP, except that worker tasks run inside LLAP daemons instead of containers.

Benefits of Hive LLAP include:

  • Deep SQL analytics, such as complex joins, subqueries, windowing functions, sorting, user-defined functions, and complex aggregations, without sacrificing performance or adaptability.

  • Interactive queries against data in the same storage where data is prepared, eliminating the need to move data from storage to another engine for analytical processing.

  • Caching query results allows previously computed query results to be reused. This cache saves the time and resources that would otherwise be spent running the cluster tasks required for the query.
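The idea behind reusing query results can be illustrated with a toy cache. This is a minimal sketch for intuition only, not how Hive LLAP implements its results cache; the class name `QueryResultsCache` and the `run_query` callable are hypothetical.

```python
class QueryResultsCache:
    """Toy results cache keyed by query text (illustrative only)."""

    def __init__(self, run_query):
        self._run_query = run_query  # hypothetical callable that executes a query
        self._results = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql):
        # Naive normalization: collapse runs of whitespace.
        # A real engine compares query plans, not raw text.
        return " ".join(sql.split())

    def execute(self, sql):
        key = self._key(sql)
        if key in self._results:
            self.hits += 1  # reuse the previously computed result
            return self._results[key]
        self.misses += 1  # first time: pay the full execution cost
        result = self._run_query(sql)
        self._results[key] = result
        return result

    def invalidate(self):
        # Cached results must be dropped when the underlying data changes.
        self._results.clear()
```

The second execution of an equivalent query returns the stored result without calling `run_query` again, which is where the savings in cluster tasks come from.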

Hive dynamic materialized views

Hive now supports dynamic materialized views, or pre-computation of relevant summaries. The views accelerate query processing in data warehouses. Materialized views can be stored natively in Hive, and can seamlessly use LLAP acceleration.
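As a rough illustration, the HiveQL for pre-computing a summary can be assembled as below. This is a sketch under assumptions: the table `sales`, its columns, and the view name `daily_sales_mv` are hypothetical, and the helper functions only build statement text that you would then run through Beeline or another Hive client.

```python
def create_materialized_view_sql(view_name, select_sql):
    """Return a HiveQL statement that materializes the result of select_sql."""
    return f"CREATE MATERIALIZED VIEW {view_name} AS\n{select_sql.strip()}"


def rebuild_materialized_view_sql(view_name):
    """Return a HiveQL statement that refreshes the view after source data changes."""
    return f"ALTER MATERIALIZED VIEW {view_name} REBUILD"


# Hypothetical summary over a hypothetical `sales` table.
summary = """
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date
"""

print(create_materialized_view_sql("daily_sales_mv", summary))
print(rebuild_materialized_view_sql("daily_sales_mv"))
```

Queries that aggregate `sales` by date are then candidates for being answered from the stored summary instead of the base table.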

Hive transactional tables

HDI 4.0 includes Apache Hive 3. Hive 3 requires atomicity, consistency, isolation, and durability (ACID) compliance for transactional tables that live in the Hive warehouse. ACID-compliant tables and table data are accessed and managed by Hive. Data in create, retrieve, update, and delete (CRUD) tables must be in Optimized Row Column (ORC) file format. Insert-only tables support all file formats.

  • ACID v2 has performance improvements in both storage format and the execution engine.

  • ACID is enabled by default to allow full support for data updates.

  • Improved ACID capabilities allow you to update and delete at row level.

  • No performance overhead.

  • No bucketing required.

  • Spark can read and write to Hive ACID tables via Hive Warehouse Connector.
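The file-format rule for transactional tables can be sketched as a small helper that builds the table DDL. This is an illustrative sketch, not an official tool: the function name and the example table are hypothetical, while the `transactional` and `transactional_properties` table properties follow Hive 3 conventions.

```python
def create_transactional_table_sql(table, columns, file_format="ORC", insert_only=False):
    """Build a CREATE TABLE statement for a Hive transactional table.

    CRUD tables must be ORC; insert-only tables may use any file format.
    """
    file_format = file_format.upper()
    if not insert_only and file_format != "ORC":
        raise ValueError("CRUD transactional tables must be stored as ORC")

    props = {"transactional": "true"}
    if insert_only:
        props["transactional_properties"] = "insert_only"

    cols = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    tblprops = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"STORED AS {file_format} TBLPROPERTIES ({tblprops})"
    )


# Hypothetical table definitions.
columns = [("id", "INT"), ("name", "STRING")]
print(create_transactional_table_sql("customers", columns))
print(create_transactional_table_sql("events", columns, "PARQUET", insert_only=True))
```

Asking for a full CRUD table in a format other than ORC raises an error here, mirroring the constraint described above.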

Learn more about Apache Hive 3.

Apache Spark

Apache Spark gets updatable tables and ACID transactions with Hive Warehouse Connector. Hive Warehouse Connector allows you to register Hive transactional tables as external tables in Spark to access full transactional functionality. Previous versions only supported table partition manipulation. Hive Warehouse Connector also supports Streaming DataFrames, so Spark can stream reads from, and writes to, transactional and streaming Hive tables.

Spark executors can connect directly to Hive LLAP daemons to retrieve and update data in a transactional manner, allowing Hive to keep control of the data.

Apache Spark on HDInsight 4.0 supports the following scenarios:

  • Run machine learning model training over the same transactional table used for reporting.
  • Use ACID transactions to safely add columns from Spark ML to a Hive table.
  • Run a Spark streaming job on the change feed from a Hive streaming table.
  • Create ORC files directly from a Spark Structured Streaming job.

You no longer have to worry about accidentally accessing Hive transactional tables directly from Spark, which could cause inconsistent results, duplicate data, or data corruption. In HDInsight 4.0, Spark tables and Hive tables are kept in separate Metastores. Use Hive Warehouse Connector to explicitly register Hive transactional tables as Spark external tables.
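Because Spark reaches Hive only through the connector, a Spark job needs to be told where HiveServer2 and the LLAP daemons are. The sketch below assembles `--conf` arguments for `spark-submit`. Treat it as an assumption-laden example: the three property names are the ones commonly listed in Hive Warehouse Connector setup guides and should be verified against your cluster's documentation, and all values shown are placeholders.

```python
def hwc_spark_conf(jdbc_url, staging_dir, llap_service_hosts):
    """Collect Spark settings commonly required by Hive Warehouse Connector.

    Property names are assumptions taken from typical HWC setup guides.
    """
    return {
        "spark.sql.hive.hiveserver2.jdbc.url": jdbc_url,
        "spark.datasource.hive.warehouse.load.staging.dir": staging_dir,
        "spark.hadoop.hive.llap.daemon.service.hosts": llap_service_hosts,
    }


def to_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Placeholder values; replace with the values from your own cluster.
conf = hwc_spark_conf(
    jdbc_url="<hiveserver2-interactive-jdbc-url>",
    staging_dir="<staging-directory>",
    llap_service_hosts="<llap-service-name>",
)
print(" ".join(to_submit_args(conf)))
```

Keeping these settings in one place makes it explicit that access to Hive transactional tables goes through the connector rather than through Spark's own catalog.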

Learn more about Apache Spark.

Apache Oozie

Apache Oozie 4.3.1 is included in HDI 4.0 with the following changes:

  • Oozie no longer runs Hive actions. Hive CLI has been removed and replaced with Beeline.

  • You can exclude unwanted dependencies from share lib by including an exclude pattern in your job.properties file.
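To get a feel for what an exclude pattern does, the sketch below filters a list of share lib paths with a regular expression. It is a simulation only, with several assumptions: the jar paths are hypothetical, the pattern is treated as a regex matched from the start of the path relative to the share lib root, and Oozie's actual matching rules may differ. In upstream Oozie documentation the corresponding job.properties entry has the form `oozie.action.sharelib.for.<action-type>.exclude=<pattern>`; confirm the exact property name for your Oozie version.

```python
import re


def filter_sharelib(jar_paths, exclude_pattern):
    """Return the jar paths that do NOT match the exclude pattern.

    Assumes the pattern is a regex anchored at the start of each relative path.
    """
    pattern = re.compile(exclude_pattern)
    return [path for path in jar_paths if not pattern.match(path)]


# Hypothetical share lib contents.
jars = [
    "oozie/jackson-core-2.4.4.jar",
    "oozie/jackson-databind-2.4.4.jar",
    "oozie/oozie-sharelib-oozie.jar",
    "spark/spark-core.jar",
]

for path in filter_sharelib(jars, r"oozie/jackson.*"):
    print(path)
```

Testing a pattern this way before adding it to job.properties helps avoid excluding more jars than intended.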

Learn more about Apache Oozie.

How to upgrade to HDInsight 4.0

Thoroughly test your components before implementing the latest version in a production environment. HDInsight 4.0 is available for you to begin the upgrade process. HDInsight 3.6 remains the default option to prevent accidental mishaps.

There's no supported upgrade path from previous versions of HDInsight to HDInsight 4.0. Because Metastore and blob data formats have changed, 4.0 isn't compatible with previous versions. It's important that you keep your new HDInsight 4.0 environment separate from your current production environment. If you deploy HDInsight 4.0 to your current environment, your Metastore will be permanently upgraded.

Limitations

  • HDInsight 4.0 does not support MapReduce for Apache Hive. Use Apache Tez instead. Learn more about Apache Tez.

  • HDInsight 4.0 does not support Apache Storm.

  • Hive View is no longer available in HDInsight 4.0.

  • Shell interpreter in Apache Zeppelin is not supported in Spark and Interactive Query clusters.

  • You can't disable LLAP on a Spark-LLAP cluster. You can only turn LLAP off.

  • Azure Data Lake Storage Gen2 can't save Jupyter notebooks in a Spark cluster.

  • Apache Pig runs on Tez by default. However, you can change it to MapReduce.

  • Spark SQL Ranger integration for row and column security is deprecated.

  • Spark 2.4 and Kafka 2.1 are available in HDInsight 4.0, so Spark 2.3 and Kafka 1.1 are no longer supported. We recommend using Spark 2.4 and Kafka 2.3 and above in HDInsight 4.0.

Next steps