Azure HDInsight 4.0 overview

Azure HDInsight is one of the most popular services among enterprise customers for open-source Apache Hadoop and Apache Spark analytics on Azure. HDInsight 4.0 is a cloud distribution of Apache Hadoop components. This article provides information about the most recent Azure HDInsight release and how to upgrade.

What's new in HDInsight 4.0?

Apache Hive 3.0 and LLAP

Apache Hive low-latency analytical processing (LLAP) uses persistent query servers and in-memory caching to deliver quick SQL query results on data in remote cloud storage. Hive LLAP uses a set of persistent daemons that execute fragments of Hive queries. Query execution on LLAP is similar to Hive without LLAP, except that worker tasks run inside LLAP daemons instead of containers. A short client sketch follows the list of benefits below.

Benefits of Hive LLAP include:

  • Ability to perform deep SQL analytics, such as complex joins, subqueries, windowing functions, sorting, user-defined functions, and complex aggregations, without sacrificing performance and scalability.

  • Interactive queries against data in the same storage where the data is prepared, eliminating the need to move data from storage to another engine for analytical processing.

  • Caching query results allows previously computed query results to be reused, which saves time and resources spent running the cluster tasks required for the query.
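
The Interactive Query (LLAP) endpoint can be reached with any HiveServer2 client. The sketch below uses the pyhive package over a plain Thrift connection; the host name, credentials, and the sales table are hypothetical placeholders, and a real HDInsight cluster typically fronts HiveServer2 Interactive with its gateway, so the actual connection details will differ.

    # Minimal sketch of an interactive analytic query against LLAP,
    # assuming a direct HiveServer2 Thrift connection; host, credentials,
    # and the sales table are placeholders.
    from pyhive import hive

    conn = hive.connect(host="llap-head.example.com", port=10000,
                        username="hiveuser", database="default")
    cursor = conn.cursor()
    # Joins, windowing functions, and aggregations run as query fragments
    # inside the persistent LLAP daemons rather than in new containers.
    cursor.execute("""
        SELECT store_id,
               SUM(amount) AS total_sales,
               RANK() OVER (ORDER BY SUM(amount) DESC) AS sales_rank
        FROM   sales
        GROUP  BY store_id
    """)
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    conn.close()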

Hive dynamic materialized views

Hive now supports dynamic materialized views, or pre-computation of relevant summaries, used to accelerate query processing in data warehouses. Materialized views can be stored natively in Hive and can seamlessly use LLAP acceleration.
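
As an illustration, the HiveQL below (submitted here through the pyhive client, with hypothetical table and view names) creates a materialized view that the optimizer can use to rewrite matching aggregate queries. This is a sketch of the general Hive 3 syntax, and it assumes the base table is a transactional table, which Hive requires for automatic query rewriting.

    # Hypothetical sketch: pre-compute a daily sales summary so Hive can
    # rewrite matching queries against it. Connection details and table
    # names are placeholders.
    from pyhive import hive

    cursor = hive.connect(host="llap-head.example.com", port=10000,
                          username="hiveuser").cursor()
    cursor.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_day AS
        SELECT sale_date, SUM(amount) AS total_amount
        FROM   sales
        GROUP  BY sale_date
    """)
    # Aggregations of sales by day can now be answered from the
    # pre-computed summary instead of scanning the base table.
    cursor.close()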

Hive transactional tables

HDI 4.0 includes Apache Hive 3, which requires atomicity, consistency, isolation, and durability (ACID) compliance for transactional tables that reside in the Hive warehouse. ACID-compliant tables and table data are accessed and managed by Hive. Data in create, retrieve, update, and delete (CRUD) tables must be in Optimized Row Columnar (ORC) file format, but insert-only tables support all file formats. A minimal sketch follows the list below.

  • ACID v2 has performance improvements in both storage format and the execution engine.

  • ACID is enabled by default to allow full support for data updates.

  • Improved ACID capabilities allow you to update and delete at the row level.

  • No performance overhead.

  • No bucketing required.

  • Spark can read and write to Hive ACID tables via the Hive Warehouse Connector.
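
Below is a minimal sketch of a full-CRUD table, again with placeholder names and connection details. Because ACID is on by default in Hive 3, a managed table stored as ORC is transactional without extra table properties, and UPDATE and DELETE work at the row level with no bucketing clause.

    # Hypothetical sketch of row-level updates and deletes on a managed,
    # ORC-backed (and therefore transactional) Hive 3 table.
    from pyhive import hive

    cursor = hive.connect(host="llap-head.example.com", port=10000,
                          username="hiveuser").cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS customers (
            id INT, name STRING, city STRING
        ) STORED AS ORC
    """)
    cursor.execute("INSERT INTO customers VALUES (1, 'Alice', 'Seattle')")
    # Row-level ACID operations on the CRUD table.
    cursor.execute("UPDATE customers SET city = 'Redmond' WHERE id = 1")
    cursor.execute("DELETE FROM customers WHERE id = 1")
    cursor.close()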

Learn more about Apache Hive 3.

Apache Spark

Apache Spark gets updatable tables and ACID transactions with the Hive Warehouse Connector. The Hive Warehouse Connector allows you to register Hive transactional tables as external tables in Spark to access full transactional functionality. Previous versions only supported table partition manipulation. The Hive Warehouse Connector also supports streaming DataFrames for streaming reads and writes from Spark into transactional and streaming Hive tables.
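
A sketch of reading and writing a Hive transactional table from Spark through the connector follows. It assumes the Hive Warehouse Connector package and its Spark configuration (the HiveServer2 Interactive JDBC URL, LLAP daemon service hosts, and so on) are already set up on the cluster, and the table names are placeholders.

    # Hypothetical sketch of Spark reading and writing a Hive ACID table
    # through the Hive Warehouse Connector; assumes the connector and its
    # Spark configuration are already in place on the cluster.
    from pyspark.sql import SparkSession
    from pyspark_llap import HiveWarehouseSession

    spark = SparkSession.builder.appName("hwc-sketch").getOrCreate()
    hive = HiveWarehouseSession.session(spark).build()

    # Read a Hive transactional table into a Spark DataFrame via LLAP.
    sales_df = hive.executeQuery("SELECT * FROM sales WHERE amount > 100")

    # Write the DataFrame back to another Hive transactional table.
    (sales_df.write
        .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
        .mode("append")
        .option("table", "sales_filtered")
        .save())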

Spark executors can connect directly to Hive LLAP daemons to retrieve and update data in a transactional manner, allowing Hive to keep control of the data.

Apache Spark on HDInsight 4.0 supports the following scenarios:

  • Run machine learning model training over the same transactional table used for reporting.
  • Use ACID transactions to safely add columns from Spark ML to a Hive table.
  • Run a Spark streaming job on the change feed from a Hive streaming table.
  • Create ORC files directly from a Spark Structured Streaming job (a streaming sketch follows this list).
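
The streaming scenarios above rely on the connector's streaming data source. The sketch below writes a Structured Streaming DataFrame into a Hive transactional table; the data-source name and option keys follow published Hive Warehouse Connector examples and should be verified against the connector version on your cluster, and the metastore URI, table, and checkpoint path are placeholders.

    # Hypothetical sketch of a Structured Streaming write into a Hive
    # transactional table through the Hive Warehouse Connector streaming
    # data source; names, options, and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hwc-streaming-sketch").getOrCreate()

    # Placeholder source: a rate stream standing in for a real event feed.
    events = spark.readStream.format("rate").load()

    query = (events.writeStream
        .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
        .option("database", "default")
        .option("table", "events_acid")
        .option("metastoreUri", "thrift://metastore.example.com:9083")
        .option("checkpointLocation", "/tmp/hwc-checkpoint")
        .start())
    query.awaitTermination()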

You no longer have to worry about accidentally trying to access Hive transactional tables directly from Spark, which could result in inconsistent results, duplicate data, or data corruption. In HDInsight 4.0, Spark tables and Hive tables are kept in separate metastores. Use the Hive Warehouse Connector to explicitly register Hive transactional tables as Spark external tables.

Learn more about Apache Spark.

Apache Oozie

Apache Oozie 4.3.1 is included in HDI 4.0 with the following changes:

  • Oozie no longer runs Hive actions. The Hive CLI has been removed and replaced with BeeLine.

  • You can exclude unwanted dependencies from the share lib by including an exclude pattern in your job.properties file.
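
For example, a job.properties entry along the following lines excludes matching JARs from the share lib for a given action type. The property name follows the Oozie share-lib exclude-pattern convention and the regular expression is only an illustration, so verify both against the Oozie 4.3.1 documentation.

    # Hypothetical job.properties entry: drop guava JARs that the hive2
    # share lib would otherwise add to the action's classpath.
    oozie.action.sharelib.for.hive2.exclude=.*guava.*\.jar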

Learn more about Apache Oozie.

How to upgrade to HDInsight 4.0

As with any major release, it's important to thoroughly test your components before implementing the latest version in a production environment. HDInsight 4.0 is available for you to begin the upgrade process, but HDInsight 3.6 is the default option to prevent accidental mishaps.

There is no supported upgrade path from previous versions of HDInsight to HDInsight 4.0. Because metastore and blob data formats have changed, HDInsight 4.0 is not compatible with previous versions. It is important that you keep your new HDInsight 4.0 environment separate from your current production environment. If you deploy HDInsight 4.0 to your current environment, your metastore will be upgraded, and the change cannot be reversed.

Limitations

  • HDInsight 4.0 does not support MapReduce for Apache Hive. Use Apache Tez instead. Learn more about Apache Tez.

  • HDInsight 4.0 does not support Apache Storm.

  • Hive View is no longer available in HDInsight 4.0.

  • The Shell interpreter in Apache Zeppelin is not supported in Spark and Interactive Query clusters.

  • You can't disable LLAP on a Spark-LLAP cluster; you can only turn LLAP off.

  • Azure Data Lake Storage Gen2 can't save Jupyter notebooks in a Spark cluster.

Next steps