What is Azure HDInsight and the Apache Hadoop technology stack

This article introduces Apache Hadoop on Azure HDInsight. Azure HDInsight is a fully managed, full-spectrum, open-source analytics service for enterprises. You can use open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, and more.

What is HDInsight and the Hadoop technology stack?

Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop technology stack includes related software and utilities, such as Apache Hive, Apache HBase, Spark, Kafka, and many others.

Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. With these frameworks, you can enable a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.

To see the available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

What is big data?

Big data is collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before. It can be historical (meaning stored) or real time (meaning streamed from the source). See Scenarios for using HDInsight to learn about the most common use cases for big data.

Why should I use Hadoop on HDInsight?

This section lists the capabilities of Azure HDInsight.

Capability | Description
Cloud native | Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive query (LLAP), Kafka, Storm, and HBase on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
Low-cost and scalable | HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.
Secure and compliant | HDInsight enables you to protect your enterprise data assets with Azure Virtual Network and encryption. HDInsight also meets the most popular industry and government compliance standards.
Productivity | Azure HDInsight enables you to use rich productivity tools for Hadoop and Spark with your preferred development environments. These development environments include Visual Studio, VSCode, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists can also collaborate using popular notebooks such as Jupyter and Zeppelin.
Extensibility | You can extend HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big data certified applications.

Scenarios for using HDInsight

Azure HDInsight can be used for a variety of big data processing scenarios. The data can be historical (already collected and stored) or real time (streamed directly from the source). The scenarios for processing such data fall into the following categories:

Batch processing (ETL)

Extract, transform, and load (ETL) is a process where unstructured or structured data is extracted from heterogeneous data sources. It's then transformed into a structured format and loaded into a data store. You can use the transformed data for data science or data warehousing.
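The three ETL stages can be illustrated with a minimal sketch in plain Python. It is not HDInsight-specific: on a cluster, this work would typically be done with a framework such as Hive or Spark. The two in-memory inputs, the `readings` table, and all function names are hypothetical and chosen only for illustration; SQLite stands in for the target data store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device,temp_c\nsensor-a,21.5\nsensor-b,19.0\n"
JSON_SOURCE = '[{"id": "sensor-c", "temperature": {"celsius": 23.25}}]'


def extract():
    """Read raw records from both sources without reshaping them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Normalize both record shapes into (device, temp_c) tuples."""
    records = [(row["device"], float(row["temp_c"])) for row in csv_rows]
    records += [
        (row["id"], float(row["temperature"]["celsius"])) for row in json_rows
    ]
    return records


def load(records, connection):
    """Write the structured records into a single table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (device TEXT PRIMARY KEY, temp_c REAL)"
    )
    connection.executemany("INSERT OR REPLACE INTO readings VALUES (?, ?)", records)
    connection.commit()


def run_etl(connection):
    """Run extract, transform, and load, then return the stored rows."""
    csv_rows, json_rows = extract()
    load(transform(csv_rows, json_rows), connection)
    query = "SELECT device, temp_c FROM readings ORDER BY device"
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Keeping the stages as separate functions mirrors the description above: each source is read in its own format, and only the transform step needs to know how to reconcile them.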

Data warehousing

You can use HDInsight to perform interactive queries at petabyte scale over structured or unstructured data in any format. You can also build models and connect them to BI tools. For more information, read this customer story.

Figure: HDInsight architecture: Data warehousing

Internet of Things (IoT)

You can use HDInsight to process streaming data that's received in real time from a variety of devices. For more information, read this blog post from Azure that announces the public preview of Apache Kafka on HDInsight with Azure Managed disks.
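One common operation on device streams is aggregating readings over fixed time windows. The following plain-Python sketch shows a tumbling-window average; it is an illustration of the idea only, not the API of Kafka, Storm, or any HDInsight component. The event layout `(timestamp_seconds, device, value)` and the function name are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Average readings per (window start, device).

    Each event is a hypothetical (timestamp_seconds, device, value) tuple.
    Windows are fixed-size and non-overlapping.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for timestamp, device, value in events:
        # Align the timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


if __name__ == "__main__":
    sample = [(0, "a", 10.0), (5, "a", 20.0), (12, "a", 30.0), (3, "b", 4.0)]
    print(tumbling_window_average(sample, window_seconds=10))
```

A real streaming job would process events incrementally as they arrive and would need to handle late or out-of-order data, which this sketch does not attempt.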

Figure: HDInsight architecture: Internet of Things

Data science

You can use HDInsight to build applications that extract critical insights from data. You can also use Azure Machine Learning on top of that to predict future trends for your business. For more information, read this customer story.

Figure: HDInsight architecture: Data science

Hybrid

You can use HDInsight to extend your existing on-premises big data infrastructure to Azure and take advantage of the advanced analytics capabilities of the cloud.

Figure: HDInsight architecture: Hybrid

Cluster types in HDInsight

HDInsight includes specific cluster types and cluster customization capabilities, such as the capability to add components, utilities, and languages. HDInsight offers the following cluster types:

Open-source components in HDInsight

Azure HDInsight enables you to create clusters with open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters come with other open-source components such as Apache Ambari, Avro, Apache Hive, HCatalog, Apache Mahout, Apache Hadoop MapReduce, Apache Hadoop YARN, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Tez, Apache Oozie, and Apache ZooKeeper.

Programming languages in HDInsight

HDInsight clusters, including Spark, HBase, Kafka, Hadoop, and others, support many programming languages. Some programming languages aren't installed by default. For libraries, modules, or packages that aren't installed by default, use a script action to install the component.

Programming language | Information
Default programming language support | By default, HDInsight clusters support:
  • Java
  • Python
  • .NET
  • Go
Java virtual machine (JVM) languages | Many languages other than Java can run on a Java virtual machine (JVM). However, if you run some of these languages, you might have to install additional components on the cluster. The following JVM-based languages are supported on HDInsight clusters:
  • Clojure
  • Jython (Python for Java)
  • Scala
Hadoop-specific languages | HDInsight clusters support the following languages that are specific to the Hadoop technology stack:
  • Pig Latin for Pig jobs
  • HiveQL for Hive jobs and SparkSQL

Development tools for HDInsight

You can use HDInsight development tools, including IntelliJ, Eclipse, Visual Studio Code, and Visual Studio, to author and submit HDInsight data queries and jobs, with seamless integration with Azure.

Business intelligence on HDInsight

Familiar business intelligence (BI) tools retrieve, analyze, and report data that is integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:

Next steps

In this article, you learned what Azure HDInsight is and how it provides Hadoop and other cluster types on Azure. Proceed to the next article to learn how to create an Apache Hadoop cluster in HDInsight.