What is Azure HDInsight?

Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. You can use open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, and more.

What is HDInsight and the Hadoop technology stack?

Azure HDInsight is a cloud distribution of Hadoop components. It makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks, such as Hadoop, Spark, Hive, LLAP, Kafka, and Storm. With these frameworks, you can enable a broad range of scenarios, such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.

To see the available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

What is big data?

Big data is collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before. It can be historical (meaning stored) or real time (meaning streamed from the source). See Scenarios for using HDInsight to learn about the most common use cases for big data.

Why should I use Azure HDInsight?

This section lists the capabilities of Azure HDInsight.

Capability | Description
Cloud native | Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive Query (LLAP), Kafka, Storm, and HBase on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
Low-cost and scalable | HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.
Secure and compliant | HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.
Monitoring | Azure HDInsight integrates with Azure Monitor logs to provide a single interface with which you can monitor all your clusters.
Productivity | Azure HDInsight enables you to use rich productivity tools for Hadoop and Spark with your preferred development environment. These environments include Visual Studio, Visual Studio Code, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists can also collaborate using popular notebooks such as Jupyter and Zeppelin.
Extensibility | You can extend HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big data certified applications. HDInsight enables seamless integration with the most popular big data solutions through one-click deployment.

Scenarios for using HDInsight

Azure HDInsight can be used for a variety of big data processing scenarios. The data can be historical (already collected and stored) or real time (streamed directly from the source). The scenarios for processing such data fall into the following categories:

Batch processing (ETL)

Extract, transform, and load (ETL) is a process in which unstructured or structured data is extracted from heterogeneous data sources, transformed into a structured format, and loaded into a data store. You can use the transformed data for data science or data warehousing.
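The three ETL stages can be illustrated with a minimal, single-process sketch. It uses only the Python standard library and is not tied to any HDInsight API. The two inline sources, the field names, and the `readings` table are hypothetical stand-ins for real heterogeneous inputs and a real data store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device_id,temp_c\nd1,21.5\nd2,19.0\n"
JSON_SOURCE = (
    '{"device": "d3", "temperature": 22.25}\n'
    '{"device": "d4", "temperature": null}\n'
)


def extract():
    """Extract: yield raw records from both sources, tagged with their origin."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield "csv", row
    for line in JSON_SOURCE.splitlines():
        yield "json", json.loads(line)


def transform(records):
    """Transform: normalize each record to (device_id, temp_c); skip incomplete ones."""
    for origin, rec in records:
        if origin == "csv":
            device, temp = rec["device_id"], rec["temp_c"]
        else:
            device, temp = rec["device"], rec["temperature"]
        if temp is None:
            continue  # drop records with no temperature reading
        yield device, float(temp)


def load(rows, conn):
    """Load: write the structured rows into a table in the data store."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the full pipeline against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    query = "SELECT device_id, temp_c FROM readings ORDER BY device_id"
    print(conn.execute(query).fetchall())
```

On a cluster, each stage would typically run as a distributed job over files in cluster storage; the sketch only shows how the stages hand data to one another.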

Data warehousing

You can use HDInsight to perform interactive queries at petabyte scale over structured or unstructured data in any format. You can also build models and connect them to BI tools. For more information, read this customer story.

HDInsight architecture: Data warehousing

Internet of Things (IoT)

You can use HDInsight to process streaming data that's received in real time from different kinds of devices. For more information, read this blog post from Azure that announces the public preview of Apache Kafka on HDInsight with Azure Managed Disks.
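As a rough illustration of what processing a device stream can involve, the sketch below averages readings per device over fixed (tumbling) time windows. It is plain Python rather than a Kafka or Storm API. The event layout `(timestamp, device_id, value)` and the function name are assumptions made for the example, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average values per device over fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp, device_id, value) tuples that is
    assumed to be ordered by timestamp. Yields (window_start, averages) each
    time a window closes, where `averages` maps device_id to its mean value.
    """
    current_window = None
    sums = defaultdict(float)
    counts = defaultdict(int)

    for timestamp, device_id, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        # A new window has started: emit the finished one and reset the state.
        if current_window is not None and window_start != current_window:
            yield current_window, {d: sums[d] / counts[d] for d in sums}
            sums.clear()
            counts.clear()
        current_window = window_start
        sums[device_id] += value
        counts[device_id] += 1

    # Emit the last, possibly partial, window once the stream ends.
    if current_window is not None:
        yield current_window, {d: sums[d] / counts[d] for d in sums}


if __name__ == "__main__":
    sample = [
        (0, "a", 20.0), (5, "a", 22.0), (7, "b", 30.0),
        (12, "a", 24.0), (19, "b", 28.0), (21, "b", 26.0),
    ]
    for window_start, averages in tumbling_window_averages(sample, 10):
        print(window_start, averages)
```

A production streaming system also has to handle late or out-of-order events and state recovery, which this sketch deliberately leaves out.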

HDInsight architecture: Internet of Things

Data science

You can use HDInsight to build applications that extract critical insights from data. You can also use Azure Machine Learning on top of that to predict future trends for your business. For more information, read this customer story.

HDInsight architecture: Data science

Hybrid

You can use HDInsight to extend your existing on-premises big data infrastructure to Azure and take advantage of the advanced analytics capabilities of the cloud.

HDInsight architecture: Hybrid

Cluster types in HDInsight

HDInsight includes specific cluster types and cluster customization capabilities, such as the ability to add components, utilities, and languages. HDInsight offers the following cluster types:

Cluster type | Description
Apache Hadoop | A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.
Apache Spark | An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big data analysis applications. See What is Apache Spark in HDInsight?
Apache HBase | A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data (potentially billions of rows by millions of columns). See What is HBase on HDInsight?
ML Services | A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.
Apache Storm | A distributed, real-time computation system for processing large streams of data quickly. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Apache Interactive Query | In-memory caching for faster, interactive Hive queries. See Use Interactive Query in HDInsight.
Apache Kafka | An open-source platform that's used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
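The MapReduce programming model mentioned for the Apache Hadoop cluster type can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop code: on a cluster, the framework would run the map and reduce steps as parallel tasks across nodes and perform the shuffle itself. The function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of input documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data big clusters", "Data pipelines"]))
```

Because each map call depends only on its own input split and each reduce call only on one key's values, both phases can be distributed without coordination beyond the shuffle.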

Open-source components in HDInsight

Azure HDInsight enables you to create clusters with open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters come with other open-source components, such as Apache Ambari, Avro, Apache Hive, HCatalog, Apache Mahout, Apache Hadoop MapReduce, Apache Hadoop YARN, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Tez, Apache Oozie, and Apache ZooKeeper.

Programming languages in HDInsight

HDInsight clusters, including Spark, HBase, Kafka, Hadoop, and others, support many programming languages. Some programming languages aren't installed by default. For libraries, modules, or packages that aren't installed by default, use a script action to install the component.

Programming language | Information
Default programming language support | By default, HDInsight clusters support:
  • Java
  • Python
  • .NET
  • Go
Java virtual machine (JVM) languages | Many languages other than Java can run on a Java virtual machine (JVM). However, if you run some of these languages, you might have to install additional components on the cluster. The following JVM-based languages are supported on HDInsight clusters:
  • Clojure
  • Jython (Python for Java)
  • Scala
Hadoop-specific languages | HDInsight clusters support the following languages that are specific to the Hadoop technology stack:
  • Pig Latin for Pig jobs
  • HiveQL for Hive jobs, and SparkSQL

Development tools for HDInsight

You can use HDInsight development tools, including IntelliJ, Eclipse, Visual Studio Code, and Visual Studio, to author and submit HDInsight data queries and jobs with seamless integration with Azure.

Business intelligence on HDInsight

Familiar business intelligence (BI) tools retrieve, analyze, and report data that is integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver.

Next steps