Machine learning on HDInsight

HDInsight enables machine learning on big data, so you can extract valuable insights from large amounts (petabytes, or even exabytes) of structured, unstructured, and fast-moving data. There are several machine learning options in HDInsight: SparkML and Apache Spark MLlib, R, Apache Hive, and the Microsoft Cognitive Toolkit.

SparkML and MLlib

HDInsight Spark is an Azure-hosted offering of Apache Spark, a unified, open-source, parallel data processing framework that supports in-memory processing to boost big data analytics. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms used in machine learning and graph computations. Two scalable machine learning libraries bring algorithmic modeling capabilities to this distributed environment: MLlib and SparkML. MLlib contains the original API built on top of RDDs. SparkML is a newer package that provides a higher-level API built on top of DataFrames for constructing ML pipelines. SparkML doesn't yet support all of the features of MLlib, but it is replacing MLlib as Spark's standard machine learning library.

MMLSpark is the Microsoft Machine Learning library for Apache Spark. This library is designed to make data scientists more productive on Spark, increase the rate of experimentation, and leverage cutting-edge machine learning techniques, including deep learning, on very large datasets. MMLSpark provides a layer on top of SparkML's low-level APIs for building scalable ML models, handling tasks such as indexing strings, coercing data into the layout expected by machine learning algorithms, and assembling feature vectors. The MMLSpark library simplifies these and other common tasks for building models in PySpark.
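To make the terms "indexing strings", "coercing data", and "assembling feature vectors" concrete, the following plain-Python sketch shows what each step does to a few records. It deliberately uses no MMLSpark or SparkML calls; the function names, the record fields, and the "most frequent value gets index 0" ordering are assumptions chosen for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def assemble_features(row, index, categorical_col, numeric_cols):
    """Build one flat numeric feature vector from a raw record."""
    # Indexing: replace the string category with its integer index.
    vector = [float(index[row[categorical_col]])]
    # Coercing: force every numeric field (even "2.0" as text) to float.
    vector.extend(float(row[col]) for col in numeric_cols)
    return vector


# Hypothetical raw records with mixed types.
rows = [
    {"color": "red", "width": "2.0", "height": 3},
    {"color": "blue", "width": "1.5", "height": 4},
    {"color": "red", "width": "2.5", "height": 5},
]

color_index = fit_string_index(r["color"] for r in rows)
features = [
    assemble_features(r, color_index, "color", ["width", "height"]) for r in rows
]
```

Each record ends up as a fixed-length list of floats, which is the layout most learning algorithms expect as input.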


R

R is currently the most popular statistical programming language in the world. It's an open-source data visualization tool with a community of over 2.5 million users and growing. With its thriving user base and over 8,000 contributed packages, R is a likely choice for many companies that need machine learning. You can create an HDInsight cluster with ML Services ready to be used with massive datasets and models. This capability provides data scientists and statisticians with a familiar R interface that can scale on demand through HDInsight, without the overhead of cluster setup and maintenance.

[Figure: Training for prediction with R Server]

The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. You can also run R scripts across the nodes of the cluster by using ScaleR's Hadoop MapReduce or Spark compute contexts.

With ML Services on HDInsight with Spark, you can parallelize training across the nodes of a cluster by using a Spark compute context. You can run R scripts directly on the edge node, using all available cores in parallel, as needed. Alternatively, you can run your code from the edge node to kick off processing that is distributed across all nodes in the cluster. ML Services on HDInsight with Spark also enables parallelizing functions from open-source R packages, if desired.

Azure Machine Learning and Apache Hive

Azure Machine Learning provides tools to model predictive analytics, and a fully managed service you can use to deploy your predictive models as ready-to-consume web services. Azure Machine Learning is a complete predictive analytics solution in the cloud that you can use to create, test, operationalize, and manage predictive models. Select from a large algorithm library, use a web-based studio for building models, and easily deploy your model as a web service.

[Figure: Microsoft Azure Machine Learning overview]

Create features for data in an HDInsight Hadoop cluster using Hive queries. Feature engineering attempts to increase the predictive power of learning algorithms by creating features from raw data that facilitate the learning process. You can run HiveQL queries from Azure Machine Learning Studio (classic), and access data processed in Hive and stored in blob storage, by using the Import Data module.
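As a sketch of what a feature-engineering query might look like, the Python helper below assembles a HiveQL statement that derives three features from raw trip records. The table name, the column names, and the particular features are hypothetical and chosen only for illustration; the helper just produces the query text and does not connect to a cluster.

```python
import re


def build_feature_query(table):
    """Return a HiveQL query that derives features from a raw trips table.

    The table and column names are hypothetical placeholders.
    """
    # Accept only a plain identifier so the name can be embedded safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"not a simple table name: {table!r}")
    return (
        "SELECT\n"
        "  trip_id,\n"
        # Hour of day extracted from the pickup timestamp.
        "  hour(pickup_ts) AS pickup_hour,\n"
        # Trip duration in minutes from two timestamps.
        "  (unix_timestamp(dropoff_ts) - unix_timestamp(pickup_ts)) / 60.0"
        " AS duration_min,\n"
        # Binary flag derived from a raw count.
        "  CASE WHEN passenger_count > 1 THEN 1 ELSE 0 END AS is_shared\n"
        f"FROM {table}"
    )


query = build_feature_query("trips")
print(query)
```

Each derived column turns a raw value into something a learning algorithm can use more directly, which is the point of the feature engineering step described above.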

Microsoft Cognitive Toolkit

Deep learning is a branch of machine learning that uses neural networks, inspired by the biological processes of the human brain. Many researchers see deep learning as a promising approach for enhancing artificial intelligence. Examples of deep learning are spoken language translators, image recognition systems, and machine reasoning.

To help advance its own work in deep learning, Microsoft developed the free, easy-to-use, open-source Microsoft Cognitive Toolkit. This toolkit is being used by a wide variety of Microsoft products, by companies worldwide with a need to deploy deep learning at scale, and by students interested in the latest algorithms and techniques.

See also


Deep learning resources