Deep dive - advanced analytics

What is advanced analytics for HDInsight?

HDInsight lets you obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. Advanced analytics is the use of highly scalable architectures, statistical and machine learning models, and intelligent dashboards to provide you with meaningful insights. Machine learning, or predictive analytics, uses algorithms that identify and learn from relationships in your data to make predictions and guide your decisions.

Advanced analytics process

[Image: advanced analytics process]

After you've identified the business problem and have started collecting and processing your data, you need to create a model that represents the question you want to answer. Your model will use one or more machine learning algorithms to make the type of prediction that best fits your business needs. The majority of your data should be used to train your model, with the rest used to test or evaluate it.
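The train/test split described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the HDInsight tooling: the function name `train_test_split`, the 80/20 ratio, and the placeholder `records` list are all assumptions chosen for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into a training set and a test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for prepared records (assumption).
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before the cut avoids a test set that only contains the most recently collected rows, and the fixed seed makes the evaluation reproducible.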

After you create, load, test, and evaluate your model, the next step is to deploy your model so that it begins supplying answers to your questions. The last step is to monitor your model's performance and tune it as necessary.

Common types of algorithms

Advanced analytics solutions provide a set of machine learning algorithms. Here is a summary of the categories of algorithms and associated common business use cases.

[Image: machine learning use cases]

Along with selecting the best-fitting algorithms, you need to consider whether you need to provide data for training. Machine learning algorithms are categorized as follows:

  • Supervised - the algorithm needs to be trained on a set of labeled data before it can provide results
  • Semi-supervised - the algorithm can be augmented with extra targets, obtained through interactive queries to a trainer, that were not available during the initial stage of training
  • Unsupervised - the algorithm does not require labeled training data
  • Reinforcement - the algorithm uses software agents to determine ideal behavior within a specific context (often used in robotics)

Algorithm category | Use | Learning type | Algorithms
Classification | Classify people or things into groups | Supervised | Decision trees, logistic regression, neural networks
Clustering | Divide a set of examples into homogeneous groups | Unsupervised | K-means clustering
Pattern detection | Identify frequent associations in the data | Unsupervised | Association rules
Regression | Predict numerical outcomes | Supervised | Linear regression, neural networks
Reinforcement | Determine optimal behavior for robots | Reinforcement | Monte Carlo simulations, DeepMind

Machine learning on HDInsight

HDInsight has several machine learning options for an advanced analytics workflow:

  • Machine Learning and Apache Spark
  • Azure Machine Learning and Apache Hive
  • Apache Spark and Deep learning

Machine Learning and Apache Spark

HDInsight Spark is an Azure-hosted offering of Apache Spark, a unified, open-source, parallel data processing framework that uses in-memory processing to boost the performance of big data analytics. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms used in machine learning and graph computations.

Three scalable machine learning libraries bring algorithmic modeling capabilities to this distributed environment:

  • MLlib - MLlib contains the original API built on top of Spark RDDs.
  • SparkML - SparkML is a newer package that provides a higher-level API built on top of Spark DataFrames for constructing ML pipelines.
  • MMLSpark - The Microsoft Machine Learning library for Apache Spark (MMLSpark) is designed to make data scientists more productive on Spark, to increase the rate of experimentation, and to apply cutting-edge machine learning techniques, including deep learning, to very large datasets. The MMLSpark library simplifies common modeling tasks for building models in PySpark.

Azure Machine Learning and Apache Hive

Azure Machine Learning Studio provides tools to model predictive analytics, as well as a fully managed service you can use to deploy your predictive models as ready-to-consume web services. Azure Machine Learning provides tools for creating complete predictive analytics solutions in the cloud, so you can quickly create, test, operationalize, and manage predictive models. You can select from a large algorithm library, use a web-based studio to build models, and deploy your model as a web service.

Apache Spark and Deep learning

Deep learning is a branch of machine learning that uses deep neural networks (DNNs), inspired by the biological processes of the human brain. Many researchers see deep learning as a promising approach for artificial intelligence. Some examples of deep learning are spoken language translators, image recognition systems, and machine reasoning. To help advance its own work in deep learning, Microsoft has developed the free, easy-to-use, open-source Microsoft Cognitive Toolkit. The toolkit is being used extensively by a wide variety of Microsoft products, by companies worldwide with a need to deploy deep learning at scale, and by students interested in the latest algorithms and techniques.

Scenario - Score Images to Identify Patterns in Urban Development

Let's review an example of an advanced analytics machine learning pipeline using HDInsight.

In this scenario you will see how DNNs produced in a deep learning framework, Microsoft’s Cognitive Toolkit (CNTK), can be operationalized for scoring large image collections stored in an Azure Blob Storage account using PySpark on an HDInsight Spark cluster. This approach is applied to a common DNN use case, aerial image classification, and can be used to identify recent patterns in urban development. You will use a pre-trained image classification model. The model is pre-trained on the CIFAR-10 dataset and has been applied to 10,000 withheld images.

There are three key tasks in this advanced analytics scenario:

  1. Create an Azure HDInsight Hadoop cluster with an Apache Spark 2.1.0 distribution.
  2. Run a custom script to install Microsoft Cognitive Toolkit on all nodes of an Azure HDInsight Spark cluster.
  3. Upload a pre-built Jupyter notebook to your HDInsight Spark cluster to apply a trained Microsoft Cognitive Toolkit deep learning model to files in an Azure Blob Storage account using the Spark Python API (PySpark).

This example uses the CIFAR-10 image set compiled and distributed by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset contains 60,000 32×32 color images belonging to 10 mutually exclusive classes:

[Image: CIFAR-10 images]

For more details on the dataset, see Alex Krizhevsky’s Learning Multiple Layers of Features from Tiny Images.

The dataset was partitioned into a training set of 50,000 images and a test set of 10,000 images. The first set was used to train a twenty-layer-deep convolutional residual network (ResNet) model using Microsoft Cognitive Toolkit by following a tutorial from the Cognitive Toolkit GitHub repository. The remaining 10,000 images were used for testing the model’s accuracy. This is where distributed computing comes into play: the task of pre-processing and scoring the images is highly parallelizable. With the saved trained model in hand, we used:

  • PySpark to distribute the images and trained model to the cluster’s worker nodes.
  • Python to pre-process the images on each node of the HDInsight Spark cluster.
  • Cognitive Toolkit to load the model and score the pre-processed images on each node.
  • Jupyter Notebooks to run the PySpark script and aggregate the results, and Matplotlib to visualize the model performance.
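The pattern in the list above (distribute the images, load the model once per worker, pre-process and score locally, then aggregate) can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not the notebook's code: `load_model`, `preprocess`, the toy "bright/dark" classifier, and the sample `images` are all hypothetical, and a thread pool plays the role of the worker nodes. In PySpark, the per-partition function would typically be passed to something like `RDD.mapPartitions`.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model():
    """Hypothetical stand-in for loading the saved trained model on a worker."""
    # Toy "model": label an image by its mean pixel value.
    return lambda pixels: "bright" if sum(pixels) / len(pixels) >= 128 else "dark"


def preprocess(pixels):
    """Clamp raw pixel values into the 0-255 range."""
    return [min(255, max(0, p)) for p in pixels]


def score_partition(partition):
    """Load the model once per partition, then score every image in it."""
    model = load_model()
    return [(name, model(preprocess(pixels))) for name, pixels in partition]


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]


# Made-up (name, pixels) pairs standing in for images in blob storage.
images = [
    ("img0", [10, 20, 30]),
    ("img1", [200, 220, 300]),
    ("img2", [-5, 0, 40]),
    ("img3", [128, 128, 128]),
]

partitions = split_into_partitions(images, num_partitions=2)
with ThreadPoolExecutor(max_workers=2) as pool:
    # Each worker scores one partition; the results are flattened afterwards.
    scored = [pair for part in pool.map(score_partition, partitions) for pair in part]

results = dict(scored)
print(results)
```

Loading the model once per partition rather than once per image is the design choice that matters here: model loading is usually far more expensive than scoring a single small image.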

The entire pre-processing and scoring of the 10,000 images takes less than one minute on a cluster with 4 worker nodes. The model accurately predicts the labels of about 9,100 (91%) of the images. A confusion matrix illustrates the most common classification errors. For example, the matrix shows that mislabeling dogs as cats and vice versa occurs more frequently than for other label pairs.

[Image: results (confusion matrix)]
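As a rough illustration of how accuracy and the most frequent misclassifications can be derived from scored labels, the sketch below tallies (true label, predicted label) pairs in plain Python. The helper names and the eight sample labels are invented for the example; they are not the scenario's actual results.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(counts):
    """Fraction of predictions where the predicted label equals the true label."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to evaluate")
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / total


def most_common_errors(counts, top=3):
    """Return the most frequent off-diagonal (true, predicted) pairs."""
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(top)


# Made-up labels for illustration only.
true_labels = ["cat", "dog", "dog", "cat", "ship", "cat", "dog", "ship"]
predicted = ["cat", "cat", "dog", "dog", "ship", "cat", "cat", "ship"]

counts = confusion_counts(true_labels, predicted)
print(accuracy(counts))
print(most_common_errors(counts, top=1))
```

The diagonal entries (true label equals predicted label) give the accuracy, while the largest off-diagonal entries point to the label pairs the model confuses most often.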

Try it out!

Follow this tutorial to implement this solution end-to-end: set up an HDInsight Spark cluster, install Cognitive Toolkit, and run the Jupyter Notebook that scores 10,000 CIFAR images.

Next steps

Spark and MLlib

Deep Learning, Cognitive Toolkit, and others