HDInsight Hadoop data science walkthroughs using Hive on Azure

These walkthroughs use Hive with an HDInsight Hadoop cluster to do predictive analytics. They follow the steps outlined in the Team Data Science Process. For an overview of the Team Data Science Process, see Data Science Process. For an introduction to Azure HDInsight, see Introduction to Azure HDInsight, the Hadoop technology stack, and Hadoop clusters.

Additional data science walkthroughs that execute the Team Data Science Process are grouped by the platform that they use. For an itemized list of these examples, see Walkthroughs executing the Team Data Science Process.

Predict taxi tips using Hive with HDInsight Hadoop

The Use HDInsight Hadoop clusters walkthrough uses data from New York taxis to predict:

  • Whether a tip is paid
  • The distribution of tip amounts

The scenario is implemented using Hive with an Azure HDInsight Hadoop cluster. You learn how to store, explore, and engineer features from a publicly available NYC taxi trip and fare dataset. You also use Azure Machine Learning to build and deploy the models.
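The two prediction targets can be illustrated with a minimal Python sketch. It assumes a binary label for whether any tip was paid and a small set of tip-amount ranges. The function names and the bin edges below are hypothetical, chosen only for illustration; the walkthrough itself defines its targets in Hive and may use different ranges.

```python
from bisect import bisect_left

# Hypothetical upper edges (in dollars) of the tip ranges.
# These values are an assumption for illustration, not taken from the walkthrough.
TIP_BIN_EDGES = [0.0, 5.0, 10.0, 20.0]


def tipped(tip_amount: float) -> int:
    """Binary target: 1 if any tip was paid, otherwise 0."""
    return 1 if tip_amount > 0 else 0


def tip_class(tip_amount: float) -> int:
    """Multiclass target: index of the tip range the amount falls into.

    Class 0 means no tip, class 1 means a tip up to and including the
    second edge, and so on. The last class covers everything above the
    final edge.
    """
    # bisect_left counts how many edges are strictly below tip_amount.
    return bisect_left(TIP_BIN_EDGES, tip_amount)


if __name__ == "__main__":
    for amount in (0.0, 2.5, 7.0, 25.0):
        print(amount, tipped(amount), tip_class(amount))
```

The binary label suits a classification model for whether a tip is paid, while the class index gives a coarse view of the distribution of tip amounts.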

Predict advertisement clicks using Hive with HDInsight Hadoop

The Use Azure HDInsight Hadoop Clusters on a 1-TB dataset walkthrough uses a publicly available Criteo click dataset to predict whether a user clicks on an advertisement. The scenario is implemented using Hive with an Azure HDInsight Hadoop cluster to store, explore, feature engineer, and down-sample the data. It uses Azure Machine Learning to build, train, and score a binary classification model that predicts whether a user clicks on an advertisement. The walkthrough concludes by showing how to publish one of these models as a web service.
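Down-sampling a dataset of this size means keeping only a random fraction of the rows so that later modeling steps work on a manageable subset. The following is a minimal Python sketch of that idea, assuming simple uniform random sampling. The `down_sample` function and its parameters are hypothetical; the walkthrough performs this step in Hive on the cluster, not in local Python.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def down_sample(rows: Iterable[T], fraction: float, seed: int = 0) -> List[T]:
    """Keep each row independently with probability `fraction`.

    A fixed seed makes the sample reproducible across runs.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() returns a float in [0.0, 1.0), so fraction=1.0 keeps every row
    # and fraction=0.0 keeps none.
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    sample = down_sample(range(10_000), fraction=0.01, seed=42)
    print(len(sample))
```

Because each row is kept independently, the sample size is only approximately `fraction` times the input size; a sketch that needs an exact count would use a different method, such as `random.sample`.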

Next steps

For a discussion of the key components that make up the Team Data Science Process, see Team Data Science Process overview.

For a discussion of the Team Data Science Process lifecycle that you can use to structure your data science projects, see Team Data Science Process lifecycle. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.