教程:使用 Power BI 在 HDInsight 中分析 Apache Spark 数据Tutorial: Analyze Apache Spark data using Power BI in HDInsight

在本教程中,了解如何使用 Microsoft Power BIAzure HDInsight 中可视化 Apache Spark 群集中的数据。In this tutorial, you learn how to use Microsoft Power BI to visualize data in an Apache Spark cluster in Azure HDInsight.

本教程介绍如何执行下列操作:In this tutorial, you learn how to:

  • 使用 Power BI 可视化 Spark 数据Visualize Spark data using Power BI

如果没有 Azure 订阅,可以在开始前创建一个试用帐户If you don't have an Azure subscription, create a trial account before you begin.


验证数据Verify the data

上一教程中创建的 Jupyter Notebook 包含用于创建 hvac 表的代码。The Jupyter Notebook that you created in the previous tutorial includes code to create an hvac table. 此表基于在所有 HDInsight Spark 群集上均可用的 CSV 文件(位于 \HdiSamples\HdiSamples\SensorSampleData\hvac\hvac.csv)。This table is based on the CSV file available on all HDInsight Spark clusters at \HdiSamples\HdiSamples\SensorSampleData\hvac\hvac.csv. 使用以下过程验证数据。Use the following procedure to verify the data.

  1. 从 Jupyter 笔记本中,粘贴以下代码,然后按 Shift+Enter 。From the Jupyter notebook, paste the following code, and then press SHIFT + ENTER. 该代码验证表是否存在。The code verifies the existence of the tables.


    输出如下所示:The output looks like:

    在 Spark 中显示表

    如果在开始本教程之前关闭笔记本,这会清除 hvactemptable,使其不包含在输出中。If you closed the notebook before starting this tutorial, hvactemptable is cleaned up, so it's not included in the output. 只有元存储中存储的 Hive 表(由“isTemporary”列下的“False”表示)可从 BI 工具中进行访问 。Only Hive tables that are stored in the metastore (indicated by False under the isTemporary column) can be accessed from the BI tools. 在本教程中,将连接到创建的 hvac 表 。In this tutorial, you connect to the hvac table that you created.

  2. 将以下代码粘贴到空白单元格中,然后按 Shift+Enter 。Paste the following code in an empty cell, and then press SHIFT + ENTER. 该代码验证表中的数据。The code verifies the data in the table.

    SELECT * FROM hvac LIMIT 10

    输出如下所示:The output looks like:

    在 Spark 中显示 hvac 表中的行

  3. 请在 Notebook 的“文件”菜单中选择“关闭并停止” 。From the File menu on the notebook, select Close and Halt. 关闭 Notebook 以释放资源。Shut down the notebook to release the resources.

可视化数据Visualize the data

在本部分,将使用 Power BI 从 Spark 群集数据创建可视化效果、报表和仪表板。In this section, you use Power BI to create visualizations, reports, and dashboards from the Spark cluster data.

在 Power BI Desktop 中创建报表Create a report in Power BI Desktop

使用 Spark 的前几步是连接到 Power BI Desktop 中的集群,从集群中加载数据,并根据此数据创建基本的可视化效果。The first steps in working with Spark are to connect to the cluster in Power BI Desktop, load data from the cluster, and create a basic visualization based on that data.


本文中所示的连接器当前处于预览状态。The connector demonstrated in this article is currently in preview. 通过 Power BI 社区站点或 Power BI Ideas(Power BI 想法)提供反馈。Provide any feedback you have through the Power BI Community site or Power BI Ideas.

  1. 打开 Power BI Desktop。Open Power BI Desktop. 关闭启动初始屏幕(如果已打开)。Close the start up splash screen if it opens.

  2. 从“主页” 选项卡,导航到“获取数据” > “更多...” 。From the Home tab, navigate to Get Data > More...

    从 HDInsight Apache Spark 将数据获取到 Power BI DesktopGet data into Power BI Desktop from HDInsight Apache Spark

  3. 在搜索框中输入 Spark,选择 Azure HDInsight Spark,然后选择“连接” 。Enter Spark in the search box, select Azure HDInsight Spark, and then select Connect.

    从 Apache Spark BI 将数据获取到 Power BIGet data into Power BI from Apache Spark BI

  4. 在“服务器” 文本框中输入群集 URL(形式为 mysparkcluster.azurehdinsight.cn)。Enter your cluster URL (in the form mysparkcluster.azurehdinsight.cn) in the Server text box.

  5. 在“数据连接模式” 下,择“DirectQuery” 。Under Data connectivity mode, select DirectQuery. 然后选择“确定”。 Then select OK.

    Spark 可使用任一数据连接模式。You can use either data connectivity mode with Spark. 若使用 DirectQuery,则报告中会显示更改,但不刷新整个数据集。If you use DirectQuery, changes are reflected in reports without refreshing the entire dataset. 如果导入数据,则必须刷新数据集以查看更改。If you import data, you must refresh the data set to see changes. 若要深入了解如何以及何时使用 DirectQuery,请参阅在 Power BI 中使用 DirectQueryFor more information on how and when to use DirectQuery, see Using DirectQuery in Power BI.

  6. 输入 HDInsight 登录帐户信息,然后选择“连接” 。Enter the HDInsight login account information, then select Connect. 默认帐户名为 admin 。The default account name is admin.

  7. 选择 hvac 表,稍后即可看到数据预览,然后选择“加载” 。Select the hvac table, wait to see a preview of the data, and then select Load.

    Spark 群集用户名和密码Spark cluster user name and password

    Power BI Desktop 拥有连接到 Spark 群集和从 hvac 表中加载数据所需的信息。Power BI Desktop has the information it needs to connect to the Spark cluster and load data from the hvac table. 该表及表中各列显示在“字段”窗格中 。The table and its columns are displayed in the Fields pane.

  8. 可视化每栋建筑物的目标温度与实际温度之间的差异:Visualize the variance between target temperature and actual temperature for each building:

    1. 在“可视化”窗格中,选择“分区图” 。In the VISUALIZATIONS pane, select Area Chart.

    2. 将“BuildingID”字段拖到“轴”,并将“ActualTemp”和“TargetTemp”字段拖至“值” 。Drag the BuildingID field to Axis, and drag the ActualTemp and TargetTemp fields to Value.

      添加值列add value columns

      关系图如下所示:The diagram looks like:

      面积图总和area graph sum

      默认情况下,可视化效果效果会显示 ActualTempTargetTemp 的总和。By default the visualization shows the sum for ActualTemp and TargetTemp. 选择“可视化效果”窗格中 ActualTemp 和 TragetTemp 旁边的向下键,可看到已选中“总和” 。Select the down arrow next to ActualTemp and TragetTemp in the Visualizations pane, you can see Sum is selected.

    3. 选择“可视化效果”窗格中 ActualTemp 和 TragetTemp 旁边的向下键,选择“平均值”以获得每栋建筑物的实际温度与目标温度的平均值 。Select the down arrows next to ActualTemp and TragetTemp in the Visualizations pane, select Average to get an average of actual and target temperatures for each building.

      值的平均值average of values

      数据可视化效果应与屏幕截图中类似。Your data visualization shall be similar to the one in the screenshot. 在视觉效果上移动光标可获取相关数据的工具提示。Move your cursor over the visualization to get tool tips with relevant data.

      面积图area graph

  9. 导航到“文件” > “保存” ,为文件输入名称 BuildingTemperature,然后选择“保存” 。Navigate to File > Save, enter the name BuildingTemperature for the file, then select Save.

将报表发布到 Power BI 服务(可选)Publish the report to the Power BI Service (optional)

Power BI 服务允许在整个组织中共享报表和仪表板。The Power BI service allows you to share reports and dashboards across your organization. 在本部分,首先发布数据集和报表。In this section, you first publish the dataset and the report. 然后,将报表固定到仪表板。Then, you pin the report to a dashboard. 仪表板通常用于重点处理报表中的数据子集;此报表中只有一个可视化效果,但仍可用于完成此步骤。Dashboards are typically used to focus on a subset of data in a report; you have only one visualization in your report, but it's still useful to go through the steps.

  1. 打开 Power BI Desktop。Open Power BI Desktop.

  2. 在“开始”选项卡上,单击“发布” 。From the Home tab, click Publish.

    从 Power BI Desktop 发布Publish from Power BI Desktop

  3. 选择要将数据集发布和报告到的工作区,然后单击“选择” 。Select a workspace to publish your dataset and report to, then click Select. 在下图中,默认选择“我的工作区” 。In the following image, the default My Workspace is selected.

    选择要将数据集和报告发布到的工作区Select workspace to publish dataset and report to

  4. 发布成功后,单击“在 Power BI 中打开 BuildingTemperature.pbix” 。After the publishing is succeeded, click Open 'BuildingTemperature.pbix' in Power BI.

    发布成功,单击以输入凭据Publish success, click to enter credentials

  5. 在 Power BI 服务中,单击“输入凭据” 。In the Power BI service, click Enter credentials.

    在 Power BI 服务中输入凭据Enter credentials in Power BI service

  6. 单击“编辑凭据” 。Click Edit credentials.

    在 Power BI 服务中编辑凭据Edit credentials in Power BI service

  7. 输入 HDInsight 登录帐户信息,然后单击“登录” 。Enter the HDInsight login account information, and then click Sign in. 默认帐户名为 admin 。The default account name is admin.

    登录到 Spark 群集Sign in to Spark cluster

  8. 在左窗格中,转到“工作区” > “我的工作区” > “报表”,然后单击 BuildingTemperature 。In the left pane, go to Workspaces > My Workspace > REPORTS, then click BuildingTemperature.

    在左窗格中的报告下列出的报告Report listed under reports in left pane

    还可在左窗格中的“数据集”下找到 BuildingTemperature 。You should also see BuildingTemperature listed under DATASETS in the left pane.

    现可在 Power BI 服务中使用 Power BI Desktop 中创建的视觉对象。The visual you created in Power BI Desktop is now available in the Power BI service.

  9. 将游标悬停在可视化效果上,然后单击右上角的固定图标。Hover your cursor over the visualization, and then click the pin icon on the upper right corner.

    Power BI 服务中的报告Report in the Power BI service

  10. 选择“新的仪表板”,输入名称 Building temperature,再单击“固定” 。Select "New dashboard", enter the name Building temperature, then click Pin.

    固定到新的仪表板Pin to new dashboard

  11. 在报表中,单击“转到仪表板” 。In the report, click Go to dashboard.

视觉对象已固定到仪表板 - 可在报表中添加其他视觉对象,并将其固定在同一仪表板上。Your visual is pinned to the dashboard - you can add other visuals to the report and pin them to the same dashboard. 有关报表和仪表板的详细信息,请参阅 Power BI 中的报表Power BI 中的仪表板For more information about reports and dashboards, see Reports in Power BI and Dashboards in Power BI.

清理资源Clean up resources

完成本教程后,可以删除群集。After you complete the tutorial, you may want to delete the cluster. 有了 HDInsight,便可以将数据存储在 Azure 存储中,因此可以在群集不用时安全地删除群集。With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. 此外,还需要支付 HDInsight 群集费用,即使未使用。You are also charged for an HDInsight cluster, even when it is not in use. 由于群集费用高于存储空间费用数倍,因此在不使用群集时将其删除可以节省费用。Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use.

若要删除群集,请参阅使用浏览器、PowerShell 或 Azure CLI 删除 HDInsight 群集To delete a cluster, see Delete an HDInsight cluster using your browser, PowerShell, or the Azure CLI.

后续步骤Next steps

在本教程中,了解了如何使用 Microsoft Power BI 在 Azure HDInsight 中可视化 Apache Spark 群集中的数据。In this tutorial, you learned how to use Microsoft Power BI to visualize data in an Apache Spark cluster in Azure HDInsight. 请前进到下一篇文章,了解如何将在 Spark 中注册的数据拉取到 Power BI 等 BI 分析工具中。Advance to the next article to see how the data you registered in Spark can be pulled into a BI analytics tool such as Power BI.