Use Apache Zeppelin notebooks with Apache Spark cluster on Azure HDInsight

HDInsight Spark clusters include Apache Zeppelin notebooks. Use the notebooks to run Apache Spark jobs. In this article, you learn how to use the Zeppelin notebook on an HDInsight cluster.

Prerequisites

- An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.
- The URI scheme for your cluster's primary storage. The scheme would be `wasb://` for Azure Blob Storage, or `abfs://` for Azure Data Lake Storage Gen2. If secure transfer is enabled for Blob Storage, the URI would be `wasbs://`. For more information, see Require secure transfer in Azure Storage. An example URI follows this list.
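For illustration, a fully qualified secure-transfer URI might look like the following sketch. The container and account names are hypothetical, and the endpoint shown is for the Azure China cloud (other clouds use different endpoints):

```
wasbs://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/HdiSamples/SensorSampleData/hvac
```

When a path targets the cluster's default container, the account portion can be omitted, as in the `wasbs:///HdiSamples/...` path used in the sample code later in this article.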
Launch an Apache Zeppelin notebook

From the Spark cluster Overview, select Zeppelin notebook from Cluster dashboards. Enter the admin credentials for the cluster.

Note

You may also reach the Zeppelin Notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:

https://CLUSTERNAME.azurehdinsight.cn/zeppelin

Create a new notebook. From the header pane, navigate to Notebook > Create new note.

Enter a name for the notebook, then select Create Note.

Ensure the notebook header shows a connected status. It's denoted by a green dot in the top-right corner.
Load sample data into a temporary table. When you create a Spark cluster in HDInsight, the sample data file `hvac.csv` is copied to the associated storage account under `\HdiSamples\SensorSampleData\hvac`.

In the empty paragraph that is created by default in the new notebook, paste the following snippet.

```scala
%livy2.spark
//The above magic instructs Zeppelin to use the Livy Scala interpreter

// Create an RDD using the default Spark context, sc
val hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

// Define a schema
case class Hvac(date: String, time: String, targettemp: Integer, actualtemp: Integer, buildingID: String)

// Map the values in the .csv file to the schema
val hvac = hvacText.map(s => s.split(",")).filter(s => s(0) != "Date").map(
    s => Hvac(s(0), s(1), s(2).toInt, s(3).toInt, s(6))
).toDF()

// Register as a temporary table called "hvac"
hvac.registerTempTable("hvac")
```
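To verify that the table was registered, you can run a quick check from the same interpreter. This is a minimal sketch that assumes the paragraph above has already run in the current Livy session, which exposes a pre-created `sqlContext`:

```scala
%livy2.spark
// Count the rows visible through the temporary table
sqlContext.sql("SELECT COUNT(*) FROM hvac").show()

// Inspect the schema derived from the Hvac case class
hvac.printSchema()
```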
Press SHIFT + ENTER or select the Play button for the paragraph to run the snippet. The status in the right-hand corner of the paragraph should progress from READY, PENDING, RUNNING to FINISHED. The output shows up at the bottom of the same paragraph. The screenshot looks like the following image:

You can also provide a title to each paragraph. From the right-hand corner of the paragraph, select the Settings icon (sprocket), and then select Show title.

Note

The %spark2 interpreter isn't supported in Zeppelin notebooks across all HDInsight versions, and the %sh interpreter isn't supported from HDInsight 4.0 onwards.
You can now run Spark SQL statements on the `hvac` table. Paste the following query in a new paragraph. The query retrieves the building ID, and the difference between the target and actual temperatures for each building on a given date. Press SHIFT + ENTER.

```sql
%sql
select buildingID, (targettemp - actualtemp) as temp_diff, date from hvac where date = "6/1/13"
```
The %sql statement at the beginning tells the notebook to use the Livy SQL interpreter.
Select the Bar Chart icon to change the display. Settings, which appear after you select Bar Chart, let you choose Keys and Values. The following screenshot shows the output.
You can also run Spark SQL statements using variables in the query. The next snippet shows how to define a variable, `Temp`, in the query with the possible values you want to query with. When you first run the query, a drop-down is automatically populated with the values you specified for the variable.

```sql
%sql
select buildingID, date, targettemp, (targettemp - actualtemp) as temp_diff from hvac where targettemp > "${Temp = 65,65|75|85}"
```

Paste this snippet in a new paragraph and press SHIFT + ENTER. Then select 65 from the Temp drop-down list.
Select the Bar Chart icon to change the display. Then select settings and make the following changes:

- Groups: Add targettemp.
- Values:
  1. Remove date.
  2. Add temp_diff.
  3. Change the aggregator from SUM to AVG.

The following screenshot shows the output.
How do I use external packages with the notebook?

Zeppelin notebook in Apache Spark cluster on HDInsight can use external, community-contributed packages that aren't included in the cluster. Search the Maven repository for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, a complete list of community-contributed packages is available at Spark Packages.
In this article, you'll see how to use the spark-csv package with the Zeppelin notebook.
Open interpreter settings. From the top-right corner, select the logged in user name, then select Interpreter.

Scroll to livy2, then select edit.
Navigate to the key `livy.spark.jars.packages`, and set its value in the format `group:id:version`. So, if you want to use the spark-csv package, you must set the value of the key to `com.databricks:spark-csv_2.10:1.4.0`.

Select Save and then OK to restart the Livy interpreter.
If you want to understand how to arrive at the value of the key entered above, here's how.

a. Locate the package in the Maven Repository. For this article, we used spark-csv.

b. From the repository, gather the values for GroupId, ArtifactId, and Version.

c. Concatenate the three values, separated by a colon (:).

```
com.databricks:spark-csv_2.10:1.4.0
```
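Once the interpreter restarts with the package on its classpath, you can call it from a notebook paragraph. The following is an illustrative sketch rather than part of the original walkthrough; the file path is hypothetical, and it uses the documented spark-csv read pattern:

```scala
%livy2.spark
// Read a CSV file through the spark-csv data source added via livy.spark.jars.packages.
// Replace the path with a CSV file in your cluster's storage.
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")      // treat the first row as column names
    .option("inferSchema", "true") // infer column types from the data
    .load("wasbs:///example/data/sample.csv")

df.show()
```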
Where are the Zeppelin notebooks saved?

The Zeppelin notebooks are saved to the cluster headnodes. So, if you delete the cluster, the notebooks are deleted as well. If you want to preserve your notebooks for later use on other clusters, you must export them after you've finished running the jobs. To export a notebook, select the Export icon as shown in the image below.

This action saves the notebook as a JSON file in your download location.
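As a rough sketch of what to expect, an exported note is a single JSON document carrying the note's name and its paragraphs, along the lines of the following; the exact fields vary by Zeppelin version, so treat this as illustrative only:

```json
{
  "paragraphs": [
    {
      "text": "%livy2.spark\nval hvacText = sc.textFile(\"...\")",
      "status": "FINISHED"
    }
  ],
  "name": "my-hvac-note",
  "id": "2ABCDEFGH"
}
```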
Use Shiro to Configure Access to Zeppelin Interpreters in Enterprise Security Package (ESP) Clusters
As noted above, the `%sh` interpreter isn't supported from HDInsight 4.0 onwards. Furthermore, since the `%sh` interpreter introduces potential security issues, such as accessing keytabs using shell commands, it has been removed from HDInsight 3.6 ESP clusters as well. It means the `%sh` interpreter isn't available when clicking Create new note or in the Interpreter UI by default.
Privileged domain users can use the `Shiro.ini` file to control access to the Interpreter UI. Only these users can create new `%sh` interpreters and set permissions on each new `%sh` interpreter. To control access using the `shiro.ini` file, use the following steps:
Define a new role using an existing domain group name. In the following example, `adminGroupName` is a group of privileged users in AAD. Don't use special characters or white spaces in the group name. The characters after `=` give the permissions for this role. `*` means the group has full permissions.

```
[roles]
adminGroupName = *
```
Add the new role for access to Zeppelin interpreters. In the following example, all users in `adminGroupName` are given access to Zeppelin interpreters and can create new interpreters. You can put multiple roles between the brackets in `roles[]`, separated by commas. Then, users that have the necessary permissions can access Zeppelin interpreters.

```
[urls]
/api/interpreter/** = authc, roles[adminGroupName]
```
Livy session management

The first code paragraph in your Zeppelin notebook creates a new Livy session in your cluster. This session is shared across all Zeppelin notebooks that you later create. If the Livy session is killed for any reason, jobs won't run from the Zeppelin notebook.

In such a case, you must do the following steps before you can start running jobs from a Zeppelin notebook.

Restart the Livy interpreter from the Zeppelin notebook. To do so, open interpreter settings by selecting the logged in user name from the top-right corner, then select Interpreter.

Scroll to livy2, then select restart.

Run a code cell from an existing Zeppelin notebook. This code creates a new Livy session in the HDInsight cluster.
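Any lightweight paragraph is enough to trigger the new session. A minimal sketch:

```scala
%livy2.spark
// Running any statement after the interpreter restart makes Zeppelin open a fresh Livy session.
println(s"Spark version: ${sc.version}")
```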
General information

Validate service

To validate the service from Ambari, navigate to `https://CLUSTERNAME.azurehdinsight.cn/#/main/services/ZEPPELIN/summary` where CLUSTERNAME is the name of your cluster.
To validate the service from a command line, SSH to the head node. Switch user to zeppelin using the command `sudo su zeppelin`. Status commands:
| Command | Description |
|---|---|
| `/usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh status` | Service status. |
| `/usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh --version` | Service version. |
| `ps -aux \| grep zeppelin` | Identify PID. |
Log locations

| Service | Path |
|---|---|
| zeppelin-server | /usr/hdp/current/zeppelin-server/ |
| Server logs | /var/log/zeppelin |
| Configuration (interpreter, Shiro, site.xml, log4j) | /usr/hdp/current/zeppelin-server/conf or /etc/zeppelin/conf |
| PID directory | /var/run/zeppelin |
Enable debug logging

1. Navigate to `https://CLUSTERNAME.azurehdinsight.cn/#/main/services/ZEPPELIN/summary` where CLUSTERNAME is the name of your cluster.
2. Navigate to CONFIGS > Advanced zeppelin-log4j-properties > log4j_properties_content.
3. Modify `log4j.appender.dailyfile.Threshold = INFO` to `log4j.appender.dailyfile.Threshold = DEBUG`.
4. Add `log4j.logger.org.apache.zeppelin.realm=DEBUG`.
5. Save changes and restart the service.
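After steps 3 and 4, the affected lines in log4j_properties_content read as follows:

```
log4j.appender.dailyfile.Threshold = DEBUG
log4j.logger.org.apache.zeppelin.realm=DEBUG
```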
Next steps

- Overview: Apache Spark on Azure HDInsight
- Kernels available for Jupyter notebook in Apache Spark cluster for HDInsight
- Install Jupyter on your computer and connect to an HDInsight Spark cluster