Known issues for Apache Spark cluster on HDInsight

This document keeps track of all the known issues for the HDInsight Spark public preview.

Apache Livy leaks interactive session

When Apache Livy restarts (from Apache Ambari or because of a headnode 0 virtual machine reboot) with an interactive session still alive, an interactive job session is leaked. As a result, new jobs can be stuck in the Accepted state.

Mitigation:

Use the following procedure to work around the issue:

  1. SSH into the headnode. For information, see Use SSH with HDInsight.

  2. Run the following command to find the application IDs of the interactive jobs started through Livy.

     yarn application -list
    

    The default job name is Livy if the jobs were started with a Livy interactive session with no explicit names specified. For Livy sessions started by Jupyter Notebook, the job name starts with remotesparkmagics_*.

  3. Run the following command to kill those jobs.

     yarn application -kill <Application ID>
    

New jobs start running.
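
If several sessions have leaked, the two steps can be combined. The following is a minimal shell sketch, not a tested recipe: it assumes the default job names described above (Livy or remotesparkmagics_*), so review the output of the -list step before killing anything.

     # Find YARN applications whose names look like Livy sessions and kill them.
     # The grep pattern is an assumption based on the default job names above.
     yarn application -list 2>/dev/null \
       | grep -iE 'livy|remotesparkmagics' \
       | awk '{print $1}' \
       | while read appId; do yarn application -kill "$appId"; done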

Spark History Server not started

Spark History Server is not started automatically after a cluster is created.

Mitigation:

Manually start the history server from Ambari.
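
If you would rather script this than click through the Ambari UI, the Ambari REST API can move a service to the STARTED state. The following is a minimal sketch only: CLUSTERNAME and the admin password are placeholders, and the assumption that Spark is registered in Ambari under the service name SPARK should be verified against your cluster.

     # Ask Ambari to bring the Spark service (including its history server) to STARTED.
     # CLUSTERNAME, the admin password, and the SPARK service name are assumptions.
     curl -u admin:PASSWORD -H "X-Requested-By: ambari" -X PUT \
       -d '{"RequestInfo":{"context":"Start Spark History Server"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
       "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME/services/SPARK"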

Permission issue in Spark log directory

hdiuser gets the following error when submitting a job using spark-submit:

java.io.FileNotFoundException: /var/log/spark/sparkdriver_hdiuser.log (Permission denied)

And no driver log is written.

Mitigation:

  1. Add hdiuser to the Hadoop group.
  2. Provide 777 permissions on /var/log/spark after cluster creation (steps 1 and 2 are sketched after this list).
  3. Update the Spark log location using Ambari to be a directory with 777 permissions.
  4. Run spark-submit as sudo.
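
As a sketch of mitigations 1 and 2, run on the headnode over SSH. This assumes the Hadoop group is literally named hadoop and that the default log path is in use; adjust both for your cluster.

     # Mitigation 1: add hdiuser to the Hadoop group (group name is an assumption).
     sudo usermod -aG hadoop hdiuser
     # Mitigation 2: open up the Spark log directory so the driver log can be written.
     sudo chmod 777 /var/log/spark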

Spark-Phoenix connector is not supported

HDInsight Spark clusters do not support the Spark-Phoenix connector.

Mitigation:

You must use the Spark-HBase connector instead. For the instructions, see How to use Spark-HBase connector.

The following are some known issues related to Jupyter notebooks.

Notebooks with non-ASCII characters in filenames

Do not use non-ASCII characters in Jupyter notebook filenames. If you try to upload a file with a non-ASCII filename through the Jupyter UI, the upload fails without any error message. Jupyter does not let you upload the file, but it does not throw a visible error either.

Error while loading notebooks of larger sizes

You might see the error "Error loading notebook" when you load notebooks that are larger in size.

Mitigation:

If you get this error, it does not mean your data is corrupt or lost. Your notebooks are still on disk in /var/lib/jupyter, and you can SSH into the cluster to access them. For information, see Use SSH with HDInsight.

Once you have connected to the cluster using SSH, you can copy your notebooks from the cluster to your local machine (using SCP or WinSCP) as a backup, so you don't lose any important data in them. You can then SSH tunnel into your headnode at port 8001 to reach Jupyter without going through the gateway (see the sketch below). From there, you can clear the output of your notebook and resave it to minimize its size.
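
For example, the backup and the tunnel might look like the following sketch, where sshuser, CLUSTERNAME, and the local backup path are placeholders for your own values.

     # Back up the notebooks from the headnode to the local machine.
     scp -r sshuser@CLUSTERNAME-ssh.azurehdinsight.net:/var/lib/jupyter ./jupyter-backup
     # Forward local port 8001 to Jupyter on the headnode, bypassing the gateway.
     ssh -L 8001:localhost:8001 sshuser@CLUSTERNAME-ssh.azurehdinsight.net

While the tunnel is open, browse to http://localhost:8001 to reach Jupyter directly.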

To prevent this error from happening in the future, follow these best practices:

  • It is important to keep the notebook size small. Any output from your Spark jobs that is sent back to Jupyter is persisted in the notebook. As a best practice with Jupyter in general, avoid running .collect() on large RDDs or dataframes; if you want to peek at an RDD's contents, consider running .take() or .sample() instead so that your output doesn't get too large (see the sketch after this list).
  • Also, when you save a notebook, clear all output cells to reduce the size.
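
The following is a minimal PySpark sketch of this practice. It assumes the notebook's PySpark kernel has already provided a SparkContext named sc, and the input path is a placeholder, not a file guaranteed to exist on your cluster.

     # Placeholder input path; substitute a real file on your cluster's storage.
     rdd = sc.textFile("wasb:///example/data/sample.log")

     # Avoid: .collect() pulls the whole RDD back into the notebook and bloats
     # the saved output.
     # rdd.collect()

     # Prefer: inspect just a few rows...
     rdd.take(10)
     # ...or a small random sample (without replacement, roughly 1% of rows).
     rdd.sample(False, 0.01).take(10)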

Notebook initial startup takes longer than expected

The first code statement in a Jupyter notebook that uses Spark magic can take more than a minute.

Explanation:

This happens because, when the first code cell runs, it initiates the session configuration in the background and sets up the Spark, SQL, and Hive contexts. The first statement runs only after these contexts are set, which gives the impression that the statement took a long time to complete.
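
Relatedly, if you want to control the session that this first cell creates, the Spark magic %%configure cell lets you set the Livy session properties before any code runs. A minimal sketch follows; the resource values are placeholders, not recommendations.

     %%configure -f
     {"executorMemory": "2G", "executorCores": 2, "numExecutors": 4}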

Jupyter notebook timeout in creating the session

When the Spark cluster is out of resources, the Spark and PySpark kernels in the Jupyter notebook time out trying to create the session.

Mitigations:

  1. Free up some resources in your Spark cluster by:

    • Stopping other Spark notebooks by going to the Close and Halt menu or clicking Shutdown in the notebook explorer.
    • Stopping other Spark applications from YARN.
  2. Restart the notebook you were trying to start. Enough resources should be available for you to create a session now.

See also

Scenarios

Create and run applications

Tools and extensions

Manage resources