External Apache Hive metastore

This article describes how to set up Azure Databricks clusters to connect to existing external Apache Hive metastores. It provides information about recommended metastore setup and cluster configuration requirements, followed by instructions for configuring clusters to connect to an external metastore. The following table summarizes which Hive metastore versions are supported in each version of Databricks Runtime.

| Databricks Runtime version | Hive 0.13 - 1.2.1 | Hive 2.1 | Hive 2.2 | Hive 2.3 | Hive 3.1.0 |
| --- | --- | --- | --- | --- | --- |
| 7.x | Yes | Yes | Yes | Yes | Yes |
| 6.x | Yes | Yes | Yes | Yes | Yes |
| 5.3 and above | Yes | Yes | Yes | Yes | Yes |
| 5.1 - 5.2 and 4.x | Yes | Yes | Yes | Yes | No |
| 3.x | Yes | Yes | No | No | No |

Important

  • SQL Server does not work as the underlying metastore database for Hive 2.0 and above; however, Azure SQL Database does work and is used as the example throughout this article.
  • You can use a Hive 1.2.0 or 1.2.1 metastore of an HDInsight cluster as an external metastore. See Use external metadata stores in Azure HDInsight.
  • If you use Azure Database for MySQL as an external metastore, you must change the value of the lower_case_table_names property from 1 (the default) to 2 in the server-side database configuration. For details, see Identifier Case Sensitivity.

Hive metastore setup

The metastore client running inside a cluster connects to your underlying metastore database directly using JDBC.

To test network connectivity from a cluster to the metastore, you can run the following command inside a notebook:

%sh
nc -vz <DNS name> <port>

where

  • <DNS name> is the server name of Azure SQL Database.
  • <port> is the port of the database.
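
For illustration, a typical invocation might look like the following; the server name is a hypothetical placeholder, and 1433 is the default port for Azure SQL Database:

%sh
nc -vz myserver.database.windows.net 1433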

Cluster configurations

You must set two sets of configuration options to connect a cluster to an external metastore:

  • Spark options configure Spark with the Hive metastore version and the JARs for the metastore client.
  • Hive options configure the metastore client to connect to the external metastore.

Spark configuration options

Set spark.sql.hive.metastore.version to the version of your Hive metastore and spark.sql.hive.metastore.jars as follows:

  • Hive 0.13: do not set spark.sql.hive.metastore.jars.

  • Hive 1.2.0 or 1.2.1 (Databricks Runtime versions below 7.0): set spark.sql.hive.metastore.jars to builtin.

    Note

    Hive 1.2.0 and 1.2.1 are not the built-in metastore on Databricks Runtime 7.0 and above. If you want to use Hive 1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in Download the metastore jars and point to them.

  • Hive 2.3 (Databricks Runtime 7.0 and above): set spark.sql.hive.metastore.jars to builtin (a sketch follows this list).

  • For all other Hive versions, Azure Databricks recommends that you download the metastore JARs and set the configuration spark.sql.hive.metastore.jars to point to the downloaded JARs using the procedure described in Download the metastore jars and point to them.
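
For example, a minimal sketch of these two options for the built-in Hive 2.3 client on Databricks Runtime 7.x might look like the following (2.3.7 is an illustrative patch version; use the version that matches your metastore):

spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin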

Download the metastore jars and point to them

  1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to match the version of your metastore.

  2. When the cluster is running, search the driver log and find a line like the following:

    17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
    

    The directory <path> is the location of the downloaded JARs on the driver node of the cluster.

    Alternatively, you can run the following code in a Scala notebook to print the location of the JARs:

    import com.typesafe.config.ConfigFactory
    // The metastore JARs are downloaded under the JVM's temporary directory on the driver.
    val path = ConfigFactory.load().getString("java.io.tmpdir")

    println(s"\nHive JARs are downloaded to the path: $path \n")
    
  3. Run %sh cp -r <path> /dbfs/hive_metastore_jar (replacing <path> with your cluster’s info) to copy this directory to a directory in DBFS called hive_metastore_jar through the FUSE client in the driver node.

  4. Create an init script that copies /dbfs/hive_metastore_jar to the local filesystem of the node, making sure the init script sleeps a few seconds before it accesses the DBFS FUSE client. This ensures that the client is ready. A sketch of such a script follows this list.

  5. Set spark.sql.hive.metastore.jars to use this directory. If your init script copies /dbfs/hive_metastore_jar to /databricks/hive_metastore_jars/, set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/*. The location must include the trailing /*.

  6. Restart the cluster.
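
A minimal sketch of the init script described in step 4, assuming the target directory /databricks/hive_metastore_jars/ and a 10-second sleep (both values are illustrative choices, not requirements):

#!/bin/bash
# Sleep briefly so the DBFS FUSE client is ready before reading from /dbfs.
sleep 10
# Copy the metastore JARs from DBFS to the local filesystem of the node.
mkdir -p /databricks/hive_metastore_jars
cp -r /dbfs/hive_metastore_jar/* /databricks/hive_metastore_jars/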

Hive configuration options

This section describes options specific to Hive.

To connect to an external metastore using local mode, set the following Hive configuration options:

# JDBC connect string for a JDBC metastore
javax.jdo.option.ConnectionURL <mssql-connection-string>

# Username to use against metastore database
javax.jdo.option.ConnectionUserName <mssql-username>

# Password to use against metastore database
javax.jdo.option.ConnectionPassword <mssql-password>

# Driver class name for a JDBC metastore
javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver

where

  • <mssql-connection-string> is the JDBC connection string (which you can get in the Azure portal). You do not need to include the username and password in the connection string, because these are set by javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword. An illustrative connection string follows this list.
  • <mssql-username> and <mssql-password> specify the username and password of your Azure SQL Database account that has read/write access to the database.
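
As an illustration, a JDBC connection string for Azure SQL Database generally follows the shape below; the server and database names are hypothetical placeholders:

javax.jdo.option.ConnectionURL jdbc:sqlserver://myserver.database.windows.net:1433;database=mymetastoredb;encrypt=true;loginTimeout=30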

Note

For production environments, we recommend that you set hive.metastore.schema.verification to true. This prevents the Hive metastore client from implicitly modifying the metastore database schema when the metastore client version does not match the metastore database version. When enabling this setting for metastore client versions lower than Hive 1.2.0, make sure that the metastore client has write permission on the metastore database (to prevent the issue described in HIVE-9749).

  • For Hive metastore 1.2.0 and higher, set hive.metastore.schema.verification.record.version to true to enable hive.metastore.schema.verification.
  • For Hive metastore 2.1.1 and higher, set hive.metastore.schema.verification.record.version to true, because it is set to false by default (a combined example follows this list).
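
Expressed as cluster Spark configuration, a sketch of these verification settings might look like the following; the spark.hadoop prefix mirrors the pattern used elsewhere in this article to propagate Hive options to the metastore client:

spark.hadoop.hive.metastore.schema.verification true
spark.hadoop.hive.metastore.schema.verification.record.version true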

Set up an external metastore using the UI

To set up an external metastore using the Azure Databricks UI:

  1. Click the Clusters button on the sidebar.

  2. Click Create Cluster.

  3. Enter the following Spark configuration options (a filled-in sketch follows this procedure):

    # Hive-specific configuration options.
    # spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
    # JDBC connect string for a JDBC metastore
    spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>
    
    # Username to use against metastore database
    spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>
    
    # Password to use against metastore database
    spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
    
    # Driver class name for a JDBC metastore
    spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
    
    # Spark specific configuration options
    spark.sql.hive.metastore.version <hive-version>
    # Skip this one if <hive-version> is 0.13.x.
    spark.sql.hive.metastore.jars <hive-jar-source>
    
  4. Continue your cluster configuration, following the instructions in Configure clusters.

  5. Click Create Cluster to create the cluster.
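
As a concrete sketch, a cluster pointing to a hypothetical Hive 2.3.7 metastore in Azure SQL Database might use values like the following; every value shown is a placeholder, not a real endpoint or credential:

spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://myserver.database.windows.net:1433;database=mymetastoredb
spark.hadoop.javax.jdo.option.ConnectionUserName metastore_user
spark.hadoop.javax.jdo.option.ConnectionPassword <password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin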

Set up an external metastore using an init script

Init scripts let you connect to an existing Hive metastore without manually setting required configurations.

  1. Create the base directory you want to store the init script in, if it does not exist. The following example uses dbfs:/databricks/scripts.

  2. Run the following snippet in a notebook. The snippet creates the init script /databricks/scripts/external-metastore.sh in Databricks File System (DBFS). Alternatively, you can use the DBFS REST API’s put operation to create the init script. This init script writes required configuration options to a configuration file named 00-custom-spark.conf in a JSON-like format under /databricks/driver/conf/ inside every node of the cluster, whenever a cluster with the name specified as <cluster-name> starts. Azure Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file. Configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order. If you want to change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the spark-branch.conf file.

    dbutils.fs.put(
        "/databricks/scripts/external-metastore.sh",
        """#!/bin/sh
          |# Loads environment variables to determine the correct JDBC driver to use.
          |source /etc/environment
          |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
          |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
          |[driver] {
          |    # Hive specific configuration options.
          |    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
          |    # JDBC connect string for a JDBC metastore
          |    "spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"
          |
          |    # Username to use against metastore database
          |    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"
          |
          |    # Password to use against metastore database
          |    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"
          |
          |    # Driver class name for a JDBC metastore
          |    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
          |
          |    # Spark specific configuration options
          |    "spark.sql.hive.metastore.version" = "<hive-version>"
          |    # Skip this one if <hive-version> is 0.13.x.
          |    "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
          |}
          |EOF
          |""".stripMargin,
        overwrite = true
    )
    
  3. Configure your cluster with the init script.

  4. Restart the cluster.
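
As an optional sanity check (not part of the procedure above), you can run the following in a Scala notebook after the restart to confirm which metastore client version is in effect and that the external catalog is reachable:

// Confirm which Hive metastore client version the cluster is using.
println(spark.conf.get("spark.sql.hive.metastore.version"))
// List databases through the external metastore; errors here usually map to the troubleshooting entries below.
spark.sql("SHOW DATABASES").show()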

Troubleshooting

Clusters do not start (due to incorrect init script settings)

If an init script for setting up the external metastore causes cluster creation to fail, configure the init script to log, and debug the init script using the logs.

Error in SQL statement: InvocationTargetException

  • Error message pattern in the full exception stack trace:

    Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = [...]
    

    The external metastore JDBC connection information is misconfigured. Verify the configured hostname, port, username, password, and JDBC driver class name. Also, make sure that the username has the right privileges to access the metastore database.

  • Error message pattern in the full exception stack trace:

    Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. [...]
    

    The external metastore database was not properly initialized. Verify that you created the metastore database and put the correct database name in the JDBC connection string. Then, start a new cluster with the following two Spark configuration options:

    datanucleus.autoCreateSchema true
    datanucleus.fixedDatastore false
    

    In this way, the Hive client library will try to create and initialize tables in the metastore database automatically when it tries to access them but finds them absent.

Error in SQL statement: AnalysisException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetastoreClient

Error message in the full exception stack trace:

The specified datastore driver (driver name) was not found in the CLASSPATH

The cluster is configured to use an incorrect JDBC driver.

Setting datanucleus.autoCreateSchema to true doesn’t work as expected

By default, Databricks also sets datanucleus.fixedDatastore to true, which prevents any accidental structural changes to the metastore database. Therefore, the Hive client library cannot create metastore tables even if you set datanucleus.autoCreateSchema to true. This strategy is, in general, safer for production environments since it prevents the metastore database from being accidentally upgraded.

If you do want to use datanucleus.autoCreateSchema to help initialize the metastore database, make sure you set datanucleus.fixedDatastore to false. Also, you may want to flip both flags back after initializing the metastore database, to provide better protection for your production environment.
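
For example, a cluster created after the metastore database has been initialized would flip the two troubleshooting values above back to their safer defaults:

datanucleus.autoCreateSchema false
datanucleus.fixedDatastore true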