Use external metadata stores in Azure HDInsight

HDInsight allows you to take control of your data and metadata with external data stores. This feature is available for the Apache Hive metastore, the Apache Oozie metastore, and the Apache Ambari database.

The Apache Hive metastore in HDInsight is an essential part of the Apache Hadoop architecture. A metastore is the central schema repository. The metastore is used by other big data access tools such as Apache Spark, Interactive Query (LLAP), Presto, and Apache Pig. HDInsight uses an Azure SQL Database as the Hive metastore.

Figure: HDInsight Hive metadata store architecture

There are two ways you can set up a metastore for your HDInsight clusters:

Default metastore

By default, HDInsight creates a metastore with every cluster type. You can instead specify a custom metastore. The default metastore includes the following considerations:

  • No additional cost. HDInsight creates a metastore with every cluster type without any additional cost to you.

  • Each default metastore is part of the cluster lifecycle. When you delete a cluster, the corresponding metastore and metadata are also deleted.

  • You can't share the default metastore with other clusters.

  • The default metastore is recommended only for simple workloads, that is, workloads that don't require multiple clusters and don't need metadata preserved beyond the cluster's lifecycle.

Important

The default metastore provides an Azure SQL Database on the Basic tier with a 5 DTU limit, which can't be upgraded. It is suitable for basic testing purposes. For large or production workloads, we recommend migrating to an external metastore.

Custom metastore

HDInsight also supports custom metastores, which are recommended for production clusters:

  • You specify your own Azure SQL Database as the metastore.
  • The lifecycle of the metastore is not tied to a cluster's lifecycle, so you can create and delete clusters without losing metadata. Metadata such as your Hive schemas persists even after you delete and re-create the HDInsight cluster.
  • A custom metastore lets you attach multiple clusters and cluster types to that metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.
  • You pay for the cost of a metastore (Azure SQL DB) according to the performance level you choose.
  • You can scale up the metastore as needed.
  • The cluster and the external metastore must be hosted in the same region.

Figure: HDInsight Hive metadata store use case

Create and configure Azure SQL Database for the custom metastore

Create an Azure SQL Database, or have an existing one available, before setting up a custom Hive metastore for an HDInsight cluster. For more information, see Quickstart: Create a single database in Azure SQL DB.

While creating the cluster, the HDInsight service needs to connect to the external metastore and verify your credentials. Configure Azure SQL Database firewall rules to allow Azure services and resources to access the server. Enable this option in the Azure portal by selecting Set server firewall. Then, for the Azure SQL Database server or database, select No under Deny public network access and Yes under Allow Azure services and resources to access this server. For more information, see Create and manage IP firewall rules.

Private endpoints for SQL stores are not supported.
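The same firewall setting can also be scripted instead of set in the portal. The following is a minimal sketch, not part of the original article: it only builds the argument list for the Azure CLI command az sql server firewall-rule create. It relies on the Azure SQL convention that a rule whose start and end address are both 0.0.0.0 allows Azure services to reach the server. The function name, the default rule name AllowAllWindowsAzureIps, and the resource names are illustrative assumptions.

```python
from typing import List

# Azure SQL treats a rule with start == end == 0.0.0.0 as
# "allow Azure services and resources to access this server".
AZURE_SERVICES_IP = "0.0.0.0"


def allow_azure_services_command(
    resource_group: str,
    server: str,
    rule_name: str = "AllowAllWindowsAzureIps",  # assumed rule name
) -> List[str]:
    """Build the argument list for `az sql server firewall-rule create`."""
    if not resource_group or not server:
        raise ValueError("resource_group and server are required")
    return [
        "az", "sql", "server", "firewall-rule", "create",
        "--resource-group", resource_group,
        "--server", server,
        "--name", rule_name,
        "--start-ip-address", AZURE_SERVICES_IP,
        "--end-ip-address", AZURE_SERVICES_IP,
    ]


if __name__ == "__main__":
    # Hypothetical resource names; pass the list to subprocess.run(...) to execute.
    print(" ".join(allow_azure_services_command("my-resource-group", "my-sql-server")))
```

Returning a list rather than a shell string keeps the names from being re-parsed by a shell if the command is later handed to subprocess.run.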

Figure: Set server firewall button

Figure: Allow Azure services access

Select a custom metastore during cluster creation

You can point your cluster to a previously created Azure SQL Database at any time. For cluster creation through the portal, specify the option under Storage > Metastore settings.
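When a cluster is created from a template or script rather than the portal, the metastore is typically passed as Hive configuration. The sketch below is an assumption-laden illustration, not taken from this article: the javax.jdo.option.* keys are standard Hive metastore connection properties, while the hive-env keys, the JDBC URL options, and the function name follow a pattern commonly seen in HDInsight deployment templates and should be checked against the template reference before use.

```python
from typing import Dict


def hive_metastore_configurations(
    sql_server: str,
    database: str,
    user: str,
    password: str,
    host_suffix: str = "database.windows.net",  # differs in sovereign clouds
) -> Dict[str, Dict[str, str]]:
    """Build hive-site / hive-env settings pointing Hive at an external Azure SQL DB."""
    host = f"{sql_server}.{host_suffix}"
    # URL options below are assumptions modeled on common HDInsight templates.
    jdbc_url = (
        f"jdbc:sqlserver://{host};database={database};"
        "encrypt=true;trustServerCertificate=true;create=false;loginTimeout=300"
    )
    return {
        "hive-site": {
            "javax.jdo.option.ConnectionDriverName": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "javax.jdo.option.ConnectionURL": jdbc_url,
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
        "hive-env": {  # assumed key names; verify before use
            "hive_database": "Existing MSSQL Server database with SQL authentication",
            "hive_database_name": database,
            "hive_database_type": "mssql",
            "hive_existing_mssql_server_database": database,
            "hive_existing_mssql_server_host": host,
            "hive_hostname": host,
        },
    }
```

In practice the password would come from a secret store rather than a literal argument.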

Figure: HDInsight Hive metadata store in the Azure portal

Hive metastore guidelines

  • Use a custom metastore whenever possible, to help separate compute resources (your running cluster) and metadata (stored in the metastore).

  • Start with an S2 tier, which provides 50 DTU and 250 GB of storage. If you see a bottleneck, you can scale the database up.

  • If you intend multiple HDInsight clusters to access separate data, use a separate database for the metastore on each cluster. If you share a metastore across multiple HDInsight clusters, the clusters use the same metadata and underlying user data files.

  • Back up your custom metastore periodically. Azure SQL Database generates backups automatically, but the backup retention timeframe varies. For more information, see Learn about automatic SQL Database backups.

  • Locate your metastore and HDInsight cluster in the same region. This configuration provides the highest performance and lowest network egress charges.

  • Monitor your metastore for performance and availability using Azure SQL Database monitoring tools or Azure Monitor logs.

  • When a new, higher version of Azure HDInsight is created against an existing custom metastore database, the system upgrades the schema of the metastore. The upgrade is irreversible without restoring the database from backup.

  • If you share a metastore across multiple clusters, ensure all the clusters are the same HDInsight version. Different Hive versions use different metastore database schemas. For example, you can't share a metastore across Hive 2.1 and Hive 3.1 versioned clusters.

  • In HDInsight 4.0, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables. A table created by Spark lives in the Spark catalog. A table created by Hive lives in the Hive catalog. This behavior differs from HDInsight 3.6, where Hive and Spark shared a common catalog. Hive and Spark integration in HDInsight 4.0 relies on Hive Warehouse Connector (HWC). HWC works as a bridge between Spark and Hive. For more information, see Learn about Hive Warehouse Connector.

  • In HDInsight 4.0, if you would like to share the metastore between Hive and Spark, you can do so by changing the property metastore.catalog.default to hive in your Spark cluster. You can find this property in Ambari under Advanced spark2-hive-site-override. Sharing the metastore works only for external Hive tables. It does not work if you have internal (managed) Hive tables or ACID tables.
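Two of the guidelines above (same HDInsight version across sharing clusters, and same region as the metastore) are easy to check mechanically before attaching clusters to a shared metastore. The following is a small illustrative sketch; the Cluster type, its fields, and the function name are hypothetical and not part of any HDInsight API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a cluster that will share the metastore."""
    name: str
    hdinsight_version: str
    region: str


def metastore_sharing_problems(clusters: List[Cluster], metastore_region: str) -> List[str]:
    """Return human-readable violations of the sharing guidelines (empty if none)."""
    problems: List[str] = []

    # All clusters sharing one metastore should run the same HDInsight version.
    versions = sorted({c.hdinsight_version for c in clusters})
    if len(versions) > 1:
        problems.append("clusters use different HDInsight versions: " + ", ".join(versions))

    # Each cluster should be in the same region as the metastore.
    for c in clusters:
        if c.region != metastore_region:
            problems.append(
                f"{c.name} is in {c.region}, but the metastore is in {metastore_region}"
            )
    return problems


if __name__ == "__main__":
    candidates = [
        Cluster("spark-a", "4.0", "eastus"),
        Cluster("hive-b", "3.6", "westus"),
    ]
    for problem in metastore_sharing_problems(candidates, "eastus"):
        print(problem)
```

The check covers only version and region; table types (external versus managed or ACID) still have to be reviewed separately.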

Apache Oozie metastore

Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie supports Hadoop jobs for Apache MapReduce, Pig, Hive, and others. Oozie uses a metastore to store details about workflows. To increase performance when using Oozie, you can use Azure SQL Database as a custom metastore. The metastore provides access to Oozie job data after you delete your cluster.

For instructions on creating an Oozie metastore with Azure SQL Database, see Use Apache Oozie for workflows.

Custom Ambari DB

To use your own external database with Apache Ambari on HDInsight, see Custom Apache Ambari database.

Next steps