Process and analyze JSON documents by using Apache Hive in Azure HDInsight

Learn how to process and analyze JavaScript Object Notation (JSON) files by using Apache Hive in Azure HDInsight. This article uses the following JSON document:

{
  "StudentId": "trgfg-5454-fdfdg-4346",
  "Grade": 7,
  "StudentDetails": [
    {
      "FirstName": "Peggy",
      "LastName": "Williams",
      "YearJoined": 2012
    }
  ],
  "StudentClassCollection": [
    {
      "ClassId": "89084343",
      "ClassParticipation": "Satisfied",
      "ClassParticipationRank": "High",
      "Score": 93,
      "PerformedActivity": false
    },
    {
      "ClassId": "78547522",
      "ClassParticipation": "NotSatisfied",
      "ClassParticipationRank": "None",
      "Score": 74,
      "PerformedActivity": false
    },
    {
      "ClassId": "78675563",
      "ClassParticipation": "Satisfied",
      "ClassParticipationRank": "Low",
      "Score": 83,
      "PerformedActivity": true
    }
  ]
}

The file can be found at wasb://processjson@hditutorialdata.blob.core.chinacloudapi.cn/. For more information on how to use Azure Blob storage with HDInsight, see Use HDFS-compatible Azure Blob storage with Apache Hadoop in HDInsight. You can copy the file to the default container of your cluster.

In this article, you use the Apache Hive console. For instructions on how to open the Hive console, see Use Apache Ambari Hive View with Apache Hadoop in HDInsight.

Flatten JSON documents

The methods listed in the next section require that the JSON document fit on a single row. So, you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data. To flatten the JSON document, run the following script:

    DROP TABLE IF EXISTS StudentsRaw;
    CREATE EXTERNAL TABLE StudentsRaw (textcol string) STORED AS TEXTFILE LOCATION "wasb://processjson@hditutorialdata.blob.core.chinacloudapi.cn/";

    DROP TABLE IF EXISTS StudentsOneLine;
    CREATE EXTERNAL TABLE StudentsOneLine
    (
      json_body string
    )
    STORED AS TEXTFILE LOCATION '/json/students';

    INSERT OVERWRITE TABLE StudentsOneLine
    SELECT CONCAT_WS(' ',COLLECT_LIST(textcol)) AS singlelineJSON
          FROM (SELECT INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE, textcol FROM StudentsRaw DISTRIBUTE BY INPUT__FILE__NAME SORT BY BLOCK__OFFSET__INSIDE__FILE) x
          GROUP BY INPUT__FILE__NAME;

    SELECT * FROM StudentsOneLine;

The raw JSON file is located at wasb://processjson@hditutorialdata.blob.core.chinacloudapi.cn/. The StudentsRaw Hive table points to the raw, unflattened JSON document.

The StudentsOneLine Hive table stores the data in the HDInsight default file system under the /json/students/ path.

The INSERT statement populates the StudentsOneLine table with the flattened JSON data.

The SELECT statement returns only one row.

Here is the output of the SELECT statement:

(Screenshot: flattened JSON document)
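Outside Hive, the flattening step can be sketched in plain Python. This is a minimal, hypothetical illustration of what `CONCAT_WS(' ', COLLECT_LIST(textcol))` accomplishes per file — the function name below is invented for illustration, not part of any Hive API:

```python
import json

def flatten_json_text(raw_text: str) -> str:
    """Collapse a multi-line JSON document into a single line by
    joining the stripped lines with spaces (like CONCAT_WS in Hive)."""
    return " ".join(line.strip() for line in raw_text.splitlines() if line.strip())

raw = """{
  "StudentId": "trgfg-5454-fdfdg-4346",
  "Grade": 7
}"""

one_line = flatten_json_text(raw)
print(one_line)                        # the whole document on one line
print(json.loads(one_line)["Grade"])   # still valid JSON -> 7
```

The joined string remains parseable JSON, which is all the downstream UDFs need.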

Analyze JSON documents in Hive

Hive provides three different mechanisms to run queries on JSON documents, or you can write your own:

  • Use the get_json_object user-defined function (UDF).
  • Use the json_tuple UDF.
  • Use a custom Serializer/Deserializer (SerDe).
  • Write your own UDF by using Python or other languages. For more information on how to run your own Python code with Hive, see Python UDF with Apache Hive and Apache Pig.

Use the get_json_object UDF

Hive provides a built-in UDF called get_json_object that queries JSON at run time. The method takes two arguments: the column that holds the flattened JSON document, and the JSON path of the field to parse. Let's look at an example to see how this UDF works.

The following query returns the first name and last name for each student:

SELECT
  GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.FirstName'),
  GET_JSON_OBJECT(StudentsOneLine.json_body,'$.StudentDetails.LastName')
FROM StudentsOneLine;

Here is the output when you run this query in the console window:

(Screenshot: get_json_object UDF output)

The get_json_object UDF has limitations:

  • Because each field in the query triggers a fresh parse of the JSON string, performance suffers.
  • GET_JSON_OBJECT() returns the string representation of an array. To convert this string to a Hive array, you have to use regular expressions to replace the square brackets "[" and "]", and then also call split to get the array.

This is why the Hive wiki recommends that you use json_tuple.
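The array workaround can be illustrated outside Hive. Here is a hypothetical Python sketch of the same regex-and-split dance — assuming a call such as GET_JSON_OBJECT(json_body, '$.StudentClassCollection[*].Score') returned the string shown below:

```python
import re

# get_json_object returns the array as a plain string, e.g. "[93,74,83]",
# rather than as a Hive array type.
array_as_string = "[93,74,83]"

# Mimic the Hive workaround: strip the brackets with a regex, then split.
stripped = re.sub(r"[\[\]]", "", array_as_string)
scores = stripped.split(",")
print(scores)  # ['93', '74', '83'] -- note the elements are still strings
```

Each element still needs a cast before it can be used numerically, which is exactly the awkwardness the limitation describes.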

Use the json_tuple UDF

Another UDF provided by Hive is called json_tuple, which performs better than get_json_object. This method takes a set of keys and a JSON string, and returns a tuple of values in a single function call. The following query returns the student ID and the grade from the JSON document:

SELECT q1.StudentId, q1.Grade
FROM StudentsOneLine jt
LATERAL VIEW JSON_TUPLE(jt.json_body, 'StudentId', 'Grade') q1
  AS StudentId, Grade;

Here is the output of this script in the Hive console:

(Screenshot: json_tuple UDF output)

The json_tuple UDF uses the lateral view syntax in Hive, which lets json_tuple create a virtual table by applying a UDTF (user-defined table-generating function) to each row of the original table. Complex JSON documents become unwieldy because of the repeated use of LATERAL VIEW. Furthermore, JSON_TUPLE cannot handle nested JSON documents.
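Conceptually, json_tuple parses the document once and returns values for all requested keys. A minimal Python sketch of that behavior — the helper name mirrors the UDF for readability but is not Hive code:

```python
import json

def json_tuple(json_body: str, *keys):
    """Parse the JSON string once and return the requested top-level
    values as a tuple (None for missing keys), like Hive's json_tuple."""
    doc = json.loads(json_body)
    return tuple(doc.get(k) for k in keys)

body = '{"StudentId": "trgfg-5454-fdfdg-4346", "Grade": 7}'
student_id, grade = json_tuple(body, "StudentId", "Grade")
print(student_id, grade)  # trgfg-5454-fdfdg-4346 7
```

The single parse per row is the source of the performance advantage over repeated get_json_object calls; the restriction to top-level keys mirrors json_tuple's inability to reach into nested structures.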

Use a custom SerDe

A SerDe is the best choice for parsing nested JSON documents. It lets you define the JSON schema, and then use that schema to parse the documents. For instructions, see How to use a custom JSON SerDe with Microsoft Azure HDInsight.
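To see why schema-aware parsing helps with nesting, here is a small Python sketch that reads the nested StudentClassCollection array from the sample document and computes an aggregate — the kind of access a JSON SerDe would expose directly as Hive columns:

```python
import json

doc = json.loads("""{
  "StudentId": "trgfg-5454-fdfdg-4346",
  "Grade": 7,
  "StudentClassCollection": [
    {"ClassId": "89084343", "Score": 93},
    {"ClassId": "78547522", "Score": 74},
    {"ClassId": "78675563", "Score": 83}
  ]
}""")

# With a SerDe mapping in place, each nested field behaves like a column;
# here we reach into the nested array and average the scores.
scores = [c["Score"] for c in doc["StudentClassCollection"]]
average = sum(scores) / len(scores)
print(average)
```

With get_json_object or json_tuple, the same aggregate would require the string-manipulation workarounds shown earlier.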

Summary

In conclusion, the type of JSON operator in Hive that you choose depends on your scenario. If you have a simple JSON document with only one field to look up, you can use the Hive UDF get_json_object. If you have more than one key to look up, you can use json_tuple. If you have a nested document, you should use the JSON SerDe.

Next steps

For related articles, see: