Analyze Application Insights telemetry logs with Apache Spark on HDInsight

Learn how to use Apache Spark on HDInsight to analyze Application Insights telemetry data.

Visual Studio Application Insights is an analytics service that monitors your web applications. Telemetry data generated by Application Insights can be exported to Azure Storage. Once the data is in Azure Storage, HDInsight can be used to analyze it.

Prerequisites

  • An application that is configured to use Application Insights.

  • Familiarity with creating a Linux-based HDInsight cluster. For more information, see Create Apache Spark on HDInsight.

  • A web browser.

The following resources were used in developing and testing this document:

架构与规划Architecture and planning

下图演示了本示例的服务体系结构:The following diagram illustrates the service architecture of this example:

数据从 Application Insights 流动到 Blob 存储,然后到 Spark

Azure 存储Azure Storage

Application Insights 可配置为将遥测信息连续导出到 blob。Application Insights can be configured to continuously export telemetry information to blobs. 然后,HDInsight 可读取存储在 blob 中的数据。HDInsight can then read data stored in the blobs. 但是,必须符合某些要求:However, there are some requirements that you must follow:

  • Location: If the storage account and HDInsight are in different locations, it may increase latency. It also increases cost, because egress charges apply to data moving between regions.

    Warning

    Using a storage account in a different location than HDInsight is not supported.

  • Blob type: HDInsight only supports block blobs. Application Insights defaults to block blobs, so it should work with HDInsight by default; a quick way to verify the blob type is sketched below.

For information on adding storage to an existing cluster, see the Add additional storage accounts document.
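
If you want to confirm the blob type before pointing HDInsight at the container, you can list the exported blobs with the Azure Storage SDK for Python. The following is a minimal sketch and is not part of the export or cluster setup; it assumes the azure-storage-blob package (v12), and the STORAGEACCOUNT, STORAGE_ACCOUNT_KEY, and CONTAINER values are placeholders for your own storage account name, access key, and export container.

    from azure.storage.blob import BlobServiceClient

    # Placeholder values; substitute your own storage account, key, and container.
    account_url = "https://STORAGEACCOUNT.blob.core.chinacloudapi.cn"  # Azure China blob endpoint
    service = BlobServiceClient(account_url=account_url, credential="STORAGE_ACCOUNT_KEY")
    container = service.get_container_client("CONTAINER")

    # Continuous Export writes block blobs, which is the blob type HDInsight expects.
    for blob in container.list_blobs():
        print(blob.name, blob.blob_type)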

Data schema

Application Insights provides export data model information for the format of the telemetry data exported to blobs. The steps in this document use Spark SQL to work with the data. Spark SQL can automatically generate a schema for the JSON data structure logged by Application Insights.
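
Before letting Spark SQL infer a schema, it can help to look at one raw exported record. Here is a minimal PySpark sketch; the wasbs path is a placeholder, and it assumes the same sc and sqlContext objects that the Jupyter Notebook sessions in the steps below provide.

    # Placeholder path; substitute the wasbs path to your exported Requests telemetry.
    # The globs match the YYYY-MM-DD/{##} folders that Continuous Export creates.
    raw = sc.textFile('wasbs://CONTAINER@STORAGEACCOUNT.blob.core.chinacloudapi.cn/EXPORT_FOLDER/Requests/*/*/*')

    # Print one raw JSON document. sqlContext.read.json (used later in this article)
    # infers the schema from documents shaped like this one.
    print(raw.first())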

Export telemetry data

Follow the steps in Configure Continuous Export to configure your Application Insights to export telemetry information to an Azure Storage blob.

Configure HDInsight to access the data

If you're creating an HDInsight cluster, add the storage account during cluster creation.

To add the Azure Storage account to an existing cluster, use the information in the Add additional storage accounts document.

Analyze the data: PySpark

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/jupyter, where CLUSTERNAME is the name of your cluster.

  2. In the upper right corner of the Jupyter page, select New, and then PySpark. A new browser tab containing a Python-based Jupyter Notebook opens.

  3. In the first field (called a cell) on the page, enter the following text:

    sc._jsc.hadoopConfiguration().set('mapreduce.input.fileinputformat.input.dir.recursive', 'true')
    

    This code configures Spark to recursively access the directory structure for the input data. Application Insights telemetry is logged to a directory structure similar to /{telemetry type}/YYYY-MM-DD/{##}/. A sketch after these steps shows how to use this layout to scope the input to a single day.

  4. Use SHIFT+ENTER to run the code. On the left side of the cell, an '*' appears between the brackets to indicate that the code in this cell is being executed. Once it completes, the '*' changes to a number, and output similar to the following text is displayed below the cell:

    Creating SparkContext as 'sc'
    
    ID    YARN Application ID    Kind    State    Spark UI    Driver log    Current session?
    3    application_1468969497124_0001    pyspark    idle    Link    Link    ✔
    
    Creating HiveContext as 'sqlContext'
    SparkContext and HiveContext created. Executing user code ...
    
  5. A new cell is created below the first one. Enter the following text in the new cell. Replace CONTAINER and STORAGEACCOUNT with the blob container name and Azure Storage account name that contain the Application Insights data.

    %%bash
    hdfs dfs -ls wasbs://CONTAINER@STORAGEACCOUNT.blob.core.chinacloudapi.cn/
    

    Use SHIFT+ENTER to execute this cell. You see a result similar to the following text:

        Found 1 items
        drwxrwxrwx   -          0 1970-01-01 00:00 wasbs://appinsights@contosostore.blob.core.chinacloudapi.cn/contosoappinsights_2bededa61bc741fbdee6b556571a4831
    

    The wasbs path returned is the location of the Application Insights telemetry data. Change the hdfs dfs -ls line in the cell to use the wasbs path returned, and then use SHIFT+ENTER to run the cell again. This time, the results should display the directories that contain telemetry data.

    Note

    The remainder of the steps in this section use the wasbs://appinsights@contosostore.blob.core.chinacloudapi.cn/contosoappinsights_{ID}/Requests directory. Your directory structure may be different.

  6. In the next cell, enter the following code. Replace WASB_PATH with the path from the previous step.

    jsonFiles = sc.textFile('WASB_PATH')
    jsonData = sqlContext.read.json(jsonFiles)
    

    This code creates a dataframe from the JSON files exported by the continuous export process. Use SHIFT+ENTER to run this cell.

  7. In the next cell, enter and run the following to view the schema that Spark created for the JSON files:

    jsonData.printSchema()
    

    The schema for each type of telemetry is different. The following example is the schema that is generated for web requests (data stored in the Requests subdirectory):

    root
    |-- context: struct (nullable = true)
    |    |-- application: struct (nullable = true)
    |    |    |-- version: string (nullable = true)
    |    |-- custom: struct (nullable = true)
    |    |    |-- dimensions: array (nullable = true)
    |    |    |    |-- element: string (containsNull = true)
    |    |    |-- metrics: array (nullable = true)
    |    |    |    |-- element: string (containsNull = true)
    |    |-- data: struct (nullable = true)
    |    |    |-- eventTime: string (nullable = true)
    |    |    |-- isSynthetic: boolean (nullable = true)
    |    |    |-- samplingRate: double (nullable = true)
    |    |    |-- syntheticSource: string (nullable = true)
    |    |-- device: struct (nullable = true)
    |    |    |-- browser: string (nullable = true)
    |    |    |-- browserVersion: string (nullable = true)
    |    |    |-- deviceModel: string (nullable = true)
    |    |    |-- deviceName: string (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- osVersion: string (nullable = true)
    |    |    |-- type: string (nullable = true)
    |    |-- location: struct (nullable = true)
    |    |    |-- city: string (nullable = true)
    |    |    |-- clientip: string (nullable = true)
    |    |    |-- continent: string (nullable = true)
    |    |    |-- country: string (nullable = true)
    |    |    |-- province: string (nullable = true)
    |    |-- operation: struct (nullable = true)
    |    |    |-- name: string (nullable = true)
    |    |-- session: struct (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- isFirst: boolean (nullable = true)
    |    |-- user: struct (nullable = true)
    |    |    |-- anonId: string (nullable = true)
    |    |    |-- isAuthenticated: boolean (nullable = true)
    |-- internal: struct (nullable = true)
    |    |-- data: struct (nullable = true)
    |    |    |-- documentVersion: string (nullable = true)
    |    |    |-- id: string (nullable = true)
    |-- request: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- count: long (nullable = true)
    |    |    |-- durationMetric: struct (nullable = true)
    |    |    |    |-- count: double (nullable = true)
    |    |    |    |-- max: double (nullable = true)
    |    |    |    |-- min: double (nullable = true)
    |    |    |    |-- sampledValue: double (nullable = true)
    |    |    |    |-- stdDev: double (nullable = true)
    |    |    |    |-- value: double (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- name: string (nullable = true)
    |    |    |-- responseCode: long (nullable = true)
    |    |    |-- success: boolean (nullable = true)
    |    |    |-- url: string (nullable = true)
    |    |    |-- urlData: struct (nullable = true)
    |    |    |    |-- base: string (nullable = true)
    |    |    |    |-- hashTag: string (nullable = true)
    |    |    |    |-- host: string (nullable = true)
    |    |    |    |-- protocol: string (nullable = true)
    
  8. Use the following to register the dataframe as a temporary table and run a query against the data:

    jsonData.registerTempTable("requests")
    df = sqlContext.sql("select context.location.city from requests where context.location.city is not null")
    df.show()
    

    This query returns the city information for the first 20 records where context.location.city isn't null. A sketch after these steps shows how to drill further into the nested request data.

    Note

    The context structure is present in all telemetry logged by Application Insights. The city element may not be populated in your logs. Use the schema to identify other elements that you can query that may contain data for your logs.

    This query returns information similar to the following text:

    +---------+
    |     city|
    +---------+
    | Bellevue|
    |  Redmond|
    |  Seattle|
    |Charlotte|
    ...
    +---------+
    
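
Building on the steps above, the following PySpark sketch shows two optional refinements: scoping the input to a single day by using the /{telemetry type}/YYYY-MM-DD/{##}/ layout, and exploding the nested request array (shown in the schema from step 7) so that you can aggregate individual requests. The container, storage account, export folder, and date in the path are placeholders; the sketch assumes the same sc and sqlContext session that the notebook provides.

    from pyspark.sql.functions import col, explode

    # Placeholder path; substitute the wasbs path returned by `hdfs dfs -ls` in step 5
    # and a date that exists in your export. The trailing globs match the {##} folders.
    day_path = 'wasbs://CONTAINER@STORAGEACCOUNT.blob.core.chinacloudapi.cn/EXPORT_FOLDER/Requests/2016-08-01/*/*'
    day_data = sqlContext.read.json(sc.textFile(day_path))

    # Each exported document carries an array of request entries. Explode it into
    # one row per request, then count requests per HTTP response code.
    request_rows = day_data.select(explode(col('request')).alias('r'))
    request_rows.groupBy('r.responseCode').count().orderBy('count', ascending=False).show()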

Analyze the data: Scala

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/jupyter, where CLUSTERNAME is the name of your cluster.

  2. In the upper right corner of the Jupyter page, select New, and then Scala. A new browser tab containing a Scala-based Jupyter Notebook appears.

  3. In the first field (called a cell) on the page, enter the following text:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    

    This code configures Spark to recursively access the directory structure for the input data. Application Insights telemetry is logged to a directory structure similar to /{telemetry type}/YYYY-MM-DD/{##}/.

  4. Use SHIFT+ENTER to run the code. On the left side of the cell, an '*' appears between the brackets to indicate that the code in this cell is being executed. Once it completes, the '*' changes to a number, and output similar to the following text is displayed below the cell:

    Creating SparkContext as 'sc'
    
    ID    YARN Application ID    Kind    State    Spark UI    Driver log    Current session?
    3    application_1468969497124_0001    spark    idle    Link    Link    ✔
    
    Creating HiveContext as 'sqlContext'
    SparkContext and HiveContext created. Executing user code ...
    
  5. A new cell is created below the first one. Enter the following text in the new cell. Replace CONTAINER and STORAGEACCOUNT with the blob container name and Azure Storage account name that contain the Application Insights logs.

    %%bash
    hdfs dfs -ls wasbs://CONTAINER@STORAGEACCOUNT.blob.core.chinacloudapi.cn/
    

    Use SHIFT+ENTER to execute this cell. You see a result similar to the following text:

        Found 1 items
        drwxrwxrwx   -          0 1970-01-01 00:00 wasbs://appinsights@contosostore.blob.core.chinacloudapi.cn/contosoappinsights_2bededa61bc741fbdee6b556571a4831
    

    The wasbs path returned is the location of the Application Insights telemetry data. Change the hdfs dfs -ls line in the cell to use the wasbs path returned, and then use SHIFT+ENTER to run the cell again. This time, the results should display the directories that contain telemetry data.

    Note

    The remainder of the steps in this section use the wasbs://appinsights@contosostore.blob.core.chinacloudapi.cn/contosoappinsights_{ID}/Requests directory. This directory may not exist unless your telemetry data is for a web app.

  6. In the next cell, enter the following code. Replace WASB_PATH with the path from the previous step.

    var jsonFiles = sc.textFile("WASB_PATH")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    var jsonData = sqlContext.read.json(jsonFiles)
    

    This code creates a dataframe from the JSON files exported by the continuous export process. Use SHIFT+ENTER to run this cell.

  7. In the next cell, enter and run the following to view the schema that Spark created for the JSON files:

    jsonData.printSchema
    

    The schema for each type of telemetry is different. The following example is the schema that is generated for web requests (data stored in the Requests subdirectory):

    root
    |-- context: struct (nullable = true)
    |    |-- application: struct (nullable = true)
    |    |    |-- version: string (nullable = true)
    |    |-- custom: struct (nullable = true)
    |    |    |-- dimensions: array (nullable = true)
    |    |    |    |-- element: string (containsNull = true)
    |    |    |-- metrics: array (nullable = true)
    |    |    |    |-- element: string (containsNull = true)
    |    |-- data: struct (nullable = true)
    |    |    |-- eventTime: string (nullable = true)
    |    |    |-- isSynthetic: boolean (nullable = true)
    |    |    |-- samplingRate: double (nullable = true)
    |    |    |-- syntheticSource: string (nullable = true)
    |    |-- device: struct (nullable = true)
    |    |    |-- browser: string (nullable = true)
    |    |    |-- browserVersion: string (nullable = true)
    |    |    |-- deviceModel: string (nullable = true)
    |    |    |-- deviceName: string (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- osVersion: string (nullable = true)
    |    |    |-- type: string (nullable = true)
    |    |-- location: struct (nullable = true)
    |    |    |-- city: string (nullable = true)
    |    |    |-- clientip: string (nullable = true)
    |    |    |-- continent: string (nullable = true)
    |    |    |-- country: string (nullable = true)
    |    |    |-- province: string (nullable = true)
    |    |-- operation: struct (nullable = true)
    |    |    |-- name: string (nullable = true)
    |    |-- session: struct (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- isFirst: boolean (nullable = true)
    |    |-- user: struct (nullable = true)
    |    |    |-- anonId: string (nullable = true)
    |    |    |-- isAuthenticated: boolean (nullable = true)
    |-- internal: struct (nullable = true)
    |    |-- data: struct (nullable = true)
    |    |    |-- documentVersion: string (nullable = true)
    |    |    |-- id: string (nullable = true)
    |-- request: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- count: long (nullable = true)
    |    |    |-- durationMetric: struct (nullable = true)
    |    |    |    |-- count: double (nullable = true)
    |    |    |    |-- max: double (nullable = true)
    |    |    |    |-- min: double (nullable = true)
    |    |    |    |-- sampledValue: double (nullable = true)
    |    |    |    |-- stdDev: double (nullable = true)
    |    |    |    |-- value: double (nullable = true)
    |    |    |-- id: string (nullable = true)
    |    |    |-- name: string (nullable = true)
    |    |    |-- responseCode: long (nullable = true)
    |    |    |-- success: boolean (nullable = true)
    |    |    |-- url: string (nullable = true)
    |    |    |-- urlData: struct (nullable = true)
    |    |    |    |-- base: string (nullable = true)
    |    |    |    |-- hashTag: string (nullable = true)
    |    |    |    |-- host: string (nullable = true)
    |    |    |    |-- protocol: string (nullable = true)
    
  8. Use the following to register the dataframe as a temporary table and run a query against the data:

    jsonData.registerTempTable("requests")
    var city = sqlContext.sql("select context.location.city from requests where context.location.city is not null limit 10").show()
    

    This query returns the city information for the first 10 records where context.location.city is not null.

    Note

    The context structure is present in all telemetry logged by Application Insights. The city element may not be populated in your logs. Use the schema to identify other elements that you can query that may contain data for your logs.

    This query returns information similar to the following text:

    +---------+
    |     city|
    +---------+
    | Bellevue|
    |  Redmond|
    |  Seattle|
    |Charlotte|
    ...
    +---------+
    

Next steps

For more examples of using Apache Spark to work with data and services in Azure, see the following documents:

For information on creating and running Spark applications, see the following documents: