将自定义 Python 库与 HDInsight 上的 Apache Spark 群集配合使用来分析网站日志Analyze website logs using a custom Python library with Apache Spark cluster on HDInsight

此笔记本演示如何将自定义库与 HDInsight 上的 Apache Spark 配合使用来分析日志数据。This notebook demonstrates how to analyze log data using a custom library with Apache Spark on HDInsight. 我们使用的自定义库是一个名为 iislogparser.py 的 Python 库。The custom library we use is a Python library called iislogparser.py.

必备条件Prerequisites

HDInsight 上的 Apache Spark 群集。An Apache Spark cluster on HDInsight. 有关说明,请参阅在 Azure HDInsight 中创建 Apache Spark 群集For instructions, see Create Apache Spark clusters in Azure HDInsight.

将原始数据另存为 RDDSave raw data as an RDD

在本部分中,将使用与 HDInsight 中的 Apache Spark 群集关联的 Jupyter 笔记本来运行用于处理原始示例数据并将其保存为 Hive 表的作业。In this section, we use the Jupyter notebook associated with an Apache Spark cluster in HDInsight to run jobs that process your raw sample data and save it as a Hive table. 示例数据是所有群集在默认情况下均会提供的 .csv 文件 (hvac.csv)。The sample data is a .csv file (hvac.csv) available on all clusters by default.

将数据保存为 Apache Hive 表之后,下一部分我们将使用 Power BI 和 Tableau 等 BI 工具来连接该 Hive 表。Once your data is saved as an Apache Hive table, in the next section we will connect to the Hive table using BI tools such as Power BI and Tableau.

  1. 在 Web 浏览器中导航到 https://CLUSTERNAME.azurehdinsight.cn/jupyter,其中的 CLUSTERNAME 是群集的名称。From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/jupyter, where CLUSTERNAME is the name of your cluster.

  2. 创建新的笔记本。Create a new notebook. 依次选择“新建”、“PySpark” 。Select New, and then PySpark.

    创建新的 Jupyter 笔记本Create a new Jupyter notebook

  3. 新笔记本随即已创建,并以 Untitled.pynb 名称打开。A new notebook is created and opened with the name Untitled.pynb. 在顶部单击笔记本名称,并输入一个友好名称。Click the notebook name at the top, and enter a friendly name.

    提供笔记本的名称Provide a name for the notebook

  4. 使用笔记本是使用 PySpark 内核创建的,因此不需要显式创建任何上下文。Because you created a notebook using the PySpark kernel, you do not need to create any contexts explicitly. 运行第一个代码单元格时,系统会自动创建 Spark 和 Hive 上下文。The Spark and Hive contexts will be automatically created for you when you run the first code cell. 首先可以导入此方案所需的类型。You can start by importing the types that are required for this scenario. 将以下代码段粘贴到空白单元格中,然后按 Shift+EnterPaste the following snippet in an empty cell, and then press SHIFT + ENTER.

    from pyspark.sql import Row
    from pyspark.sql.types import *
    
  5. 使用群集上已可用的示例日志数据创建 RDD。Create an RDD using the sample log data already available on the cluster. 可以在 \HdiSamples\HdiSamples\WebsiteLogSampleData\SampleLog\909f2b.log 访问与群集关联的默认存储帐户中的数据。You can access the data in the default storage account associated with the cluster at \HdiSamples\HdiSamples\WebsiteLogSampleData\SampleLog\909f2b.log. 执行以下代码:Execute the following code:

    logs = sc.textFile('wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log')
    
  6. 检索示例日志集以验证上一步是否成功完成。Retrieve a sample log set to verify that the previous step completed successfully.

    logs.take(5)
    

    应该会看到与以下文本类似的输出:You should see an output similar to the following text:

    [u'#Software: Microsoft Internet Information Services 8.0',
    u'#Fields: date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken',
    u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step2.png X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 53175 871 46',
    u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step3.png X-ARR-LOG-ID=9eace870-2f49-4efd-b204-0d170da46b4a 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 51237 871 32',
    u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step4.png X-ARR-LOG-ID=4bea5b3d-8ac9-46c9-9b8c-ec3e9500cbea 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 72177 871 47']
    

使用自定义 Python 库分析日志数据Analyze log data using a custom Python library

  1. 在上面的输出中,前几行包括标头信息,其余的每一行均与此标头中描述的架构相匹配。In the output above, the first couple lines include the header information and each remaining line matches the schema described in that header. 分析此类日志可能很复杂。Parsing such logs could be complicated. 因此,可使用自定义 Python 库 (iislogparser.py),它能使分析这类日志变得容易得多。So, we use a custom Python library (iislogparser.py) that makes parsing such logs much easier. 默认情况下,此库包含在 /HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py 处 HDInsight 上的 Spark 群集中。By default, this library is included with your Spark cluster on HDInsight at /HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py.

    但是,此库不在 PYTHONPATH 中,因此不能通过 import iislogparser 等导入语句来使用它。However, this library isn't in the PYTHONPATH so we can't use it by using an import statement like import iislogparser. 要使用此库,必须将其分发给所有从节点。To use this library, we must distribute it to all the worker nodes. 运行以下代码段。Run the following snippet.

    sc.addPyFile('wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py')
    
  2. 如果日志行是标题行,并且在遇到日志行时返回 LogLine 类的实例,则 iislogparser 提供返回 None 的函数 parse_log_lineiislogparser provides a function parse_log_line that returns None if a log line is a header row, and returns an instance of the LogLine class if it encounters a log line. 使用 LogLine 类从 RDD 中仅提取日志行:Use the LogLine class to extract only the log lines from the RDD:

    def parse_line(l):
        import iislogparser
        return iislogparser.parse_log_line(l)
    logLines = logs.map(parse_line).filter(lambda p: p is not None).cache()
    
  3. 检索一些提取的日志行,以验证该步骤是否成功完成。Retrieve a couple of extracted log lines to verify that the step completed successfully.

    logLines.take(2)
    

    输出应类似于以下文本:The output should be similar to the following text:

    [2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step2.png X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 53175 871 46,
    2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step3.png X-ARR-LOG-ID=9eace870-2f49-4efd-b204-0d170da46b4a 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 51237 871 32]
    
  4. 反过来,LogLine 类具有一些有用的方法,如 is_error(),可返回日志条目是否具有错误代码。The LogLine class, in turn, has some useful methods, like is_error(), which returns whether a log entry has an error code. 使用此类来计算提取日志行中的错误数,然后将所有错误记录到另一个文件中。Use this class to compute the number of errors in the extracted log lines, and then log all the errors to a different file.

    errors = logLines.filter(lambda p: p.is_error())
    numLines = logLines.count()
    numErrors = errors.count()
    print 'There are', numErrors, 'errors and', numLines, 'log entries'
    errors.map(lambda p: str(p)).saveAsTextFile('wasbs:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b-2.log')
    

    输出应该指出“There are 30 errors and 646 log entries”。The output should state There are 30 errors and 646 log entries.

  5. 还可以使用 Matplotlib 构造数据的可视化效果。You can also use Matplotlib to construct a visualization of the data. 例如,如果要找出请求长时间运行的原因,可能需要查找平均执行时间最长的文件。For example, if you want to isolate the cause of requests that run for a long time, you might want to find the files that take the most time to serve on average. 下面的代码段检索执行请求花费时间最长的前 25 个资源。The snippet below retrieves the top 25 resources that took most time to serve a request.

    def avgTimeTakenByKey(rdd):
        return rdd.combineByKey(lambda line: (line.time_taken, 1),
                                lambda x, line: (x[0] + line.time_taken, x[1] + 1),
                                lambda x, y: (x[0] + y[0], x[1] + y[1]))\
                    .map(lambda x: (x[0], float(x[1][0]) / float(x[1][1])))
    
    avgTimeTakenByKey(logLines.map(lambda p: (p.cs_uri_stem, p))).top(25, lambda x: x[1])
    

    应该看到输出类似于以下文本:You should see an output like the following text:

    [(u'/blogposts/mvc4/step13.png', 197.5),
    (u'/blogposts/mvc2/step10.jpg', 179.5),
    (u'/blogposts/extractusercontrol/step5.png', 170.0),
    (u'/blogposts/mvc4/step8.png', 159.0),
    (u'/blogposts/mvcrouting/step22.jpg', 155.0),
    (u'/blogposts/mvcrouting/step3.jpg', 152.0),
    (u'/blogposts/linqsproc1/step16.jpg', 138.75),
    (u'/blogposts/linqsproc1/step26.jpg', 137.33333333333334),
    (u'/blogposts/vs2008javascript/step10.jpg', 127.0),
    (u'/blogposts/nested/step2.jpg', 126.0),
    (u'/blogposts/adminpack/step1.png', 124.0),
    (u'/BlogPosts/datalistpaging/step2.png', 118.0),
    (u'/blogposts/mvc4/step35.png', 117.0),
    (u'/blogposts/mvcrouting/step2.jpg', 116.5),
    (u'/blogposts/aboutme/basketball.jpg', 109.0),
    (u'/blogposts/anonymoustypes/step11.jpg', 109.0),
    (u'/blogposts/mvc4/step12.png', 106.0),
    (u'/blogposts/linq8/step0.jpg', 105.5),
    (u'/blogposts/mvc2/step18.jpg', 104.0),
    (u'/blogposts/mvc2/step11.jpg', 104.0),
    (u'/blogposts/mvcrouting/step1.jpg', 104.0),
    (u'/blogposts/extractusercontrol/step1.png', 103.0),
    (u'/blogposts/sqlvideos/sqlvideos.jpg', 102.0),
    (u'/blogposts/mvcrouting/step21.jpg', 101.0),
    (u'/blogposts/mvc4/step1.png', 98.0)]
    
  6. 还可以在此绘图窗体中显示此信息。You can also present this information in the form of plot. 创建绘图的第一步是创建一个临时表 AverageTimeAs a first step to create a plot, let us first create a temporary table AverageTime. 该表按照时间对日志进行分组,以查看在任何特定时间是否有任何异常延迟峰值。The table groups the logs by time to see if there were any unusual latency spikes at any particular time.

    avgTimeTakenByMinute = avgTimeTakenByKey(logLines.map(lambda p: (p.datetime.minute, p))).sortByKey()
    schema = StructType([StructField('Minutes', IntegerType(), True),
                        StructField('Time', FloatType(), True)])
    
    avgTimeTakenByMinuteDF = sqlContext.createDataFrame(avgTimeTakenByMinute, schema)
    avgTimeTakenByMinuteDF.registerTempTable('AverageTime')
    
  7. 接下来可以运行以下 SQL 查询以获取 AverageTime 表中的所有记录。You can then run the following SQL query to get all the records in the AverageTime table.

    %%sql -o averagetime
    SELECT * FROM AverageTime
    

    后接 -o averagetime%%sql magic 可确保查询输出本地保存在 Jupyter 服务器上(通常在群集的头结点)。The %%sql magic followed by -o averagetime ensures that the output of the query is persisted locally on the Jupyter server (typically the headnode of the cluster). 输出将作为 Pandas 数据帧进行保存,指定名称为“averagetime” 。The output is persisted as a Pandas dataframe with the specified name averagetime.

    应该看到输出类似于下图:You should see an output like the following image:

    hdinsight jupyter sql 查询输出hdinsight jupyter sql query output

    有关 %%sql magic 的详细信息,请参阅 %%sql magic 支持的参数For more information about the %%sql magic, see Parameters supported with the %%sql magic.

  8. 现在可以使用 Matplotlib(用于构造数据可视化的库)来创建绘图。You can now use Matplotlib, a library used to construct visualization of data, to create a plot. 因为必须从本地保存的 averagetime 数据帧中创建绘图,所以代码片段必须以 %%local magic 开头。Because the plot must be created from the locally persisted averagetime dataframe, the code snippet must begin with the %%local magic. 这可确保代码在 Jupyter 服务器上本地运行。This ensures that the code is run locally on the Jupyter server.

    %%local
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    plt.plot(averagetime['Minutes'], averagetime['Time'], marker='o', linestyle='--')
    plt.xlabel('Time (min)')
    plt.ylabel('Average time taken for request (ms)')
    

    应该看到输出类似于下图:You should see an output like the following image:

    apache spark web 日志分析图apache spark web log analysis plot

  9. 运行完应用程序之后,应该关闭笔记本以释放资源。After you have finished running the application, you should shut down the notebook to release the resources. 为此,请在 Notebook 的“文件”菜单中选择“关闭并停止” 。To do so, from the File menu on the notebook, select Close and Halt. 此操作会关闭笔记本。This action will shut down and close the notebook.

后续步骤Next steps

请参阅以下文章:Explore the following articles: