Generate movie recommendations using Apache Mahout with Apache Hadoop in HDInsight (SSH)

Learn how to use the Apache Mahout machine learning library with Azure HDInsight to generate movie recommendations.

Mahout is a machine learning library for Apache Hadoop. It contains algorithms for processing data, such as filtering, classification, and clustering. In this article, you use a recommendation engine to generate movie recommendations that are based on movies your friends have seen.

Prerequisites

Apache Mahout versioning

For more information about the version of Mahout in HDInsight, see HDInsight versions and Apache Hadoop components.

Understanding recommendations

One of the functions Mahout provides is a recommendation engine. This engine accepts data in the format userID, itemId, prefValue (the user's preference for the item). Mahout can then perform co-occurrence analysis to determine that users who have a preference for one item also have a preference for certain other items. Mahout then identifies users with similar item preferences, which can be used to make recommendations.

The following workflow is a simplified example that uses movie data:

  • Co-occurrence: Joe, Alice, and Bob all liked Star Wars, The Empire Strikes Back, and Return of the Jedi. Mahout determines that users who like any one of these movies also like the other two.

  • Co-occurrence: Bob and Alice also liked The Phantom Menace, Attack of the Clones, and Revenge of the Sith. Mahout determines that users who liked the previous three movies also like these three movies.

  • Similarity recommendation: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked but that Joe has not yet watched (liked or rated). In this case, Mahout recommends The Phantom Menace, Attack of the Clones, and Revenge of the Sith.
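
The workflow above can be sketched in a few lines of Python. This is a toy illustration of co-occurrence counting, not Mahout's implementation: the `likes` dictionary and the function names `cooccurrence_counts` and `recommend` are hypothetical, and the scoring (a plain sum of co-occurrence counts) is a simplifying assumption.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy data mirroring the workflow: user -> movies the user liked.
ORIGINALS = {"Star Wars", "The Empire Strikes Back", "Return of the Jedi"}
PREQUELS = {"The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"}
likes = {
    "Joe": set(ORIGINALS),
    "Alice": ORIGINALS | PREQUELS,
    "Bob": ORIGINALS | PREQUELS,
}


def cooccurrence_counts(likes):
    """For each ordered pair of movies, count the users who liked both."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in likes.values():
        for a, b in permutations(movies, 2):
            counts[a][b] += 1
    return counts


def recommend(user, likes, counts):
    """Score each unseen movie by its co-occurrence with the user's liked movies."""
    seen = likes[user]
    scores = defaultdict(int)
    for movie in seen:
        for other, n in counts[movie].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


counts = cooccurrence_counts(likes)
for title, score in recommend("Joe", likes, counts):
    print("%s: %d" % (title, score))
```

In this toy data set, the only movies Joe has not seen are the three that Alice and Bob also liked, so those are the ones the sketch returns.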

Understanding the data

Conveniently, GroupLens Research provides rating data for movies in a format that is compatible with Mahout. This data is available on your cluster's default storage at /HdiSamples/HdiSamples/MahoutMovieData.

There are two files, moviedb.txt and user-ratings.txt. The user-ratings.txt file is used during analysis. The moviedb.txt file provides user-friendly text information when viewing the results.

The data in user-ratings.txt has the structure userID, movieID, userRating, timestamp, which indicates how highly each user rated a movie. Here is an example of the data:

196    242    3    881250949
186    302    3    891717742
22    377    1    878887116
244    51    2    880606923
166    346    1    886397596
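
As a rough sketch of how such lines map onto the userID, itemId, prefValue format described earlier, the following Python snippet parses a few of the sample rows. The `Rating` tuple and `parse_rating` function are hypothetical names, and treating every field as an integer is an assumption based only on the sample shown here.

```python
from collections import namedtuple

# Hypothetical record type for one line of user-ratings.txt.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating(line):
    """Split one whitespace-delimited line into four integer fields."""
    user_id, movie_id, rating, timestamp = line.split()
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


sample = """196    242    3    881250949
186    302    3    891717742
22    377    1    878887116"""

ratings = [parse_rating(line) for line in sample.splitlines()]

# The first three fields correspond to userID, itemId, and prefValue.
triples = [(r.user_id, r.movie_id, r.rating) for r in ratings]
for triple in triples:
    print("%d,%d,%d" % triple)
```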

Run the analysis

  1. Use the ssh command to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. Use the following command to run the recommendation job:

mahout recommenditembased -s SIMILARITY_COOCCURRENCE -i /HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt -o /example/data/mahoutout --tempDir /temp/mahouttemp

Note

The job may take several minutes to complete, and may run multiple MapReduce jobs.

View the output

  1. Once the job completes, use the following command to view the generated output:

    hdfs dfs -text /example/data/mahoutout/part-r-00000
    

    The output appears as follows:

    1    [234:5.0,347:5.0,237:5.0,47:5.0,282:5.0,275:5.0,88:5.0,515:5.0,514:5.0,121:5.0]
    2    [282:5.0,210:5.0,237:5.0,234:5.0,347:5.0,121:5.0,258:5.0,515:5.0,462:5.0,79:5.0]
    3    [284:5.0,285:4.828125,508:4.7543354,845:4.75,319:4.705128,124:4.7045455,150:4.6938777,311:4.6769233,248:4.65625,272:4.649266]
    4    [690:5.0,12:5.0,234:5.0,275:5.0,121:5.0,255:5.0,237:5.0,895:5.0,282:5.0,117:5.0]
    

    The first column is the userID. The values contained in '[' and ']' are movieId:recommendationScore pairs.

  2. You can use the output, along with moviedb.txt, to provide more information about the recommendations. First, copy the files locally using the following commands:

    hdfs dfs -get /example/data/mahoutout/part-r-00000 recommendations.txt
    hdfs dfs -get /HdiSamples/HdiSamples/MahoutMovieData/* .
    

    These commands copy the output data to a file named recommendations.txt in the current directory, along with the movie data files.

  3. Use the following command to create a Python script that looks up movie names for the data in the recommendations output:

    nano show_recommendations.py
    

    When the editor opens, use the following text as the contents of the file:

    #!/usr/bin/env python
    # Print the movies a user has rated and the movies recommended for that user.

    import sys

    if len(sys.argv) != 5:
        print("Arguments: userId userDataFilename movieFilename recommendationFilename")
        sys.exit(1)

    userId, userDataFilename, movieFilename, recommendationFilename = sys.argv[1:]

    # Map each movie ID to the remaining '|'-delimited fields (the title comes first).
    print("Reading Movies Descriptions")
    movieById = {}
    with open(movieFilename) as movieFile:
        for line in movieFile:
            tokens = line.split("|")
            movieById[tokens[0]] = tokens[1:]

    # Collect (movieId, rating) pairs for the requested user.
    print("Reading Rated Movies")
    ratedMovieIds = []
    with open(userDataFilename) as userDataFile:
        for line in userDataFile:
            tokens = line.split("\t")
            if tokens[0] == userId:
                ratedMovieIds.append((tokens[1], tokens[2]))

    # Find the user's line and split "[id:score,...]" into [id, score] pairs.
    print("Reading Recommendations")
    recommendations = []
    with open(recommendationFilename) as recommendationFile:
        for line in recommendationFile:
            tokens = line.split("\t")
            if tokens[0] == userId:
                movieIdAndScores = tokens[1].strip("[]\n").split(",")
                recommendations = [movieIdAndScore.split(":") for movieIdAndScore in movieIdAndScores]
                break

    print("Rated Movies")
    print("------------------------")
    for movieId, rating in ratedMovieIds:
        print("%s, rating=%s" % (movieById[movieId][0], rating))
    print("------------------------")

    print("Recommended Movies")
    print("------------------------")
    for movieId, score in recommendations:
        print("%s, score=%s" % (movieById[movieId][0], score))
    print("------------------------")
    

    Press Ctrl-X, Y, and finally Enter to save the data.

  4. Run the Python script. The following command assumes you are in the directory where all the files were downloaded:

    python show_recommendations.py 4 user-ratings.txt moviedb.txt recommendations.txt
    

    This command looks at the recommendations generated for user ID 4.

    • The user-ratings.txt file is used to retrieve the movies that the user has rated.

    • The moviedb.txt file is used to retrieve the names of the movies.

    • The recommendations.txt file is used to retrieve the movie recommendations for this user.

      The output from this command is similar to the following text:

      Seven Years in Tibet (1997), score=5.0
      Indiana Jones and the Last Crusade (1989), score=5.0
      Jaws (1975), score=5.0
      Sense and Sensibility (1995), score=5.0
      Independence Day (ID4) (1996), score=5.0
      My Best Friend's Wedding (1997), score=5.0
      Jerry Maguire (1996), score=5.0
      Scream 2 (1997), score=5.0
      Time to Kill, A (1996), score=5.0
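
The script above keeps each score as text. If you want to filter or rank recommendations numerically, a minimal sketch along the following lines may help. The `parse_recommendations` function and the 4.8 threshold are hypothetical, the sample line is a shortened copy of the user 3 output shown earlier, and the tab separator is assumed to match what the script expects.

```python
def parse_recommendations(line):
    """Parse 'userID<TAB>[movieId:score,...]' into (user_id, [(movie_id, score), ...])."""
    user_id, raw = line.rstrip("\n").split("\t")
    pairs = [item.split(":") for item in raw.strip("[]").split(",")]
    return int(user_id), [(int(movie_id), float(score)) for movie_id, score in pairs]


# Shortened copy of the sample output line for user 3.
line = "3\t[284:5.0,285:4.828125,508:4.7543354,845:4.75]"
user_id, recs = parse_recommendations(line)

# Keep only recommendations at or above an arbitrary example threshold.
strong = [movie_id for movie_id, score in recs if score >= 4.8]
print("user %d: %s" % (user_id, strong))
```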
      

Delete temporary data

Mahout jobs do not remove temporary data that is created while processing the job. The --tempDir parameter is specified in the example job to isolate the temporary files in a specific path for easy deletion. To remove the temporary files, use the following command:

hdfs dfs -rm -f -r /temp/mahouttemp

Warning

If you want to run the command again, you must also delete the output directory. Use the following command to delete this directory:

hdfs dfs -rm -f -r /example/data/mahoutout

Next steps

Now that you have learned how to use Mahout, discover other ways of working with data on HDInsight: