当前位置:   article > 正文

Apache Spark 练习六:使用Spark分析音乐专辑数据_音乐专辑数据分析

音乐专辑数据分析

一、源数据

本章所分析的数据来自于Kaggle公开的、人工合成的音乐专辑发行数据(https://www.kaggle.com/datasets/revilrosa/music-label-dataset)。以下,我们只针对albums.csv文件进行分析。该数据具体包括以下字段:

  • id: the album identifier;
  • artist_id: the artist identifier;
  • album_title: the title of the album;
  • genre: the genre of the album. An artist can release albums of different genres;
  • year_of_pub: the year the album was published;
  • num_of_tracks: how many tracks there are in the album (a small number can mean longer tracks);
  • num_of_sales: how many sales the album has made in the first month after the release;
  • rolling_stone_critic: how magazine Rolling Stone has rated the album;
  • mtv_critic: how MTV has rated the album;
  • music_maniac_critic: how review site Music Maniac has rated the album.

二、练习题

0. 数据预处理

以下,我们将csv文件从HDFS中读取进来,并转换为Spark DataFrame格式。

  1. val spark = SparkSession
  2. .builder()
  3. .appName("Albums")
  4. .getOrCreate()
  5. import spark.implicits._
  6. val df = spark.read
  7. .option("header", "true")
  8. .option("inferSchema", "true")
  9. .csv("hdfs:///SparkLearning/albums.csv")

1. 统计各类型专辑的数量

  1. val res = df
  2. .select($"genre")
  3. .groupBy($"genre")
  4. .count()
  5. .orderBy($"count".desc)

2. 统计各类型专辑的销量总数

  1. val res = df
  2. .select($"genre", $"num_of_sales")
  3. .groupBy($"genre")
  4. .sum("num_of_sales")
  5. .withColumnRenamed("sum(num_of_sales)", "total_sales")
  6. .orderBy($"total_sales".desc)

3. 统计近20年每年发行的专辑数量和单曲数量

  1. val res = df
  2. .select("year_of_pub", "num_of_tracks")
  3. .filter($"year_of_pub" >= 2000)
  4. .groupBy($"year_of_pub")
  5. .agg("num_of_tracks" -> "count", "num_of_tracks" -> "sum")
  6. .withColumnRenamed("count(num_of_tracks)", "total_albums")
  7. .withColumnRenamed("sum(num_of_tracks)", "total_tracks")
  8. .orderBy("year_of_pub")

4. 分析总销量前五的专辑类型的各年份销量

  1. val res = df
  2. .select($"genre", $"num_of_sales")
  3. .groupBy("genre")
  4. .sum("num_of_sales")
  5. .withColumnRenamed("sum(num_of_sales)", "total_sales")
  6. .orderBy($"total_sales".desc)
  7. .limit(5)
  8. .alias("t1")
  9. .join(
  10. df.select($"genre", $"num_of_sales", $"year_of_pub").alias("t2"),
  11. $"t1.genre" === $"t2.genre"
  12. )
  13. .groupBy("t2.genre", "t2.year_of_pub")
  14. .sum("t2.num_of_sales")
  15. .orderBy($"genre", $"year_of_pub")

5. 分析总销量前五的专辑类型,在不同评分体系中的平均评分

  1. val res = df
  2. .select($"genre", $"num_of_sales")
  3. .groupBy("genre")
  4. .sum("num_of_sales")
  5. .withColumnRenamed("sum(num_of_sales)", "total_sales")
  6. .orderBy($"total_sales".desc)
  7. .limit(5)
  8. .alias("t1")
  9. .join(
  10. df.select(
  11. $"genre",
  12. $"rolling_stone_critic",
  13. $"mtv_critic",
  14. $"music_maniac_critic"
  15. ).alias("t2"),
  16. $"t1.genre" === $"t2.genre"
  17. )
  18. .groupBy("t2.genre")
  19. .agg(
  20. "rolling_stone_critic" -> "avg",
  21. "mtv_critic" -> "avg",
  22. "music_maniac_critic" -> "avg"
  23. )
  24. .orderBy($"genre")

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/码创造者/article/detail/743366
推荐阅读
相关标签
  

闽ICP备14008679号