I have been learning Spark for several weeks. Currently I am trying to group items or people based on their connections, using Spark and Hadoop in Scala. For example, I want to see how football players are connected based on their club history. My "players" RDD would be:
(John, FC Sion)
(Mike, FC Sion)
(Bobby, PSV Eindhoven)
(Hans, FC Sion)
I want to have rdd like this:
(John, <Mike, Hans>)
(Mike, <John, Hans>)
(Bobby, <>)
(Hans, <Mike, John>)
I plan to use map to accomplish this.
val splitClubs = players.map(player=> (player._1, parseTeammates(player._2, players)))
Where parseTeammates is a function that finds the players who play for the same club (player._2):
// RDD alone is not a type -- how do I declare an RDD parameter in a function?
def parseTeammates(club: String, rdd: RDD[(String, String)]): List[String] = {
  // collect the players whose club matches the given "club" value
  val playerList = rdd.filter(_._2 == club)
  playerList.values
}
I get a compilation error: type mismatch, since the function is declared to return List[String] but playerList.values is an org.apache.spark.rdd.RDD[String]. Can anybody help me get the value of an RDD in its plain form (in my case, List[String])?
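For context on why this fails even beyond the type mismatch: Spark does not allow one RDD to be referenced inside a transformation on another RDD, so `players.map(... parseTeammates(..., players))` would fail at runtime regardless. One (inefficient, small-data-only) way to make the types line up is to `collect()` the RDD to the driver first and do the lookup on an ordinary collection. Here is a sketch of that logic using a plain local `Seq` so it is self-contained; the names are illustrative, and on real data you would pass in the result of `players.collect()`:

```scala
// Local stand-in for players.collect(): an ordinary Scala collection.
val players = Seq(
  ("John", "FC Sion"),
  ("Mike", "FC Sion"),
  ("Bobby", "PSV Eindhoven"),
  ("Hans", "FC Sion")
)

// Lookup on a plain Seq, so the result really is a List[String].
// Also excludes the player himself from his own teammate list.
def parseTeammates(name: String, club: String,
                   roster: Seq[(String, String)]): List[String] =
  roster.collect { case (p, c) if c == club && p != name => p }.toList

val teammates = players.map { case (name, club) =>
  (name, parseTeammates(name, club, players))
}
println(teammates)
// List((John,List(Mike, Hans)), (Mike,List(John, Hans)),
//      (Bobby,List()), (Hans,List(Mike, John)))
```

Note this pulls the whole dataset onto the driver, so it only works when the roster fits in memory there.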
Also, I suspect there is a more elegant way to solve this than creating a separate RDD, finding a certain key in it, and then returning its values as a list.
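One such approach (a sketch, not necessarily the best one) is to turn the problem around: key each record by club, group by that key, and then emit each player paired with the other members of his group. Below the logic is shown on plain Scala collections so it runs standalone; the RDD version would follow the same shape, roughly `players.map(_.swap).groupByKey.flatMap(...)`:

```scala
val players = Seq(
  ("John", "FC Sion"),
  ("Mike", "FC Sion"),
  ("Bobby", "PSV Eindhoven"),
  ("Hans", "FC Sion")
)

val teammates = players
  .map { case (name, club) => (club, name) } // key by club
  .groupBy(_._1)                             // one group per club
  .values
  .flatMap { group =>
    val names = group.map(_._2)
    // pair each player with every other player in the same club
    names.map(n => (n, names.filterNot(_ == n).toList))
  }
  .toList
```

This avoids any nested lookup: each club's roster is assembled once, and every player's teammate list falls out of a single pass over the groups.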