Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

scala - Map key, value pair based on similarity of their value in Spark

I have been learning Spark for several weeks. I am currently trying to group items or people based on their connections, using Spark and Hadoop in Scala. For example, I want to see how football players are connected based on their club history. My "players" RDD would be:

(John, FC Sion)
(Mike, FC Sion)
(Bobby, PSV Eindhoven)
(Hans, FC Sion)

I want to end up with an RDD like this:

(John, <Mike, Hans>)
(Mike, <John, Hans>)
(Bobby, <>)
(Hans, <Mike, John>)

I plan to use map to accomplish this.

val splitClubs = players.map(player=> (player._1, parseTeammates(player._2, players)))

Here parseTeammates is a function that finds the players who also play for the same club (player._2).

// RDD is not a type, how can I insert rdd into a function?
def parseTeammates(club: String, rdd: RDD) : List[String] = {
    // will generate a list of players that contains same "club" value
    val playerList = rdd.filter(_._1 == club)
    return playerList.values;
}

I get a compilation error (type mismatch), since the function is expected to return List[String] but playerList.values instead returns org.apache.spark.rdd.RDD[List[String]]. Can anybody help me get the value of an RDD in its simple form (in my case, List[String])?

Also, I think there is a more elegant way to solve this problem than creating a separate RDD, finding a certain key in it, and returning the value as a list.

1 Reply

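I think your parseTeammates approach is a little off in the world of RDDs. With RDDs and potentially very large amounts of data, you don't want this kind of nested looping, where every player triggers another scan of the whole data set. Spark also does not support using one RDD inside a transformation of another, so calling rdd.filter from within players.map would fail at runtime even if it compiled. Try instead to reorganize your data.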
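
On the literal question of the parameter type and of getting a List[String] out of an RDD: RDD needs a type parameter, and an action such as collect() copies the results to the driver as a local array. Below is a minimal sketch, assuming players is an RDD[(String, String)] of (player, club) pairs. The name playersOfClub, the local master, and the app name are made up for illustration. It only works when called on the driver, and collect() is only reasonable when the result is small.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectTeammates {
  // Hypothetical helper: RDD takes a type parameter; each element is (player, club).
  def playersOfClub(club: String, rdd: RDD[(String, String)]): List[String] =
    rdd.filter { case (_, c) => c == club } // compare the club, i.e. the second field
      .keys                                  // keep only the player names
      .collect()                             // action: copies the results to the driver
      .toList

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[*]").setAppName("collect-teammates")
    val sc = new SparkContext(conf)
    try {
      val players = sc.parallelize(Seq(
        ("John", "FC Sion"), ("Mike", "FC Sion"),
        ("Bobby", "PSV Eindhoven"), ("Hans", "FC Sion")))
      println(playersOfClub("FC Sion", players))
    } finally sc.stop()
  }
}
```

This fetches one club at a time with a full pass over the data, so it does not scale to computing the teammates of every player.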

The code below will get you what you want:

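players.map { case (player, club) => (club, List(player)) } // key each player by club
  .reduceByKey(_ ++ _) // merge the player lists of each club
  .flatMap { case (_, list) => // emit (player, teammates) for every player of the club
    list.zipWithIndex.map { case (player, index) => (player, list.take(index) ++ list.drop(index + 1)) }
  }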

Note that I first organize the data by club, and only afterwards combine the players of each club to yield the result in the format you are looking for. A player with no teammates, such as Bobby, ends up with an empty list.
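
To check the logic without a cluster, the same steps can be traced on the sample data from the question using plain Scala collections. This is only a local sketch: groupBy stands in for map plus reduceByKey, and the object and value names are made up for illustration.

```scala
object TeammatesLocal {
  // Sample data from the question: (player, club).
  val players: List[(String, String)] = List(
    ("John", "FC Sion"),
    ("Mike", "FC Sion"),
    ("Bobby", "PSV Eindhoven"),
    ("Hans", "FC Sion")
  )

  // Local stand-in for map + reduceByKey: club -> players of that club.
  val byClub: Map[String, List[String]] =
    players.groupBy(_._2).map { case (club, pairs) => (club, pairs.map(_._1)) }

  // Same step as the flatMap: pair each player with everyone else in the club.
  val teammates: Map[String, List[String]] =
    byClub.values.flatMap { list =>
      list.zipWithIndex.map { case (player, index) =>
        (player, list.take(index) ++ list.drop(index + 1))
      }
    }.toMap

  def main(args: Array[String]): Unit =
    teammates.toList.sortBy(_._1).foreach(println)
  // prints:
  // (Bobby,List())
  // (Hans,List(John, Mike))
  // (John,List(Mike, Hans))
  // (Mike,List(John, Hans))
}
```

The order inside each teammate list follows the input order here. In Spark, reduceByKey gives no such ordering guarantee across partitions, so treat the lists as unordered.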

I hope this helps.

