Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


scala - Spark filtering based on matches in two Arrays in RDDs

I have an RDD of words, and another RDD whose elements contain a string. Whenever a word in that string matches a word in the first RDD, it should be removed from the string.

val wordList = sc.textFile("wordList.txt").map(x => x.split(',')).map(x => x(0))

Sample of wordList:

res15: Array[String] = Array(basetting, choosinesses, concavenesses, crabbinesses, cupidinously, falliblenesses, fleecinesses, hackishes, immaterialnesses, impiousnesses)

Then I have my other RDD:

val filterWord = posts.map(x => (x._1, x._2.split(" ").filter(x => x != (wordList))))

Sample filterWord:

res16: Array[(String, Array[String])] = Array((6,Array(how, sweet, is, it, that, we, have)), (2,Array("")), (2,Array(will, this, question, cause, an, error)), (2,Array("")), (4,Array(how, do, we, create, a, new, tag, in)), (7,Array("")), (2,Array(test, after, clr, on)), (2,Array("")), (2,Array(testing, a, long, tag)), (2,Array("")))

I need filterWord to contain only the words that are not in wordList, but this does not work: none of the words in wordList are filtered out, and if I change != to ==, everything is filtered out.


1 Reply


This removes any post that contains any of the words in wordList. That may or may not be what you want, so please clarify your question.

Spark setup:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)

Test data:

val jabberwocky = """
Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe
"""
val words = "the and in all were"

Convert the test data to RDDs.

val posts = sc.parallelize(jabberwocky.split('\n')
                                      .filter(_.nonEmpty)
                                      .zipWithIndex
                                      .map(_.swap))

val wordList = sc.parallelize(words.split(' ')).map(x => (x.toLowerCase(), x))

Make a pair RDD with one row for each word in each post. The key is the lower-cased word, and the value is the original post.

val postsPairs = posts.flatMap {
  case (i, s) => s.split("\\W+").map(w => (w.toLowerCase(), (i, s)))
}
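To see the shape of these rows without a Spark context, here is a minimal sketch that applies the same flatMap to a plain Scala Seq. The object name PostsPairsSketch and the two sample lines are only for illustration. One thing it shows: an indented line yields an empty-string token as its first key, because the line starts with a delimiter match. That should be harmless for the join as long as wordList contains no empty entry.

```scala
object PostsPairsSketch {
  // Two (index, line) posts copied from the test data; a Seq stands in for the RDD.
  val posts: Seq[(Int, String)] = Seq(
    (0, "Twas brillig, and the slithy toves"),
    (1, "      Did gyre and gimble in the wabe:")
  )

  // Same shape as postsPairs: one (word, (index, line)) row per token.
  val pairs: Seq[(String, (Int, String))] = posts.flatMap {
    case (i, s) => s.split("\\W+").map(w => (w.toLowerCase, (i, s)))
  }

  // Print every row so the keys and values can be inspected.
  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```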

Find all the posts that DO have one of the excluded words:

  val withExcluded = postsPairs.join(wordList).map(_._2._1)

(A .distinct could be applied here, but there is no point: the duplicates do not matter for the next step.)

Remove from the original list all the posts that have one of the excluded words, so that the remaining posts have none of them, which is what was wanted.

  val res = posts.subtract(withExcluded)

  // (19,      He went galumphing back.)
  // (22,O frabjous day! Callooh! Callay!”)
  // (21,      Come to my arms, my beamish boy!)
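If the goal is instead what the question literally describes (keep every post, but strip the listed words from it), a different approach is probably simpler. In the question's code, x != (wordList) compares each word with the RDD object itself, which is never equal to a string, so nothing is removed; with == nothing ever matches, so everything is removed. The sketch below assumes the word list is small enough to collect to the driver, and broadcasts it as a Set so each word can be checked with contains. The names WordLevelFilter and keepUnlisted and the two sample posts are hypothetical, not taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordLevelFilter {
  // Keep the tokens of `text` that are not in `excluded`.
  // Matching is case-insensitive; `excluded` is assumed to hold lower-case words.
  def keepUnlisted(text: String, excluded: Set[String]): Array[String] =
    text.split("\\W+").filter(w => w.nonEmpty && !excluded.contains(w.toLowerCase))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-level-filter").setMaster("local")
    val sc = SparkContext.getOrCreate(conf)

    // Stand-ins for the question's data (hypothetical values).
    val posts = sc.parallelize(Seq(
      ("6", "how sweet is it that we have"),
      ("2", "testing a long tag")))
    val wordList = sc.parallelize(Seq("sweet", "long"))

    // Assumption: the word list fits in driver memory, so it can be
    // collected once and shipped to the executors as a broadcast Set.
    val excluded = sc.broadcast(wordList.map(_.toLowerCase).collect().toSet)

    // Keep each post's key; drop only the listed words from its text.
    val filterWord = posts.map { case (id, text) =>
      (id, keepUnlisted(text, excluded.value))
    }

    filterWord.collect().foreach { case (id, ws) =>
      println(s"$id: ${ws.mkString(" ")}")
    }
    sc.stop()
  }
}
```

If the word list is too large to collect, the join-based approach above is the safer starting point, since it never moves the whole list to one machine.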
