Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
275 views
in Technique[技术] by (71.8m points)

python - How does the pyspark mapPartitions function work?

So I am trying to learn Spark using Python (Pyspark). I want to know how the function mapPartitions work. That is what Input it takes and what Output it gives. I couldn't find any proper example from the internet. Lets say, I have an RDD object containing lists, such as below.

[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ] 

And I want to remove element 2 from all the lists, how would I achieve that using mapPartitions.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

mapPartition should be thought of as a map operation over partitions and not over the elements of the partition. It's input is the set of current partitions its output will be another set of partitions.

The function you pass to map operation must take an individual element of your RDD

The function you pass to mapPartition must take an iterable of your RDD type and return an iterable of some other or the same type.

In your case you probably just want to do something like:

def filter_out_2(line):
    return [x for x in line if x != 2]

filtered_lists = data.map(filterOut2)

If you wanted to use mapPartition it would be:

def filter_out_2_from_partition(list_of_lists):
  final_iterator = []
  for sub_list in list_of_lists:
    final_iterator.append( [x for x in sub_list if x != 2])
  return iter(final_iterator)

filtered_lists = data.mapPartition(filterOut2FromPartion)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...