
java - Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, each of their results will be passed to the reducer, which will combine them and give me the final output.
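
As a concrete sketch of what I have in mind, this is roughly how I picture the job with Hadoop's Java MapReduce API (class names and paths below are just mine, and the per-mapper sorting actually happens in the framework's shuffle rather than in the map code itself):

```java
// Sketch only: emit each integer as the key; Hadoop's shuffle/sort phase
// delivers keys to the reducer already in sorted order.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IntegerSort {

    public static class SortMapper
            extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
        private final IntWritable number = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // One integer per input line; the value itself becomes the key.
            number.set(Integer.parseInt(line.toString().trim()));
            context.write(number, NullWritable.get());
        }
    }

    public static class SortReducer
            extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keys arrive already sorted; write each occurrence back out.
            for (NullWritable ignored : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "integer sort");
        job.setJarByClass(IntegerSort.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1); // single reducer => one totally sorted output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```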

My doubt is: if we have one reducer, how does this leverage the distributed framework, since eventually we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

1 Reply


Check out merge-sort.

It turns out that merging lists that are already sorted is much more efficient, in terms of both operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to repeatedly look at the smallest remaining element of each of the 4 lists and pick that one. If the number of lists is constant, this reduce step is an O(N) operation.
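
As a rough illustration in plain Java (nothing Hadoop-specific, all names here are made up), merging k already-sorted lists can be done by keeping only the current head of each list in a min-heap:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // One entry per source list: current head value, which list it came from,
    // and the position of the next element in that list.
    private record Head(int value, int listIndex, int nextPos) {}

    /** Merges k individually sorted lists into one sorted list. */
    public static List<Integer> merge(List<List<Integer>> sortedLists) {
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparingInt(Head::value));

        // Seed the heap with the first (smallest) element of every list.
        for (int i = 0; i < sortedLists.size(); i++) {
            List<Integer> list = sortedLists.get(i);
            if (!list.isEmpty()) {
                heap.add(new Head(list.get(0), i, 1));
            }
        }

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();          // overall smallest among the current heads
            result.add(smallest.value());
            List<Integer> source = sortedLists.get(smallest.listIndex());
            if (smallest.nextPos() < source.size()) {
                // Advance the cursor of the list the element came from.
                heap.add(new Head(source.get(smallest.nextPos()),
                                  smallest.listIndex(), smallest.nextPos() + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = List.of(
                List.of(1, 5, 9),
                List.of(2, 6, 7),
                List.of(0, 3, 8),
                List.of(4, 10));
        System.out.println(merge(chunks));  // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```

With a heap the merge costs O(N log k); when k is a fixed constant, that is effectively the O(N) behaviour described above.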

Also, typically the reducers are themselves "distributed" in something like a tree, so that work can be parallelized too.
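
To make the tree idea concrete, here is a toy in-memory sketch (again plain Java, not actual Hadoop code): neighbouring sorted runs are merged pairwise, every pair in a round can be processed in parallel, and the number of rounds is only log2 of the number of runs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TreeMerge {

    /** Merges two sorted lists into one sorted list. */
    static List<Integer> mergeTwo(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    /** Repeatedly merges neighbouring pairs until one run remains; each round runs in parallel. */
    static List<Integer> treeMerge(List<List<Integer>> runs) {
        while (runs.size() > 1) {
            final List<List<Integer>> current = runs;
            runs = IntStream.range(0, (current.size() + 1) / 2)
                    .parallel()                               // pairs within a round are independent
                    .mapToObj(p -> {
                        int left = 2 * p, right = 2 * p + 1;
                        return right < current.size()
                                ? mergeTwo(current.get(left), current.get(right))
                                : current.get(left);          // odd run carried to the next round
                    })
                    .collect(Collectors.toList());
        }
        return runs.isEmpty() ? List.of() : runs.get(0);
    }

    public static void main(String[] args) {
        System.out.println(treeMerge(List.of(
                List.of(1, 5, 9), List.of(2, 6), List.of(0, 3, 8), List.of(4, 7))));
        // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```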

