Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
473 views
in Technique[技术] by (71.8m points)

hadoop - MapReduce job Output sort order

i can see in my mapreduce jobs that the output of the reducer part is sorted by key ..

so if i have set number of reducers to 10, the output directory would contain 10 files and each of those output files have a sorted data.

the reason i am putting it here is that even though all the files have sorted data but these files itself are not sorted.. for example : there are scenarios where the part-000* files have started from 0 and end at zzzz assuming i am using Text as the key.

i was assumming that the file's should be sorted even within the files i.e file 1 should have a and the last file part--00009 should have entries with zzzz or atleaset > a

assuming if i have all the alphabets uniformally distributed keys.

could someone throw some light why such a behavior

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can achieve a globally sorted file (which is what you basically want) using these methods:

  1. Use just one reducer in mapreduce (bad idea !! This puts too much work on one machine)
  2. Write a custom partitioner. Partioner is the class which divides the key space in mapreduce. The default partioner (Hashpartioner) evenly divides the key space into the number of reducers. Check out this example for writing a custom partioner.

  3. Use Hadoop Pig/Hive to do sort.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...