I have a bunch of small files in an HDFS directory. Although the total volume of the files is relatively small, the processing time per file is huge. That is, a single 64 MB file, which is the default split size for TextInputFormat, can take several hours to process.
What I need to do is reduce the split size so that I can utilize more nodes for a job.
So the question is: how can I split the files into chunks of, let's say, 10 KB? Do I need to implement my own InputFormat and RecordReader for this, or is there a parameter I can set? Thanks.
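
To make the question concrete, here is a rough sketch of how I understand split sizing to work in the newer API (org.apache.hadoop.mapreduce.lib.input.FileInputFormat): the split size is max(minSize, min(maxSize, blockSize)), where maxSize comes from FileInputFormat.setMaxInputSplitSize(job, bytes) or the mapreduce.input.fileinputformat.split.maxsize property. The class below is a standalone illustration and not Hadoop code; the class and method names are made up, the 1.1 slop factor is my reading of the Hadoop source, and the older mapred API may behave differently.

```java
public class SplitSizeSketch {
    // Assumed slop factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Assumed formula: splitSize = max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts how many splits a single file of the given length would produce.
    static long countSplits(long fileLength, long splitSize) {
        long splits = 0;
        long remaining = fileLength;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        // Whatever is left becomes the last split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB block
        long minSize = 1L;                  // assumed minimum split size
        long maxSize = 10L * 1024;          // desired 10 KB maximum split size

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("split size in bytes: " + splitSize);
        System.out.println("splits for one 64 MB file: " + countSplits(blockSize, splitSize));
    }
}
```

If this is right, capping the maximum split size at 10 KB would be enough and no custom InputFormat or RecordReader would be needed, but I would like confirmation.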