The split size is calculated by the formula:
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
In your case it will be:
split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB
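For reference, the same rule can be reproduced in a few lines of Java. This is only an illustrative sketch of the arithmetic (the class name and variables are placeholders), using the 128 MB minimum split size and 64 MB block size from your case:

```java
// Minimal sketch of the split-size rule: max(minSize, min(maxSize, blockSize)).
// The 128 MB / 64 MB values mirror the example above and are illustrative only.
public class SplitSizeDemo {

    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long minSize   = 128L * 1024 * 1024;   // mapred.min.split.size = 128 MB
        long maxSize   = Long.MAX_VALUE;       // mapred.max.split.size (default)
        long blockSize = 64L  * 1024 * 1024;   // dfs.block.size = 64 MB

        // Prints 134217728 (128 MB): each split spans two 64 MB HDFS blocks.
        System.out.println(computeSplitSize(minSize, maxSize, blockSize));
    }
}
```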
So, from the above:
Each map will process 2 HDFS blocks (assuming each block is 64 MB): True
My input file (already stored in HDFS) will be physically re-divided into 128 MB blocks in HDFS: False. A split is only a logical division of the input; the file's existing 64 MB blocks on HDFS are left untouched.
Note, however, that making the minimum split size greater than the block size increases the split size, but at the cost of data locality: a single map task may then have to read blocks that live on other nodes.
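If you still want larger splits despite that trade-off, the minimum split size is normally raised per job. A minimal sketch, assuming the classic property name used above (the job name is a placeholder; on newer Hadoop releases the equivalent property is mapreduce.input.fileinputformat.split.minsize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Raise the minimum split size to 128 MB, overriding the 64 MB block size.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-size-demo"); // placeholder job name
        // ... set mapper/reducer classes, input/output paths, then submit the job.
    }
}
```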