I am going through Hadoop: The Definitive Guide, which explains input splits clearly.
It says:
Input splits don't contain the actual data; rather, they hold the storage
locations of the data on HDFS
and
Usually, the size of an input split is the same as the block size
1) Let's say a 64 MB block is on node A and replicated on two other nodes (B, C), and the input split size for the MapReduce program is 64 MB. Will this split hold only the location for node A, or will it hold locations for all three nodes A, B, and C?
2) Since the data is local to all three nodes, how does the framework decide (pick) which node a map task runs on?
3) How is it handled if the input split size is greater or smaller than the block size?
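
To make question 3 concrete, here is a minimal sketch in plain Python (not Hadoop code) of which blocks each split would overlap for a given split size. The function name `blocks_touched_by_splits` and the file and block sizes are hypothetical, and it assumes splits are simply consecutive byte ranges of the file; how Hadoop actually forms splits and chooses their locations is exactly what is being asked.

```python
MB = 1024 * 1024

def blocks_touched_by_splits(file_size, block_size, split_size):
    """For each consecutive split, list the indices of the blocks it overlaps.

    Assumption: splits are back-to-back byte ranges of length split_size
    (the last one may be shorter). This is an illustration, not Hadoop's logic.
    """
    result = []
    offset = 0
    while offset < file_size:
        end = min(offset + split_size, file_size)
        first_block = offset // block_size
        last_block = (end - 1) // block_size  # block holding the split's last byte
        result.append(list(range(first_block, last_block + 1)))
        offset = end
    return result

# Split size equal to, smaller than, and larger than a 64 MB block
print(blocks_touched_by_splits(128 * MB, 64 * MB, 64 * MB))
print(blocks_touched_by_splits(128 * MB, 64 * MB, 32 * MB))
print(blocks_touched_by_splits(128 * MB, 64 * MB, 96 * MB))
```

With a 96 MB split, the first split spans two blocks, which may live on different sets of nodes, so part of the data might not be local to whichever node runs that map task.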