Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
536 views
in Technique[技术] by (71.8m points)

compression - Why can't hadoop split up a large text file and then compress the splits using gzip?

I've recently been looking into hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Except it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.

In my scenario, each split is compressed completely independently. There's no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076, note that this is not what I'd want.

This seems pretty basic... what am I missing? Why couldn't this be done? Or if it could be done, why have the hadoop developers not looked down this route? It seems strange given how much discussion I've found regarding people wanting split gzip'd files in HDFS.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The simple reason is the design principle of "separation of concerns".

If you do what you propose then HDFS must know what the actual bits and bytes of the file mean. Also HDFS must be made able to reason about it (i.e. extract, decompress, etc.). In general you don't want this kind of mixing up responsibilities in software.

So the 'only' part that is to understand what the bits mean is the application that must be able to read it: which is commonly written using the MapReduce part of Hadoop.

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):

Always remember that there are alternative approaches:

HTH


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...