I have many gzipped files stored on S3, organized by project and by hour within each day. The file paths follow this pattern:
s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz
Since the data is analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble their contents into a single RDD.
There are probably several ways to do this, but I would like to know the best practice for Spark.
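For context, the simplest approach I can think of is a single textFile call with a wildcard path, roughly like the sketch below (the bucket name is a placeholder, and I'm assuming textFile transparently decompresses .gz files and accepts glob patterns):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("daily-logs")
val sc = new SparkContext(conf)

// All 24 hourly files for one day under logtype1, read as a single RDD of lines.
// textFile handles gzip decompression automatically and expands the * glob.
val day = "20141201"
val path = s"s3://<bucket>/project1/$day/logtype1/logtype1.*.gz"
val lines = sc.textFile(path)

println(lines.count())
```

An alternative would be to pass a comma-separated list of the 24 hourly paths to textFile instead of a glob, but I'm not sure which is preferable.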
Thanks in advance.