I am having a Zipped file containing multiple text files.
I want to read each of the file and build a List of RDD containining the content of each files.
val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")
will just entire files, but how to iterate through each content of zip and then save the same in RDD using Spark.
I am fine with Scala or Python.
Possible solution in Python with using Spark -
archive = zipfile.ZipFile(archive_path, 'r')
file_paths = zipfile.ZipFile.namelist(archive)
for file_path in file_paths:
urls = file_path.split("/")
urlId = urls[-1].split('_')[0]
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…