hadoop - How to open/stream .zip files through Spark?

Question


1 Reply

深蓝 · Answer 1 · 2021-10-16T22:31:11+0000

There was no solution with Python code here, and I recently had to read zip files in PySpark. I came across this question while searching for how to do that, so hopefully this will help others.

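import io
import zipfile

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; x[1] holds the raw zip bytes
    in_memory_data = io.BytesIO(x[1])
    # The with-block closes the archive once every member has been read
    with zipfile.ZipFile(in_memory_data, "r") as file_obj:
        # Map each member name inside the zip to that member's bytes
        return {name: file_obj.read(name) for name in file_obj.namelist()}


# Each zip is read whole as a single record, so one archive must fit in memory
zips = sc.binaryFiles("hdfs:/Testing/*.zip")
# collect() pulls every extracted file back to the driver
files_data = zips.map(zip_extract).collect()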

In the code above, I return a dictionary with each filename inside the zip as a key and the contents of that file (as raw bytes) as the value. You can change it however you want to suit your purposes.
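As one example of such a change, here is a minimal sketch that yields decoded text per file instead of a dictionary of bytes, so it can be used with flatMap and the results can stay distributed rather than being collected. The function name iter_zip_text and the UTF-8 default are my own assumptions, not part of the original answer, and the Spark lines are left as comments because they need a live SparkContext.

```python
import io
import zipfile


def iter_zip_text(pair, encoding="utf-8"):
    """Yield (member_name, decoded_text) for each file in one zip archive.

    `pair` is a (path, content) tuple, as produced by sc.binaryFiles.
    The encoding is an assumption; change it to match your data.
    """
    _path, content = pair
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no file data
            yield info.filename, archive.read(info).decode(encoding)


# Hypothetical usage on a cluster (requires an existing SparkContext `sc`):
# texts = sc.binaryFiles("hdfs:/Testing/*.zip").flatMap(iter_zip_text)
# texts.take(5)
```

With flatMap, each (filename, text) pair becomes its own record, so you can keep filtering or parsing on the executors and only bring back what you actually need.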
