python - Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to pass a list of files to be loaded into a dataframe without it breaking when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as it finds into the dataframe and return the result without complaining. Is this possible?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:46:44+0000

Yes, it's possible if you specify the input as a Hadoop glob pattern instead of a list of paths, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)

You can read more about glob patterns in the Hadoop javadoc.
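If the list of days is only known at run time, the glob string can be assembled programmatically. The following is a minimal sketch, assuming all days fall in the same year and month and that the layout is base/YYYY/MM/DD/data.parquet as in the question. The helper name day_glob and its parameters are hypothetical, not part of Spark or Hadoop.

```python
def day_glob(base, year, month, days, filename="data.parquet"):
    """Build a Hadoop glob that matches the given days of a single month.

    Hypothetical helper: assumes the layout base/YYYY/MM/DD/filename.
    """
    # Zero-pad each day, drop duplicates, and join into a {a,b,...} alternation.
    alternatives = ",".join(f"{d:02d}" for d in sorted(set(days)))
    return f"{base}/{year:04d}/{month:02d}/{{{alternatives}}}/{filename}"


pattern = day_glob("s3a://dev", 2017, 1, [3, 2])
# With a live SparkSession named `session` (as in the question), one would
# then call: df = session.read.parquet(pattern)
```

Here pattern has the same form as the hand-written glob above, so only the days that actually exist on S3 would be picked up.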

However, in my opinion this isn't an elegant way of working with data partitioned by time (by day, in your case). If you are able to rename the directories like this:

s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet

then you can take advantage of Spark's partition discovery and read the data with:

from pyspark.sql.functions import col

df = (session.read.parquet('s3a://dev/')
      .where(col('day').between('2017-01-02', '2017-01-03')))

This approach also skips empty or non-existent directories. An additional column, day, will appear in your dataframe (it will be a string in Spark < 2.1.0 and a date type in Spark >= 2.1.0), so you will know which directory each record came from.
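The renaming described above is a mechanical path transformation. As a rough sketch, assuming every existing path ends in /YYYY/MM/DD/<file>, the target path could be computed as follows. The function name to_partitioned_path is hypothetical, and the code only computes the new path string; actually moving objects on S3 is a separate copy-and-delete step that is not shown.

```python
import re

# Matches <base>/YYYY/MM/DD/<file>; the layout is assumed from the question.
_DATED_PATH = re.compile(r"(.*)/(\d{4})/(\d{2})/(\d{2})/([^/]+)")


def to_partitioned_path(path):
    """Map base/YYYY/MM/DD/file to base/day=YYYY-MM-DD/file (hypothetical helper)."""
    match = _DATED_PATH.fullmatch(path)
    if match is None:
        raise ValueError(f"path does not follow the expected layout: {path}")
    base, year, month, day, name = match.groups()
    return f"{base}/day={year}-{month}-{day}/{name}"


new_path = to_partitioned_path("s3a://dev/2017/01/03/data.parquet")
```

Paths that do not match the assumed layout raise a ValueError rather than being silently rewritten.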
