Yes, it's possible if you specify the input as a Hadoop glob pattern, for example:
files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)
You can read more about glob patterns in the Hadoop javadoc.
However, in my opinion this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this:
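If the list of days is computed rather than typed by hand, the glob string can be built programmatically. The following is only a minimal sketch: `day_glob` is a hypothetical helper (not part of Spark or Hadoop), and it assumes the `base/YYYY/MM/DD/data.parquet` layout shown above with all days falling in the same month.

```python
def day_glob(base, year, month, days, filename="data.parquet"):
    """Build a Hadoop-style glob matching the given days of one month.

    Hypothetical helper: assumes the layout base/YYYY/MM/DD/filename.
    """
    # Zero-pad each day and join them into a {02,03,...} alternation.
    alternatives = ",".join(f"{d:02d}" for d in sorted(days))
    # Doubled braces emit literal '{' and '}' around the alternation.
    return f"{base}/{year:04d}/{month:02d}/{{{alternatives}}}/{filename}"


files = day_glob("s3a://dev", 2017, 1, [2, 3])
# files can then be passed to session.read.parquet(files) as above
```

For ranges that span several months, one option is to build one such pattern per month and pass them all to `read.parquet`, which accepts multiple paths.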
s3a://dev/2017/01/03/data.parquet
--> s3a://dev/day=2017-01-03/data.parquet
s3a://dev/2017/01/02/data.parquet
--> s3a://dev/day=2017-01-02/data.parquet
then you can take advantage of Spark's partitioning scheme and read the data with:
from pyspark.sql.functions import col
df = (session.read.parquet('s3a://dev/')
      .where(col('day').between('2017-01-02', '2017-01-03')))
This approach also skips empty or non-existing directories. In addition, a day column will appear in your dataframe (it will be a string in Spark < 2.1.0 and a date in Spark >= 2.1.0), so you will know which directory each record came from.
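The directory renaming suggested above can be planned with a small path-mapping function before touching any data. This is a sketch only: `to_partitioned_path` is a hypothetical name, and it assumes every object follows the `.../YYYY/MM/DD/<file>` layout from the question. It only computes the new path; the actual move would be done with whatever S3 tooling you already use.

```python
import re

# Matches <base>/YYYY/MM/DD/<file>, the layout assumed from the question.
_OLD_LAYOUT = re.compile(
    r"^(?P<base>.+?)/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<file>[^/]+)$"
)


def to_partitioned_path(path):
    """Map base/YYYY/MM/DD/file to base/day=YYYY-MM-DD/file."""
    match = _OLD_LAYOUT.match(path)
    if match is None:
        raise ValueError(f"path does not follow the expected layout: {path}")
    return "{base}/day={y}-{m}-{d}/{file}".format(**match.groupdict())


new_path = to_partitioned_path("s3a://dev/2017/01/03/data.parquet")
```

Raising on unexpected paths, rather than passing them through, makes it harder to end up with a half-migrated directory tree by accident.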