I have monthly revenue data for the last 5 years, and I store the DataFrame for each month in parquet format in append mode, partitioned by the month column. Here is the pseudo-code below -
def Revenue(filename):
    df = spark.read.format('csv').load(filename)
    # ... intermediate transformations ...
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')
The df gets stored in parquet format on a monthly basis, one partition sub-folder per month, as sketched below -
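With partitionBy('month'), Spark writes one sub-directory per distinct month value, so the output directory should look roughly like this (a sketch, assuming month holds first-of-month dates such as 2015-01-01, consistent with the filter used further below; the part-file names are illustrative):

/path/Revenue/
    month=2015-01-01/
        part-00000-....snappy.parquet
    month=2015-02-01/
        part-00000-....snappy.parquet
    ...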
Question: How can I delete the parquet folder corresponding to a particular month?
One way would be to load all these parquet files into one big df, use .where() to filter out that particular month, and then save it back in parquet format, partitioned by month, in overwrite mode, like this -
# If we want to remove the data for Feb 2015
from pyspark.sql.functions import col, lit

df = spark.read.format('parquet').load('/path/Revenue')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')
But this approach is quite cumbersome. The other way is to directly delete the folder of that particular month (sketched below), but I am not sure if that's the right way to approach things, lest we alter the metadata in an unforeseeable way.
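For concreteness, the direct deletion would look something like this (a rough sketch only, assuming the data lives on HDFS; the partition path follows the layout shown above):

# Sketch: drop a single month's partition directory via the Hadoop FileSystem API.
# Assumes HDFS; on a local filesystem, removing the directory with shutil.rmtree
# would have the same effect.
hadoop = spark.sparkContext._jvm.org.apache.hadoop
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)
fs.delete(hadoop.fs.Path('/path/Revenue/month=2015-02-01'), True)  # True = recursive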
What would be the right way to delete the parquet data for a particular month?