I have a pandas DataFrame of geo data with latitude and longitude columns expressed in meters.
I want to bin the latitude and longitude into 5-meter bins, so I did the following:
df_geodata['lat_meter']=(df_geodata['lat_meter']//5)*5
df_geodata['lon_meter']=(df_geodata['lon_meter']//5)*5
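For reference, here is a toy version of the binning step (the coordinate values are made up, just to show the effect of the floor division):

import pandas as pd

# made-up coordinates in meters, only to illustrate the 5-meter binning
df_geodata = pd.DataFrame({'lat_meter': [0.4, 3.9, 7.2, 12.6],
                           'lon_meter': [1.1, 4.8, 9.9, 14.3]})

df_geodata['lat_meter'] = (df_geodata['lat_meter'] // 5) * 5   # -> 0, 0, 5, 10
df_geodata['lon_meter'] = (df_geodata['lon_meter'] // 5) * 5   # -> 0, 0, 5, 10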
Also, I wanted to randomly sample at most 5 rows from each bin, so I did the following:
df_sampled=df_geodata.groupby(['lat_meter','lon_meter'], as_index=False).apply(lambda obj: obj.loc[np.random.choice(obj.index, 5),:])
df_sampled=df_sampled.reset_index(drop=True).drop_duplicates()
So I call np.random.choice once per bin; since it samples with replacement by default, I drop the duplicates afterwards.
However, I recently found that on fairly large data (about 300 GB, 170,000,000 rows) this takes far too long, roughly 5 hours.
I suspect that calling np.random.choice separately for each bin is what costs so much time (np is numpy).
Is there a more efficient way to perform random sampling on 2D-binned data?
In particular, I suspect that staying within pandas instead of calling into numpy for every group might be faster. Is that possible?
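One idea I am considering (a rough sketch only, not yet verified on the full 300 GB dataset) is to shuffle the whole DataFrame once and then keep the first 5 rows of each bin, which avoids a per-bin np.random.choice call and never produces duplicates:

import pandas as pd

# rough sketch: one global shuffle, then at most 5 rows per (lat_meter, lon_meter) bin;
# no duplicates are produced, so the drop_duplicates step would no longer be needed
df_sampled = (
    df_geodata
    .sample(frac=1, random_state=42)              # shuffle all rows once
    .groupby(['lat_meter', 'lon_meter'], sort=False)
    .head(5)                                      # keep at most 5 rows per bin
    .reset_index(drop=True)
)

Would something like this be the recommended approach, or is there an even faster way?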