Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


in Technique by (71.8m points)

How to perform random sampling for 2D-binned data in pandas

I have a pandas DataFrame of geo data with latitude and longitude columns expressed in meters.

I want to bin latitude and longitude into 5-meter cells, so I did the following.

df_geodata['lat_meter']=(df_geodata['lat_meter']//5)*5
df_geodata['lon_meter']=(df_geodata['lon_meter']//5)*5
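
For reference, here is a small sketch of what this floor-division binning does on toy values. The coordinates and the BIN_SIZE name are made up for illustration and are not from the original data. Each value maps to the lower edge of its 5-meter bin, and negative values are rounded down as well.

```python
import pandas as pd

# Toy coordinates in meters (made-up values, not from the original data).
df_geodata = pd.DataFrame({
    "lat_meter": [0.0, 4.9, 5.0, 12.3, -0.1],
    "lon_meter": [7.5, 9.99, 10.0, 14.2, -5.0],
})

BIN_SIZE = 5  # bin width in meters (hypothetical constant name)

# Floor division maps each value to the lower edge of its bin.
binned = (df_geodata[["lat_meter", "lon_meter"]] // BIN_SIZE) * BIN_SIZE
print(binned)
```

Writing the result to a separate frame, as done here, keeps the raw coordinates available; the original snippet overwrites them in place.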

I also want to randomly sample at most 5 rows from each bin, so I did the following.

df_sampled=df_geodata.groupby(['lat_meter','lon_meter'], as_index=False).apply(lambda obj: obj.loc[np.random.choice(obj.index, 5),:])
df_sampled=df_sampled.reset_index(drop=True).drop_duplicates()
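
One detail worth noting about the snippet above: np.random.choice samples with replacement by default (replace=True), which is why duplicates appear at all. The toy bin below (made-up values) has only three rows, yet five labels are drawn from it, so repeats are guaranteed.

```python
import numpy as np
import pandas as pd

# A toy bin with only three rows (made-up values).
bin_rows = pd.DataFrame({"value": [10, 20, 30]}, index=[100, 101, 102])

# replace=True is the default, so five labels are drawn from three.
picked = np.random.choice(bin_rows.index, 5)
sampled = bin_rows.loc[picked, :]

print(len(sampled))                    # five rows, some of them repeated
print(len(sampled.drop_duplicates()))  # at most three distinct rows remain
```

Because drop_duplicates compares column values rather than index labels, it would also merge distinct rows that happen to hold identical values.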

I use np.random.choice within each bin and then drop the duplicates. However, I recently found that on fairly large data (about 300 GB, 170,000,000 rows) this takes far too long to run, around 5 hours.

I suspect that calling np.random.choice for each bin is what takes so much time (np is numpy). Is there a more efficient way to perform random sampling on 2D-binned data?

In particular, I think using only pandas rather than numpy may be more efficient. I would like to know a faster, more efficient way. Is it possible?
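
One possible pandas-only approach, offered as a sketch rather than a measured solution: shuffle all rows once, then keep the first 5 rows of each bin with groupby(...).head(5). This avoids calling a Python function per group and samples without replacement, so no duplicate removal is needed. The helper name sample_per_bin, its parameters, and the toy data are assumptions for illustration. It also assumes the frame fits in memory, which may not hold for 300 GB of data.

```python
import numpy as np
import pandas as pd


def sample_per_bin(df, keys, max_per_bin=5, seed=None):
    """Return at most `max_per_bin` random rows per group, without replacement."""
    # Shuffle all rows once; this is a single vectorized operation.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) keeps the first n rows of each group in the shuffled order,
    # which amounts to a uniform sample without replacement within each group.
    return shuffled.groupby(keys, sort=False).head(max_per_bin)


# Toy data standing in for the already-binned frame (made-up values).
rng = np.random.default_rng(0)
df_geodata = pd.DataFrame({
    "lat_meter": rng.integers(0, 4, size=1000) * 5,
    "lon_meter": rng.integers(0, 4, size=1000) * 5,
    "value": rng.random(1000),
})

df_sampled = sample_per_bin(df_geodata, ["lat_meter", "lon_meter"], seed=42)

# Largest number of rows kept in any single bin.
print(df_sampled.groupby(["lat_meter", "lon_meter"]).size().max())
```

Bins with fewer than 5 rows simply contribute all of their rows. The returned frame keeps the original index labels, so call reset_index(drop=True) afterwards if a fresh index is wanted.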


He who fights dragons for too long becomes a dragon himself; gaze into the abyss for too long, and the abyss will gaze back…

1 Reply

by (71.8m points)
Waiting for an expert to reply

OGeek | Geek China: Welcome to the world of geeks, a free and open platform for programmers to discuss programming. Openness, progress, sharing! Let technology change life, let geeks change the future!

...