Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Efficient groupby() coding for pandas in Python

I have written the following code, and it gives me what I need. However, I am wondering whether there is an easier or more efficient way to achieve the same result.

job_hist.groupby('employee_id').count()[job_hist.groupby('employee_id').count()['start_date'] > 1]

Note that I repeat job_hist.groupby('employee_id').count() inside the [ ] after the first count().

Thank you.

question from: https://stackoverflow.com/questions/65849229/efficient-groupby-coding-for-pandas-in-python

1 Reply

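If you do want a count for every column (count() returns the number of non-NaN values per column, so columns that contain NaN can give different counts), compute the grouped count once and reuse it: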

cnt = job_hist.groupby('employee_id').count()
out = cnt.loc[cnt['start_date'] > 1]
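If a single expression is preferred, one possible variant is to pass a callable to .loc, which receives the grouped counts so the groupby is written only once. This is a minimal sketch; the sample job_hist below is hypothetical, since the question does not show the real data.

```python
import pandas as pd

# Hypothetical sample data; the real job_hist is not shown in the question.
job_hist = pd.DataFrame({
    "employee_id": [101, 101, 102, 200, 200, 200],
    "start_date": pd.to_datetime([
        "2001-01-13", "2005-03-01", "2002-07-01",
        "1995-09-17", "2002-07-01", "2006-01-01",
    ]),
    "job_id": ["AC_ACCOUNT", "AC_MGR", "MK_REP",
               "AD_ASST", "AC_ACCOUNT", "AD_ASST"],
})

# The lambda is called with the result of count(), so the
# groupby/count expression does not have to be repeated.
out = job_hist.groupby("employee_id").count().loc[lambda d: d["start_date"] > 1]
print(out)
```

Here only employees 101 and 200 remain, because employee 102 has a single row.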

But a more customary goal is simply to count how many rows there are for each employee_id. For that, use size(), which counts rows regardless of NaN and returns a Series:

cnt = job_hist.groupby('employee_id').size()
out = cnt.loc[cnt > 1]

Or, in one go:

out = job_hist.groupby('employee_id').size().to_frame('cnt').query('cnt > 1')
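To see where count() and size() can disagree, here is a small sketch on made-up data (the frame below is an assumption, not the asker's job_hist) in which one employee has a missing start_date:

```python
import numpy as np
import pandas as pd

# Hypothetical data: employee 1 has two rows, one with a missing start_date.
job_hist = pd.DataFrame({
    "employee_id": [1, 1, 2, 2, 3],
    "start_date": ["2020-01-01", np.nan, "2019-05-01", "2021-02-01", "2018-03-01"],
})

grouped = job_hist.groupby("employee_id")

by_count = grouped.count()  # DataFrame: non-NaN values per column
by_size = grouped.size()    # Series: rows per group, NaN rows included

count_filtered = by_count.loc[by_count["start_date"] > 1]
size_filtered = by_size.loc[by_size > 1]

print(count_filtered.index.tolist())  # [2]
print(size_filtered.index.tolist())   # [1, 2]
```

Employee 1 is dropped by the count()-based filter because only one of its start_date values is present, but it is kept by the size()-based filter because it has two rows.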
