Efficient groupby() coding for pandas in Python

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have written the following code, and it gives me what I need. However, I am wondering whether there is an easier or more efficient way of achieving the same result.

job_hist.groupby('employee_id').count()[job_hist.groupby('employee_id').count()['start_date'] > 1]

Note that I had to repeat job_hist.groupby('employee_id').count() inside the [ ] after the first count() in order to build the filter.

Thank you.


1 Reply

深蓝 · Answer 1 · 2021-10-06T19:29:33+0000

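If you do want a count for every column (e.g. because some of them may contain NaN, which count() skips, so the counts can differ between columns), compute the grouped count once and reuse it: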

cnt = job_hist.groupby('employee_id').count()
out = cnt.loc[cnt['start_date'] > 1]
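
As a rough illustration of the two lines above, here is a minimal sketch on made-up data. The column job_id and all the values are assumptions, since the question does not show what job_hist actually contains.

```python
import pandas as pd

# Hypothetical sample data; the real job_hist columns are not shown in the question.
job_hist = pd.DataFrame({
    "employee_id": [101, 101, 102, 103, 103, 103],
    "start_date": ["2019-01-01", "2020-06-01", "2018-03-15",
                   "2017-01-01", None, "2021-02-01"],
    "job_id": ["A", "B", "A", "C", "C", None],
})

# count() gives, per employee_id, the number of non-null values in each column.
# The grouped result is computed once and reused for the filter.
cnt = job_hist.groupby("employee_id").count()

# Keep only employees with more than one non-null start_date.
out = cnt.loc[cnt["start_date"] > 1]
print(out)
```

Employee 103 has three rows here but only two non-null start_date values, so it is the non-null count that decides whether it passes the filter.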

But a more customary goal is simply to count how many rows there are for each employee_id, which size() does regardless of any NaN in the other columns:

cnt = job_hist.groupby('employee_id').size()
out = cnt.loc[cnt > 1]
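
To make the difference between size() and count() concrete, the sketch below runs both on a small invented frame. The data is hypothetical and only meant to show how a missing start_date changes the result.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
job_hist = pd.DataFrame({
    "employee_id": [101, 101, 102, 103, 103],
    "start_date": ["2019-01-01", "2020-06-01", "2018-03-15", "2017-01-01", None],
})

by_emp = job_hist.groupby("employee_id")

sizes = by_emp.size()                  # rows per employee, missing values included
counts = by_emp["start_date"].count()  # non-null start_date values per employee

# Employees with more than one row.
out = sizes.loc[sizes > 1]
print(out)
```

With this data employee 103 has two rows but only one non-null start_date, so it is kept by the size() filter and would be dropped by a filter on count().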

Or, in one go:

out = job_hist.groupby('employee_id').size().to_frame('cnt').query('cnt > 1')
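
Another way to avoid repeating the groupby in a single expression is to pass a callable to .loc, which pandas calls with the object being indexed. The sketch below is one possible variant, not the answer's own method, and it again uses made-up data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
job_hist = pd.DataFrame({
    "employee_id": [101, 101, 102, 103, 103, 103],
    "start_date": ["2019-01-01", "2020-06-01", "2018-03-15",
                   "2017-01-01", "2019-05-01", "2021-02-01"],
})

# The one-liner from the answer: a one-column frame named cnt, filtered by query().
out_query = job_hist.groupby("employee_id").size().to_frame("cnt").query("cnt > 1")

# Same filter with a callable: s is the Series returned by size().
out_size = job_hist.groupby("employee_id").size().loc[lambda s: s > 1]

# The callable form also works for the original count() version:
# d is the DataFrame returned by count().
out_count = job_hist.groupby("employee_id").count().loc[lambda d: d["start_date"] > 1]

print(out_query)
print(out_size)
print(out_count)
```

The callable form keeps everything in one chain without an intermediate variable, at the cost of being a little less obvious to readers who have not seen it before.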