The major issue is that you likely have the string 'np.nan'
stored and not a real null value. Here is how the three methods handle null
values differently:
Sample Data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': [None, '1', np.nan, '2', 3, 4]})
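To check the first point, whether a column holds the text 'np.nan' rather than a real null, you can compare what isna() reports before and after replacing the string. This is only a minimal sketch; the frame `raw` and its values are made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the "missing" entry is the text 'np.nan'.
raw = pd.DataFrame({'A': [1, 1, 2], 'B': ['np.nan', 'x', 'y']})

# isna() does not treat the string as missing.
nulls_before = int(raw['B'].isna().sum())

# Swap the placeholder string for a real NaN, then count again.
fixed = raw.assign(B=raw['B'].replace('np.nan', np.nan))
nulls_after = int(fixed['B'].isna().sum())

print(nulls_before, nulls_after)
```

If the count only goes up after the replacement, the column was storing strings, and none of the methods below would have treated those entries as null.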
first/last
This returns the first/last non-null value within each group. Oddly enough, it will not skip None, though this can be made possible with the kwarg dropna=True. As a result, you may get values in different columns that originally came from different rows:
df.groupby('A', as_index=False).first()
#   A     B
#0  1  None
#1  2     2
#2  3     3
df.groupby('A', as_index=False).first(dropna=True)
#   A  B
#0  1  1
#1  2  2
#2  3  3
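The row-mixing effect is easiest to see with two columns that have nulls in different rows. A small sketch with made-up numeric data (the frame `mixed` is an assumption for illustration, not part of the sample above):

```python
import numpy as np
import pandas as pd

# One group, two rows; each column is null in a different row.
mixed = pd.DataFrame({'A': [1, 1],
                      'B': [np.nan, 2.0],
                      'C': [3.0, np.nan]})

# first() works column by column: B is taken from row 1, C from row 0.
first_vals = mixed.groupby('A', as_index=False).first()

# last() does the same from the other end and lands on the same values here.
last_vals = mixed.groupby('A', as_index=False).last()

print(first_vals)
print(last_vals)
```

The single output row (B=2.0, C=3.0) never existed as a row in `mixed`, which is the behavior to keep in mind when the values in a row need to stay together.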
head(n)/tail(n)
Returns the top/bottom n rows within a group. Values remain bound within rows. If you give it an n that is larger than the number of rows in a group, it returns all rows in that group without complaining:
df.groupby('A', as_index=False).head(1)
#   A     B
#0  1  None
#2  2   NaN
#4  3     3
df.groupby('A', as_index=False).head(200)
#   A     B
#0  1  None
#1  1     1
#2  2   NaN
#3  2     2
#4  3     3
#5  3     4
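Because head and tail select rows instead of aggregating, the original index is kept, which makes it easy to see which rows were picked. A short sketch with made-up integer data (this `df` is a stand-in, not the sample frame above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [10, 11, 20, 21, 30, 31]})

top = df.groupby('A').head(1)            # first row of each group
bottom = df.groupby('A').tail(1)         # last row of each group
everything = df.groupby('A').head(200)   # n exceeds every group size

print(top.index.tolist())          # [0, 2, 4]
print(bottom.index.tolist())       # [1, 3, 5]
print(len(everything) == len(df))  # True
```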
nth
This takes the nth row, so again values remain bound within the row. .nth(0) is the same as .head(1), though they have different uses. For instance, if you need the 0th and 2nd row, that's difficult to do with .head(), but easy with .nth([0,2]). Also, it's far easier to write .head(10) than .nth(list(range(10))).
df.groupby('A', as_index=False).nth(0)
#   A     B
#0  1  None
#2  2   NaN
#4  3     3
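Here is a small sketch of the two uses mentioned above: picking non-adjacent positions with a list, and nth(0) matching head(1). The data is made up, and because the index of an nth result has differed between pandas versions, the comparison below looks at values only:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [10, 11, 12, 20, 21, 22]})

g = df.groupby('A')

# Rows at positions 0 and 2 of each group, in a single call.
picked = g.nth([0, 2])

# nth(0) and head(1) select the same rows, so their B values agree.
same_rows = g.nth(0)['B'].tolist() == g.head(1)['B'].tolist()

print(sorted(picked['B'].tolist()))
print(same_rows)
```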
nth also supports dropping rows with any null-values, so you can use it to return the first row without any null-values, unlike .head():
df.groupby('A', as_index=False).nth(0, dropna='any')
#   A  B
#A
#1  1  1
#2  2  2
#3  3  3
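An alternative sketch that does not rely on the dropna argument: filter out the null rows first, then take head(1). This is a swapped-in approach rather than the nth call above, and it keeps the original row index instead of the group keys:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [None, '1', np.nan, '2', 3, 4]})

# Drop rows where B is null (both None and NaN count), then take the
# first remaining row of each group.
first_complete = df.dropna(subset=['B']).groupby('A').head(1)

print(first_complete.index.tolist())  # [1, 3, 4]
```

Rows 0 and 2 are removed before grouping, so each group's first surviving row is returned with its values still bound together.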