We have a DataFrame that looks like this:
> df.loc[:2, :10]
0 1 2 3 4 5 6 7 8 9 10
0 NaN NaN NaN NaN 6 5 NaN NaN 4 NaN 5
1 NaN NaN NaN NaN 8 NaN NaN 7 NaN NaN 5
2 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN
We simply want the counts of all unique values in the DataFrame. A simple solution is:
df.stack().value_counts()
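As a quick sanity check on a tiny frame (not the real data; the values here are made up for illustration), the one-liner counts every non-NaN cell in one global histogram:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame([[nan, 1, nan, 2, 3],
                   [nan, 1, 1, 1, 3]])

# stack() drops the NaNs, so value_counts() sees only real entries
counts = df.stack().value_counts()
print(counts)  # 1.0 -> 4, 3.0 -> 2, 2.0 -> 1
```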
However:
1. It looks like stack returns a copy, not a view, which is memory-prohibitive in this case. Is this correct?
2. I want to group the DataFrame by rows and then get a separate histogram for each group. If we ignore the memory issue with stack and use it for now, how does one do the grouping correctly?
import numpy as np
import pandas as pd
from numpy import nan

d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])
len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length
The stacked DataFrame has a MultiIndex, and its length is some number less than n_rows * n_columns, because the NaNs are removed:
0  1    1
   3    2
   4    3
1  0    1
   1    1
   2    1
   3    1
   4    3
...
This means we don't easily know how to build our grouping. It would be much better to just operate on the first index level, but then I'm stuck on how to apply the grouping I actually want.
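One way out (a sketch, assuming the aabb row grouping used below) is to derive the grouper from the stacked Series' own first index level rather than from positions, so the dropped NaNs cannot misalign it:

```python
import pandas as pd
from numpy import nan

d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

s = d.stack()
# Label each stacked entry by its row's group, not by its position,
# so the removed NaNs can't throw the alignment off.
row_to_group = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
grouper = s.index.get_level_values(0).map(row_to_group)
result = s.groupby(grouper).value_counts()
print(result)
```

Because the grouper is built from the surviving index entries, it always has the same length as the stacked Series, which is exactly what the AssertionError above was complaining about.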
d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'
Edit: A solution that doesn't use stacking:
f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64
Looks clunky, though. If there's a better option, I'm happy to hear it.
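A flatter alternative (a sketch; the only assumptions are np.repeat and pd.crosstab, both standard) is to cross-tabulate the raveled values against repeated group labels. crosstab drops pairs containing NaN by default, so the missing cells simply vanish from the counts:

```python
import numpy as np
import pandas as pd
from numpy import nan

d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

# One group label per cell, row-major, matching d.values.ravel()
labels = np.repeat(['a', 'a', 'b', 'b'], d.shape[1])
table = pd.crosstab(labels, d.values.ravel())
print(table)  # one row per group, one column per distinct value
```

This gives the histograms as a 2-D table (groups down, values across) instead of a stacked Series, which can be more convenient for comparing groups.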
Edit: Dan's comment revealed I had a typo; correcting it still doesn't get us to the finish line.