python - Why doesn't groupby sum convert boolean to int or float?

Question

Welcome To Ask or Share your Answers For Others

python - Why doesn't groupby sum convert boolean to int or float?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Why doesn't groupby sum convert boolean to int or float?

I'll start with 3 simple examples:

pd.DataFrame([[True]]).sum()

0    1
dtype: int64

pd.DataFrame([True]).sum()

0    1
dtype: int64

pd.Series([True]).sum()

1

All of these are as expected. Here is a more complicated example.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', True],
        ['b', 'B', True],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.Z.sum()

4

Also as expected. However, if I groupby(['X', 'Y']).sum()

I expected it to look like:

I'm thinking bug. Is there another explanation?

Per @unutbu's answer

pandas is trying to recast as original dtypes. I had thought that maybe the group by I'd performed didn't really groupby anything. So I tried this example to test out the idea.

df = pd.DataFrame([
        ['a', 'A', False],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

I'll groupby('X') and sum. If @unutbu is correct, these sums should be 1 and 0 and are castable to bool, therefore we should see bool

df.groupby('X').sum()

Sure enough... bool

But if the process is the same but the values are slightly different.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.groupby('X').sum()

lesson learned. Always use astype(int) or something similar when doing this.

df.groupby('X').sum().astype(int)

gives consistent results for either scenario.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:59:04+0000

This occurs because _cython_agg_blocks calls _try_coerce_and_cast_result which calls _try_cast_result which tries to return a result of the same dtype as the original values (in this case, bool).

This returns something a little peculiar when Z has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since _try_cast_result does not convert 2.0 back to a boolean.

_try_cast_result does something more useful when Z has dtype int: Internally, the Cython aggregator used by df.groupby(['X', 'Y']).sum() returns a result of dtype float. Here then, _try_cast_result returns the result to dtype int.

Categories

python - Why doesn't groupby sum convert boolean to int or float?

python - Why doesn't groupby sum convert boolean to int or float?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags