python - Pandas aggregating average while excluding current row

Question

Welcome To Ask or Share your Answers For Others

python - Pandas aggregating average while excluding current row

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Pandas aggregating average while excluding current row

How to aggregate in the way to get the average of b for group a, while excluding the current row (the target result is in c)?

a b   c

1 1   0.5   # (avg of 0 & 1, excluding 1)
1 1   0.5   # (avg of 0 & 1, excluding 1)
1 0   1     # (avg of 1 & 1, excluding 0)

2 1   0.5   # (avg of 0 & 1, excluding 1)
2 0   1     # (avg of 1 & 1, excluding 0)
2 1   0.5   # (avg of 0 & 1, excluding 1)

3 1   0.5   # (avg of 0 & 1, excluding 1)
3 0   1     # (avg of 1 & 1, excluding 0)
3 1   0.5   # (avg of 0 & 1, excluding 1)

Data dump:

import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:04:18+0000

Suppose a group has values x_1, ..., x_n.

The average of the entire group would be

m = (x_1 + ... + x_n)/n

The sum of the group without x_i would be

(m*n - x_i)

The average of the group without x_i would be

(m*n - x_i)/(n-1)

Therefore, you could compute the desired column of values with

import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1], 
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                     columns=['a', 'b', 'c'])

grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)

which yields

In [32]: df
Out[32]: 
   a  b    c  result
0  1  1  0.5     0.5
1  1  1  0.5     0.5
2  1  0  1.0     1.0
3  2  1  0.5     0.5
4  2  0  1.0     1.0
5  2  1  0.5     0.5
6  3  1  0.5     0.5
7  3  0  1.0     1.0
8  3  1  0.5     0.5

In [33]: assert df['result'].equals(df['c'])

Per the comments below, in the OP's actual use case, the DataFrame's a column contains strings:

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
            .view('|S{}'.format(strlen)))

N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b':np.random.randint(10, size=N)})

so that there are about a million unique values in df['a'] out of 3 million total:

In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337

In [89]: len(df)
Out[89]: 3000000

In this case the calculation above requires (on my machine) about 11 seconds:

In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
   ....:    ....:    ....:    ....: 
1 loops, best of 3: 10.5 s per loop

Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width dtype, and the group according to those values.

Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:

In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
   ....:    ....:    ....:    ....: 
1 loops, best of 3: 1.39 s per loop

Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, then those strings will get silently truncated with no error warning. This could potentially lead to the grouping of values which should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.

You could use

dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))

to ensure the conversion uses the correct dtype.

Categories

python - Pandas aggregating average while excluding current row

python - Pandas aggregating average while excluding current row

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags