You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.
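b = df.data_binary   # the 0/1 column
c = b.cumsum()       # running total that never resets
# Subtract the running total as of the most recent zero; fill_value=0 covers rows before the first zero
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)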
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side:
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
   data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1           NaN           NaN             1
1            0               1           1.0           1.0             0
2            1               2           NaN           1.0             1
3            1               3           NaN           1.0             2
4            1               4           NaN           1.0             3
5            0               4           4.0           4.0             0
6            0               4           4.0           4.0             0
7            1               5           NaN           4.0             1
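The problem with cumulative_sum is that rows where data_binary is zero do not reset the sum. That is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I mask the cumulative sum so that only the values where data_binary is zero remain, then forward fill those values. That gives, for every row, the cumulative sum as of the most recent zero. When I take the difference between this and the cumulative sum, I've effectively reset the sum. Rows before the first zero have nothing to forward fill from, so fill_value=0 makes the subtraction treat that missing baseline as zero.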
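To try the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the data_binary column shown in the table, and the helper name running_sum_with_reset is made up for illustration. The groupby line at the end is a different technique from the one in this answer, included only as a cross-check.

```python
import pandas as pd


def running_sum_with_reset(b: pd.Series) -> pd.Series:
    """Cumulative sum of b that restarts whenever b is zero.

    Hypothetical helper name; it wraps the mask/ffill steps described above.
    """
    c = b.cumsum()                      # running total that never resets
    baseline = c.mask(b != 0).ffill()   # running total as of the most recent zero
    # fill_value=0 treats the missing baseline before the first zero as 0
    return c.sub(baseline, fill_value=0).astype(int)


# Sample data reconstructed from the data_binary column in the table above
df = pd.DataFrame({"data_binary": [1, 0, 1, 1, 1, 0, 0, 1]})

result = running_sum_with_reset(df["data_binary"])
print(result.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]

# Cross-check with a different approach (not the method in this answer):
# start a new group at every zero and take the cumulative sum within each group.
groups = df["data_binary"].eq(0).cumsum()
alt = df["data_binary"].groupby(groups).cumsum()
print(alt.tolist() == result.tolist())  # True
```

Both versions agree on this sample. The mask and forward fill version stays fully vectorized without a groupby, which is the approach this answer takes.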