Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
5.6k views
in Technique[技术] by (71.8m points)

python - Winsorize dataframe columns per month while ignoring NaN's

I have a dataframe with monthly data and the following colums: date, bm and cash

date        bm        cash
1981-09-30  0.210308  2.487146
1981-10-31  0.241291  2.897529
1981-11-30  0.221529  2.892758
1981-12-31  0.239002  2.726372
1981-09-30  0.834520  4.387087
1981-10-31  0.800472  4.297658
1981-11-30  0.815778  4.459382
1981-12-31  0.836681  4.895269

Now I want to winsorize my data per month while keeping NaN values in the data. I.e. I want to group the data per month and overwrite observations above the 0.99 and below the 0.01 percentile with the 99 percentile and 0.01 percentile respectively. From Winsorizing data by column in pandas with NaN I found that I should do this with the "clip" function. My code looks as follows:

df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))

I get the following output: ValueError: cannot reindex from a duplicate axis

P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.

Edit: These solution from @Allolz is already of great help, but it does not work exactly as it is supposed to. Before I run the code from @Allolz I I ran : df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Which returned:

date            
1980-01-31  0.00    1.580564e+00
            0.01    1.599805e+00
            0.25    2.388106e+00
            0.50    6.427071e+00
            0.75    1.200685e+01
            0.99    5.133111e+01
            1.00    5.530329e+01

After winsorizing I get:

date            
1980-01-31  0.00         1.599805
            0.01         1.617123
            0.25         2.388106
            0.50         6.427071
            0.75        12.006854
            0.99        47.756152
            1.00        51.331114

It is clear that the new 0.0 and 1.0 quantiles are equal to the original 0.01 and 0.09 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles where I would expect that these should remain the same. What can cause this and wat could solve it? My hunch is that it might have to do with NaN's in the data, but I'm not sure if that is really the cause.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast columns for the 0.01 and 0.99 quantile (for that Month group) back to the DataFrame and then you can use those Series to clip the original at once. (clip will leave NaN alone so it satisfies that requirement too). Then if you want, remove the helper columns (I'll leave them in for clariity).

Sample Data

import numpy as np
import panda as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})

Code

gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']

# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)

# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])

Output

The majority of rows will not be changed (only those outside of the 1-99th percentile) so we can check the small susbet rows that did change to see it works. You can see rows for the same months have the same bounds and the winsorized value ('val_wins') is properly clipped to the bound it exceeds.

df[df['val'] != df['val_wins']]

#           date       val        lb        ub  val_wins
#42   2010-09-01 -1.686566 -1.125862  3.206333 -1.125862
#96   2010-04-01 -1.255322 -1.243975  2.995711 -1.243975
#165  2010-08-01  3.367880 -1.020273  3.332030  3.332030
#172  2010-09-01 -1.813011 -1.125862  3.206333 -1.125862
#398  2010-09-01  3.281198 -1.125862  3.206333  3.206333
#...         ...       ...       ...       ...       ...
#9626 2010-12-01  3.626950 -1.198967  3.249161  3.249161
#9746 2010-11-01  3.472490 -1.259557  3.261329  3.261329
#9762 2010-09-01  3.460467 -1.125862  3.206333  3.206333
#9768 2010-06-01 -1.625013 -1.482529  3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967  3.249161 -1.198967
#
#[214 rows x 5 columns]

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...