Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
182 views
in Technique[技术] by (71.8m points)

python - How to replace repeated NaNs with a different value from lone NaNs from Pandas data frame

I have several timeseries arranged in a data frame, similar to below:


   category value   time_idx
0   810     0.118794    0
1   830     0.552947    0
2   1120    0.133193    0
3   1370    0.840183    0
4   810     0.129385    1
... ... ... ...
6095 1370   0.157391    1523
6096 810    0.141377    1524
6097 830    0.212254    1524
6098 1120   0.069970    1524
6099 1370   0.134947    1524

Some values are NaN. What I would like is to replace any NaN values that are NOT repeated with 0, as I am assuming the value is 0 for that category at that time. However, any time that every single category has a value of NaN at that same time (i.e. at the same time_idx), then I want to replace every value with -1.

Just replacing the NaNs with a value is of course trivial in Pandas, but the added complexity of specifically replacing NaNs that are NaN for every category at a given time has stumped me. I know I can just loop through the time indices, but my actual datasets will have over 900 categories, so I would like to find a more efficient Pandas-esque method.

The only thing I could think of was list comprehension, which I don't think is even necessarily more efficient than an explicit loop anyway, plus I couldn't come up with one that worked properly.

I know that I can just replace all NaNs like so:

data["value"] = data["value"].replace(np.nan, 0)

but I'm not sure how to implement this in my case, where I only want to replace long NaNs with 0. This is the loop I have currently:

num_channels = data["category"].nunique()
nan_vals = data[lambda x: np.isnan(x.value)]
nan_times = nan_vals["time_idx"]

for time in nan_times:
        if nan_vals[lambda x: x.time_idx == time]["category"].nunique() < num_channels:
            # Set 0 for every channel that has nan at time t
            index = nan_vals[lambda x: x.time_idx == time].index

            data.loc[index, ["value"]] =  data.loc[index, "value"].replace(np.nan, 0)

        else:

            index = nan_vals[lambda x: x.time_idx == time].index
            data.loc[index, ["value"]] = data[lambda x: x.time_idx == time]["value"].replace(np.nan, -1)

Any ideas are appreciated.

Here is an example:

given the following data frame:

    category    value   time_idx
0   810          NaN    0
1   830          NaN    0
2   1120         NaN    0
3   1370         NaN    0
4   810      0.129385   1
5   830          NaN    1
6   1120     0.144378   1
7   1370         NaN    1
8   810      0.124334   2
9   830      0.487274   2
10  1120     0.119153   2
11  1370     0.871687   2

I would like this output:

    category    value   time_idx
0   810        -1.000000    0
1   830        -1.000000    0
2   1120       -1.000000    0
3   1370       -1.000000    0
4   810         0.129385    1
5   830         0.000000    1
6   1120        0.144378    1
7   1370        0.000000    1
8   810         0.124334    2
9   830         0.487274    2
10  1120        0.119153    2
11  1370        0.871687    2

In this example, at time = 0 every category's value was NaN, so they would be replaced with -1. At time = 1, there were non-NaN values, so any NaN values present (category 830 and 1370) would be replaced with 0.

question from:https://stackoverflow.com/questions/65939149/how-to-replace-repeated-nans-with-a-different-value-from-lone-nans-from-pandas-d

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can find those time_idx where all entries are NaN using groupby and then group.isna().all(). You can use that mask to fill the NaNs with -1.

Afterwards fill all other NaNs with 0 using fillna.

all_nas = df.groupby("time_idx").value.apply(lambda group: group.isna().all())
df = df.set_index("time_idx")
df.loc[all_nas, "value"] = -1
df = df.reset_index().fillna(0)
print(df)

#     time_idx  category     value
# 0          0       810 -1.000000
# 1          0       830 -1.000000
# 2          0      1120 -1.000000
# 3          0      1370 -1.000000
# 4          1       810  0.129385
# 5          1       830  0.000000
# 6          1      1120  0.144378
# 7          1      1370  0.000000
# 8          2       810  0.124334
# 9          2       830  0.487274
# 10         2      1120  0.119153
# 11         2      1370  0.871687

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...