If I'm understanding you correctly, you have a dataset with lots of small gaps (single NaN
s) that you want filled and larger gaps that you don't.
Using pandas
to "forward-fill" gaps
One option is to use pandas
fillna
with a limited amount of fill values.
As a quick example of how this works:
In [1]: import pandas as pd; import numpy as np
In [2]: x = pd.Series([1, np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, 4])
In [3]: x.fillna(method='ffill', limit=1)
Out[3]:
0 1
1 1
2 2
3 2
4 NaN
5 3
6 3
7 NaN
8 NaN
9 4
dtype: float64
In [4]: x.fillna(method='ffill', limit=2)
Out[4]:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 3
8 NaN
9 4
dtype: float64
As an example of using this for something similar to your case:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1977)
x = np.random.normal(0, 1, 1000).cumsum()
# Set every third value to NaN
x[::3] = np.nan
# Set a few bigger gaps...
x[20:100], x[200:300], x[400:450] = np.nan, np.nan, np.nan
# Use pandas with a limited forward fill
# You may want to adjust the `limit` here. This will fill 2 nan gaps.
filled = pd.Series(x).fillna(limit=2, method='ffill')
# Let's plot the results
fig, axes = plt.subplots(nrows=2, sharex=True)
axes[0].plot(x, color='lightblue')
axes[1].plot(filled, color='lightblue')
axes[0].set(ylabel='Original Data')
axes[1].set(ylabel='Filled Data')
plt.show()
Using numpy
to interpolate gaps
Alternatively, we can do this using only numpy
. It's possible (and more efficient) to do a "forward fill" identical to the pandas method above, but I'll show another method to give you more options than just repeating values.
Instead of repeating the last value through the "gap", we can perform linear interpolation of the values in the gap. This is less efficient computationally (and I'm going to make it even less efficient by interpolating everywhere), but for most datasets you won't notice a major difference.
As an example, let's define an interpolate_gaps
function:
def interpolate_gaps(values, limit=None):
"""
Fill gaps using linear interpolation, optionally only fill gaps up to a
size of `limit`.
"""
values = np.asarray(values)
i = np.arange(values.size)
valid = np.isfinite(values)
filled = np.interp(i, i[valid], values[valid])
if limit is not None:
invalid = ~valid
for n in range(1, limit+1):
invalid[:-n] &= invalid[n:]
filled[invalid] = np.nan
return filled
Note that we'll get interpolated value, unlike the previous pandas
version:
In [11]: values = [1, np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, 4]
In [12]: interpolate_gaps(values, limit=1)
Out[12]:
array([ 1. , 1.5 , 2. , nan, 2.66666667,
3. , nan, nan, 3.75 , 4. ])
In the plotting example, if we replace the line:
filled = pd.Series(x).fillna(limit=2, method='ffill')
With:
filled = interpolate_gaps(x, limit=2)
We'll get a visually identical plot:
As a complete, stand-alone example:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1977)
def interpolate_gaps(values, limit=None):
"""
Fill gaps using linear interpolation, optionally only fill gaps up to a
size of `limit`.
"""
values = np.asarray(values)
i = np.arange(values.size)
valid = np.isfinite(values)
filled = np.interp(i, i[valid], values[valid])
if limit is not None:
invalid = ~valid
for n in range(1, limit+1):
invalid[:-n] &= invalid[n:]
filled[invalid] = np.nan
return filled
x = np.random.normal(0, 1, 1000).cumsum()
# Set every third value to NaN
x[::3] = np.nan
# Set a few bigger gaps...
x[20:100], x[200:300], x[400:450] = np.nan, np.nan, np.nan
# Interpolate small gaps using numpy
filled = interpolate_gaps(x, limit=2)
# Let's plot the results
fig, axes = plt.subplots(nrows=2, sharex=True)
axes[0].plot(x, color='lightblue')
axes[1].plot(filled, color='lightblue')
axes[0].set(ylabel='Original Data')
axes[1].set(ylabel='Filled Data')
plt.show()
Note: I originally completely mis-read the question. See version history for my original answer.