It would be dandy if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request in Pandas's GitHub is in order...
Using a converter function
However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:
import pandas as pd

def conv(val):
    # converters receive the raw string values, so a missing field arrives as ''
    if val == '':
        return 0  # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)
Note that converters takes a dict, so you need an entry for each column whose NaNs you want handled. That can get a little tiresome if a lot of columns are affected. You can specify either column names or column numbers as keys.
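For instance, here is a minimal sketch that applies the same converter to several columns by position; the file name data.csv and the column positions 0 and 3 are placeholders for illustration:

import pandas as pd

def conv(val):
    # missing fields reach the converter as empty strings
    return 0 if val == '' else val

# 0-based column positions used as dict keys instead of column labels
df = pd.read_csv('data.csv', converters={0: conv, 3: conv})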
Also note that this might slow down your read_csv performance, depending on how the converter function is handled. Further, if you have just one column whose NaNs need handling during the read, you can skip a proper function definition and use a lambda function instead:
df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else x}, dtype=...)
Reading in chunks
You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:
import numpy as np
import pandas as pd

reader = pd.read_csv(file, chunksize=1000)
chunks = []
for chunk in reader:
    chunk.dropna(axis=0, inplace=True)  # drop all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
del reader, chunks
Note that this method does not hold the whole file in memory at once: each chunk is filtered as soon as it is read, so only the rows that survive dropna accumulate in chunks, and the data is duplicated only briefly while pd.concat builds result, which is a fair bargain. This method may also work out to be faster than using a converter function; a rough way to check is sketched below.