Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to completely ignore whitespaces in csv with Pandas

I am trying to make a .csv file in a format that is both minimally human-readable and easily readable by pandas. That means the columns should be neatly aligned so you can easily see which column each value belongs to. The problem is that padding the file with whitespace breaks some pandas functionality. So far what I've got is

work    ,roughness  ,unstab ,corr_c_w   ,u_star ,c_star
us      ,True       ,True   ,-0.39      ,0.35   ,-.99
wang    ,False      ,       ,-0.5       ,       ,
cheng   ,           ,True   ,           ,       ,
watanabe,           ,       ,           ,0.15   ,-.80

If I take out all the whitespace from the above .csv and read it directly with pd.read_csv, it works perfectly: the first two columns are booleans and the others are floats. However, without the whitespace it is not human-readable at all. When I read the above .csv with

pd.read_csv('bibrev.csv', index_col=0)

it doesn't work, because all the columns are treated as strings that include the whitespace. When I use

pd.read_csv('bibrev.csv', index_col=0, skipinitialspace=True)

then it kind of works, because floats are read as floats and missing values are read as NaNs, which is a big improvement. However, the column names and boolean columns are still strings with whitespaces.
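The behaviour described above can be checked with a small sketch. It assumes the same padded layout as the file in the question, held in an in-memory string (the name raw is made up for the demo) instead of bibrev.csv. skipinitialspace only drops the spaces after each comma, so the padding before the next comma stays in the header names.

```python
import io
import pandas as pd

# Same padded layout as the file in the question, held in memory for the demo.
raw = """\
work    ,roughness  ,unstab ,corr_c_w   ,u_star ,c_star
us      ,True       ,True   ,-0.39      ,0.35   ,-.99
wang    ,False      ,       ,-0.5       ,       ,
cheng   ,           ,True   ,           ,       ,
watanabe,           ,       ,           ,0.15   ,-.80
"""

df = pd.read_csv(io.StringIO(raw), index_col=0, skipinitialspace=True)

# Header names that still carry trailing padding.
padded_names = [name for name in df.columns if name != name.strip()]
print(padded_names)
print(df.dtypes)
```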

Is there any way of reading that .csv directly with pandas? Or maybe of changing the csv format a bit so that it still reads cleanly while staying human-readable?

PS.: I am trying to avoid reading everything with Python as a string, replacing the whitespace and then feeding the result to pandas. I am also trying to avoid defining functions and passing them to pandas through the converters keyword.


1 Reply


Try this:

import pandas as pd

def booleator(col):
    # 'true'/'yes' (any case) become True; everything else,
    # including empty cells, becomes False.
    return str(col).lower() in ['true', 'yes']

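# The regex separator swallows the padding around each comma;
# regex separators are only supported by the python engine.
df = pd.read_csv('data.csv', sep=r'\s*,\s*', index_col=0,
                 converters={'roughness': booleator, 'unstab': booleator},
                 engine='python')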
print(df)
print(df.dtypes)

Output:

         roughness unstab  corr_c_w  u_star  c_star
work
us            True   True     -0.39    0.35   -0.99
wang         False  False     -0.50     NaN     NaN
cheng        False   True       NaN     NaN     NaN
watanabe     False  False       NaN    0.15   -0.80
roughness       bool
unstab          bool
corr_c_w     float64
u_star       float64
c_star       float64
dtype: object

This version also takes care of the booleans: every missing value is converted to False, because otherwise pandas would promote the column's dtype to object (see details in my comment).
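If the missing booleans should stay missing rather than become False, one possible variation (not part of the answer above) is to drop the converters and request pandas' nullable "boolean" dtype for those two columns. This is only a sketch: it assumes a pandas version that provides that dtype (1.0 or later), and it reads the question's data from an in-memory string (the name raw is made up) rather than from data.csv.

```python
import io
import pandas as pd

# The data from the question, held in memory for the demo.
raw = """\
work    ,roughness  ,unstab ,corr_c_w   ,u_star ,c_star
us      ,True       ,True   ,-0.39      ,0.35   ,-.99
wang    ,False      ,       ,-0.5       ,       ,
cheng   ,           ,True   ,           ,       ,
watanabe,           ,       ,           ,0.15   ,-.80
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s*,\s*",   # regex separator: strips the padding around each comma
    index_col=0,
    engine="python",  # regex separators need the python engine
    # Assumption: nullable boolean dtype, so empty cells become <NA>, not False.
    dtype={"roughness": "boolean", "unstab": "boolean"},
)
print(df)
print(df.dtypes)
```

The trade-off is that downstream code has to cope with missing values in the boolean columns, whereas the converter version always yields plain bool.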

