I have a tab separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files read_csv correctly interprets the column as a string after seeing some non integer values, but with larger files, this doesnt work:
import pandas as pd
df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})
df.to_csv('test', sep='', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='')
print df2['a'].unique()
for a in df2['a'][262140:262150]:
print repr(a)
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly 262144 is a power of 2 so I think inference and conversion is happening in chunks but is skipping some chunks.
I am fairly certain this is a bug, but would like a work around that perhaps uses quoting, though adding
quoting=csv.QUOTE_NONNUMERIC
for reading and writing does not fix the problem. Ideally I could work around this by quoting my string data and somehow force pandas to not do any inference on quoted data.
Using pandas 0.12.0
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…