python - Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:

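import pandas as pd

# Column 'a' is mostly the string '1', with a run of 'X' in the middle.
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000, 'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
# Inspect the rows around index 262144, where the type changes.
for a in df2['a'][262140:262150]:
    print(repr(a))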

output:

['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1

Interestingly, 262144 is a power of 2 (2^18), so I think inference and conversion are happening in chunks, but some chunks are being skipped.
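The chunk hypothesis can be illustrated with a pure-Python toy model. This is only a sketch and not pandas' actual parser: the helper names `infer_chunk` and `read_column_in_chunks` and the tiny sample column are made up for illustration. It assumes each chunk's type is inferred independently, so a chunk containing an 'X' stays as strings while a chunk of only digits becomes integers:

```python
def infer_chunk(values):
    """Convert a chunk to ints if every value parses, else keep the strings."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return list(values)


def read_column_in_chunks(values, chunk_size):
    """Infer types one chunk at a time and concatenate the results."""
    out = []
    for start in range(0, len(values), chunk_size):
        out.extend(infer_chunk(values[start:start + chunk_size]))
    return out


# Hypothetical small-scale version of column 'a' from the question.
column = ['1'] * 3 + ['X'] * 3 + ['1'] * 6

# Chunked inference: the first chunk contains 'X', the second does not.
mixed = read_column_in_chunks(column, chunk_size=8)

# Inference over the whole column at once sees the 'X' and keeps strings.
whole = read_column_in_chunks(column, chunk_size=len(column))

print(mixed)
print(whole)
```

In this model the values after the chunk boundary come back as integers, which matches the shape of the output above: strings up to a boundary, integers after it.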

I am fairly certain this is a bug, but I would like a workaround, perhaps one that uses quoting. However, adding quoting=csv.QUOTE_NONNUMERIC for both reading and writing does not fix the problem. Ideally, I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.

Using pandas 0.12.0

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:03:29+0000

To avoid having Pandas infer your data type, provide a converters argument to read_csv:

converters : dict. optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels

For your file this would look like:

df2 = pd.read_csv('test', sep='\t', converters={'a': str})

My reading of the docs is that you do not need to specify converters for every column. pandas should continue to infer the data types of the unspecified columns.
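As a small self-contained check, here is a sketch that reads an in-memory sample instead of the 'test' file. The sample data and variable names are made up for illustration. It also shows the dtype argument of read_csv, which in current pandas releases is another way to request strings for a single column:

```python
import io

import pandas as pd

# Hypothetical tab-separated sample: column 'a' looks numeric throughout.
tsv = "a\tb\n001\tb\n2\tb\n"

# Default behaviour: column 'a' is inferred as integers, losing the leading zeros.
inferred = pd.read_csv(io.StringIO(tsv), sep='\t')

# converters: every raw field of column 'a' is passed through str, so it stays text.
by_converter = pd.read_csv(io.StringIO(tsv), sep='\t', converters={'a': str})

# dtype: ask the parser for strings in column 'a'; other columns are still inferred.
by_dtype = pd.read_csv(io.StringIO(tsv), sep='\t', dtype={'a': str})

print(inferred['a'].tolist())
print(by_converter['a'].tolist())
print(by_dtype['a'].tolist())
```

The pandas documentation for the low_memory option also notes that the file is processed internally in chunks, which can result in mixed type inference, and that specifying the type with dtype (or setting low_memory=False) avoids this.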
