One solution is to use a StringIO object. The only difference from a real file is that it keeps all the data in memory, instead of writing to disk and reading back in.
The code is as follows (note: Python 3!):
    import numpy as np
    import pandas as pd
    import csv
    from io import StringIO

    data = [['string_boolean', 'numeric', 'numeric_missing'],
            ['FALSE', 23, 50],
            ['TRUE', 19, 12],
            ['FALSE', 4.8, '']]

    with StringIO() as fobj:
        writer = csv.writer(fobj)
        writer.writerows(data)
        fobj.seek(0)
        df = pd.read_csv(fobj)

    print(df.head(3))
    print(df.string_boolean.dtype)   # automatically converted to bool
    print(df.numeric.dtype)          # float, as expected
    print(df.numeric_missing.dtype)  # float; the empty string becomes NaN
The with StringIO() as fobj isn't really necessary: fobj = StringIO() works just as well. But since the context manager closes the StringIO() object at the end of its block, the df = pd.read_csv(fobj) has to be inside it.
Note also the fobj.seek(0), which is another necessity: your solution closes and reopens a file, which automatically sets the file pointer to the start of the file, whereas writing to a StringIO object leaves the pointer at its end.
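To see why, here is a minimal illustration of the pointer position (using the same io.StringIO as above):

    buf = StringIO()
    buf.write('a,b\n1,2\n')
    print(buf.tell())  # 8: after writing, the pointer sits at the end
    print(buf.read())  # empty string: reading starts from the pointer
    buf.seek(0)        # rewind to the start
    print(buf.read())  # now the full contents come back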
A note on Python 2 vs Python 3
I actually tried to make the above code Python 2/3 compatible. That became a mess, for the following reason: Python 2 has an io module, just like Python 3, whose StringIO class makes everything unicode (in Python 2 as well; in Python 3 it is, of course, the default).
That would be great, except that the csv writer module in Python 2 is not unicode compatible.
The alternative, then, is to use the (older) Python 2 (c)StringIO module, for example as follows:
    try:
        from cStringIO import StringIO  # Python 2
    except ImportError:  # Python 3; ImportError also works before Python 3.6
        from io import StringIO
and things will be plain (byte) strings in Python 2, and unicode in Python 3.
Except that now, cStringIO.StringIO does not have a context manager, and the with statement will fail. As I mentioned, it is not really necessary, but I was keeping things as close as possible to your original code.
In other words, I could not find a nice way to stay close to the original code without ridiculous hacks.
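For completeness, here is a sketch of the variant without the context manager, which should work with either StringIO class (closing manually instead):

    fobj = StringIO()
    writer = csv.writer(fobj)
    writer.writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)
    fobj.close()  # close manually, since there is no with block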
I've also looked at avoiding the CSV writer completely, which leads to:

    text = '\n'.join(','.join(str(item).strip("'") for item in items)
                     for items in data)

    with StringIO(text) as fobj:
        df = pd.read_csv(fobj)
which is perhaps neater (though a bit less clear), and Python 2/3 compatible. (I don't expect it to work for everything the csv module can handle, but here it works fine.)
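As an example of where the plain join falls short (this failure case is my own illustration, not from the data above): a field containing a comma needs the quoting that csv.writer adds automatically:

    import csv, sys

    row = ['hello, world', 23]
    print(','.join(str(item) for item in row))
    # prints: hello, world,23   -- three fields instead of two
    csv.writer(sys.stdout).writerow(row)
    # prints: "hello, world",23 -- correctly quoted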
Why can't pd.DataFrame(...) do the conversion?
Here, I can only speculate.
I would think the reasoning is that when the input consists of Python objects (dicts, lists), the input is known and in the hands of the programmer. Therefore it is unlikely, perhaps even illogical, that such input would contain strings like 'FALSE' or ''. Instead, it would normally contain the objects False and np.nan (or math.nan), since the programmer would already have taken care of the (string) translation.
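In code, such "already translated" input would look something like this (a sketch of the same table built from real Python/NumPy objects):

    df = pd.DataFrame({'string_boolean': [False, True, False],
                       'numeric': [23, 19, 4.8],
                       'numeric_missing': [50, 12, np.nan]})
    print(df.dtypes)  # bool, float64, float64 -- no parsing required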
For a file (CSV or other), on the other hand, the input can be anything: your colleague might send an Excel CSV file, or someone else might send you a Gnumeric CSV file. I don't know how standardised CSV files are, but you'd probably need some code to allow for exceptions, and overall for the conversion of the strings to Python (NumPy) format.
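Indeed, pd.read_csv exposes exactly such knobs; for example (the 'yes'/'no' and 'N/A' spellings here are hypothetical variants a foreign CSV might use):

    raw = StringIO('flag,value\nyes,1\nno,N/A\n')
    df = pd.read_csv(raw, true_values=['yes'], false_values=['no'],
                     na_values=['N/A'])
    print(df.dtypes)  # flag: bool, value: float64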
So in that sense, it is actually illogical to expect pd.DataFrame(...) to accept just anything: instead, it should accept input that is properly formatted.
You might argue for a convenience method that takes a list like yours, but a list is not a CSV file (which is just a bunch of characters, including newlines). Plus, I expect pd.read_csv has the option to read files in chunks (perhaps even line by line), which becomes harder if you feed it a string with newlines instead: you can't really read that line by line, since you would have to split it on newlines and keep all the lines in memory, and the full string is already in memory anyway, instead of on disk. But I digress.
Besides, the StringIO trick takes just a few lines to perform precisely this conversion.
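If it comes up often, those few lines can be wrapped in a small helper; dataframe_from_rows here is a hypothetical convenience function of my own, not part of pandas:

    def dataframe_from_rows(rows):
        """Build a DataFrame from a list of rows, reusing read_csv's parsing."""
        with StringIO() as fobj:
            csv.writer(fobj).writerows(rows)
            fobj.seek(0)
            return pd.read_csv(fobj)

    df = dataframe_from_rows(data)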