I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it.
I tried to open the file and do the pre-processing in a generator, which I then hand over to read_csv():
import re
import pandas as pd

def in_stream():
    # Yield each line of the file after applying the search/replace.
    with open("some.csv") as csvfile:
        for line in csvfile:
            l = re.sub(r'","', r',', line)
            yield l

df = pd.read_csv(in_stream())
Sadly, this just throws a
ValueError: Invalid file path or buffer object type: <class 'generator'>
This surprised me: from looking at the pandas source, I'd expect read_csv() to work on iterators, and thus on generators.
The only thing I found is this article (Using a custom object in pandas.read_csv()), which outlines how to wrap a generator in a file-like object, but it seems to work only on files in byte mode.
So in the end I'm looking for a pattern to build a pipeline that opens a file, reads it line-by-line, allows pre-processing and then feeds it into e.g. pandas.read_csv().
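One pattern that might fit, as a minimal sketch rather than a definitive solution: run the lines through the generator, collect the result in an io.StringIO, and pass that text buffer to pandas.read_csv(), which accepts file-like objects with a read() method. The helper names fix_lines, to_text_buffer and read_fixed_csv are made up for illustration, and the sketch assumes the cleaned file fits in memory.

```python
import io
import re


def fix_lines(lines):
    """Yield each line with the search/replace from the question applied."""
    for line in lines:
        yield re.sub(r'","', r',', line)


def to_text_buffer(lines):
    """Collect an iterable of str lines into an in-memory text buffer."""
    # Assumption: the whole cleaned file fits in memory.
    return io.StringIO("".join(lines))


def read_fixed_csv(path, **kwargs):
    """Open path, clean it line by line, and parse the result with pandas."""
    # Imported here so the two helpers above depend only on the standard library.
    import pandas as pd

    with open(path) as csvfile:
        return pd.read_csv(to_text_buffer(fix_lines(csvfile)), **kwargs)
```

With this, df = read_fixed_csv("some.csv") would replace the failing call. The trade-off is that the cleaned text is held in memory once before parsing; a truly streaming variant would need a file-like wrapper around the generator instead.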