So I have this dataset where there are sometimes random newline characters entered into some cells, and I need to delete them.
this is what I've tried:
with open ('filepath') as inf, open('filepath', 'w') as outf:
for line in inf:
outf.write(line.replace('
', ''))
Unfortunately, this removed ALL newline characters, including the ones at the end of the row, which turns my csv file into a big one-liner
Does anyone know how I can only delete the random newline characters and not the 'real' endline characters?
Edit: If it helps, each 'real' new line starts with a 6 digit string of numbers (besides for the header line). Maybe some regex pattern that looks ahead to detect if there's some number string could work?
Edit2: I've tried using pandas to edit it with:
df = pd.read_csv(filepath)
for i in df.columns:
if df[i].dtype==np.object:
df[i] = df[i].str.replace('
','')
weirdly, this works if I copy the stuff inside the .csv into a new text file, but it doesn't work on my original csv file, and I'm not sure why.
Final Edit:
So big thanks to DDS for his help. Managed to get it to work using this:
num_cols = 48
buf = ""
with open (filepath) as inf, open (filepath, 'w') as outf:
for line in inf:
if len(line.split(',')) < num_cols:
buf += line.replace('
', '')
if len(buf.split(',')) == num_cols:
outf.write(buf+'
')
else: continue
buf = ""
else:
outf.write(line)
question from:
https://stackoverflow.com/questions/65932262/how-to-remove-newline-characters-in-csv-columns-without-removing-the-endrow-new 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…