Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.2k views
in Technique[技术] by (71.8m points)

python - How to remove newline characters in csv columns, without removing the endrow newline character?

So I have this dataset where there are sometimes random newline characters entered into some cells, and I need to delete them.

this is what I've tried:

with open ('filepath') as inf, open('filepath', 'w') as outf:
    for line in inf:
        outf.write(line.replace('
', ''))

Unfortunately, this removed ALL newline characters, including the ones at the end of the row, which turns my csv file into a big one-liner

Does anyone know how I can only delete the random newline characters and not the 'real' endline characters?

Edit: If it helps, each 'real' new line starts with a 6 digit string of numbers (besides for the header line). Maybe some regex pattern that looks ahead to detect if there's some number string could work?

Edit2: I've tried using pandas to edit it with:

df = pd.read_csv(filepath)

for i in df.columns:
    if df[i].dtype==np.object:
        df[i] = df[i].str.replace('
','')

weirdly, this works if I copy the stuff inside the .csv into a new text file, but it doesn't work on my original csv file, and I'm not sure why.

Final Edit:

So big thanks to DDS for his help. Managed to get it to work using this:

num_cols = 48

buf = ""

with open (filepath) as inf, open (filepath, 'w') as outf:
    for line in inf:
        if len(line.split(',')) < num_cols:
            buf += line.replace('
', '')
            if len(buf.split(',')) == num_cols:
                outf.write(buf+'
')
            else: continue
            buf = ""
        else:
            outf.write(line)
question from:https://stackoverflow.com/questions/65932262/how-to-remove-newline-characters-in-csv-columns-without-removing-the-endrow-new

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Assuming you know the number of fields per line and no field contains the csv separator (comma): you could do like this:

    number_of_columns_in_the_table = 5 #assuming a line has 5 columns
    with open ('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            #check if the number of "splits equals the nummber of fields"
            if len(line.split(',')) < number_of_columns_in_the_table
               
 outf.write(line.replace('
', ''))
            else:
                outf.write(line)

EDIT

number_of_columns_in_the_table = 5 #assuming a line has 5 columns
    with open ('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            #check if the number of "splits equals the nummber of fields"
            if len(line.split(',')) < number_of_columns_in_the_table
               buf += line.replace('
', '');
           if len(line.split(',')) == number_of_columns_in_the_table
               outf.write( buf)
            else:
                outf.write(line)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...