bash - Script to find duplicates in a csv file


posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)


I have a 40 MB csv file with 50,000 records. It's a giant product listing. Each row has close to 20 fields [Item#, UPC, Desc, etc.].

How can I,

a) Find and print duplicate rows. [This file was built by appending several files, so it contains multiple header rows that I need to remove. I want to know exactly which rows are duplicated first.]

b) Find and print duplicate rows based on a column. [For example, to see if a UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried,

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected result.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks

See: Remove duplicate rows from a large file in Python over on Stack Overflow


1 Reply

深蓝 · Answer 1 · 2021-10-23T21:17:23+0000

Find and print duplicate rows in Perl:

perl -ne 'print if $SEEN{$_}++' < input-file
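Since the question mentions that Python is also installed, the same idea can be sketched there. This is only a rough equivalent of the one-liner above, not a tested replacement: the function name `duplicate_lines` is made up for illustration, and unlike the Perl version it strips the line ending before comparing, so a final line without a trailing newline still matches earlier copies.

```python
def duplicate_lines(lines):
    """Yield each line that has already been seen earlier in the input.

    Like the Perl one-liner, only the second and later occurrences are
    yielded. Line endings are stripped before comparison (an assumption
    made here, not something the Perl version does).
    """
    seen = set()
    for line in lines:
        text = line.rstrip("\r\n")
        if text in seen:
            yield text
        else:
            seen.add(text)
```

To run it on a file, pass an open file object, for example the result of open("largefile.csv", newline=""), and print each yielded line. Repeated header rows show up in the output just like any other duplicated row.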

Find and print rows that repeat a value in a given column in Perl. For example, for the 5th column, where fields are separated by commas:

perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
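Splitting on every comma treats a quoted field such as "Widget, large" as two fields. If the file contains quoted fields, a Python sketch using the standard csv module may be safer. The function name `rows_sharing_column` and the default column index are assumptions for illustration; the question does not say which column holds the UPC.

```python
import csv
from collections import defaultdict


def rows_sharing_column(lines, column=1):
    """Group parsed CSV rows by one column; keep values that occur more than once.

    `column` is a zero-based index. The default of 1 is only a guess at
    where the UPC sits and should be adjusted to the real file layout.
    """
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) > column:  # skip blank or short rows
            groups[row[column]].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}
```

Passing an open file object, for example open("largefile.csv", newline=""), returns a dictionary mapping each repeated value to all rows that carry it. Unlike the Perl one-liner, this keeps the first occurrence too, which makes it easier to compare the products that share a UPC. Repeated header rows and exact duplicate rows will also appear in the result.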
