Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Shell script to remove duplicate lines from a file

I have a file with contents similar to the following:

<header>
    Jacob||Pune||ABC Corp||HR||33000||Lane-4, Opposite school
    Jacob||Montreal||Titan||Manager||63000||Lane-3, Near mall
    Reese||Nairobi||Reliance||Producer||35000||Sector-A, Behind post office 
    Travis||Colombo||Warner Bros||Director||7800||Near Jantar Mantar
    Jacob||Montreal||Titan||HR||63000||Lane-3, Near mall
<footer>

The file consists of a header, a footer, and data rows in between.

I want to remove duplicate rows from the file. A row counts as a duplicate if its combination of column 1 and column 4 has already appeared in an earlier row.

In the sample, three rows have Jacob in column 1, but only two of them also have HR in column 4.

So only rows 1 and 5 are duplicates based on the combination of column 1 and column 4, and row 5 should be removed.

How can I write a shell script and a Python script to do this? I want a solution in both shell and Python.

question from:https://stackoverflow.com/questions/65883572/shell-script-to-remove-duplicate-lines-from-a-file

1 Reply

Using awk:

awk -F '|' '/<header>/ { delete map;print;next } { if (map[$1,$7]!="1") { print $0 } map[$1,$7]="1" }' file

Set the field delimiter to "|". When a line contains "<header>", delete the array called map, print the line, and skip to the next line. For every other line, check whether the 1st and 7th fields already exist together as an index in the array map. If they do not, print the line. In either case, record the pair by setting map[$1,$7] to "1". The 7th field is used because splitting on a single "|" makes each "||" produce an empty field in between, so column 4 of the data ends up in field 7.

Output:

<header>
    Jacob||Pune||ABC Corp||HR||33000||Lane-4, Opposite school
    Jacob||Montreal||Titan||Manager||63000||Lane-3, Near mall
    Reese||Nairobi||Reliance||Producer||35000||Sector-A, Behind post office
    Travis||Colombo||Warner Bros||Director||7800||Near Jantar Mantar
<footer>
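
The question also asks for Python, which the awk answer does not cover. The following is only a minimal sketch of the same idea, not a tested drop-in solution. The function name dedupe_rows and its parameters are made up for illustration. It assumes rows are separated by "||", that the key is column 1 plus column 4, and that any line with fewer than four columns (the header and footer here) should pass through unchanged. Unlike the awk version, it does not reset the set of seen keys when it meets a header line.

```python
def dedupe_rows(lines, sep="||", key_cols=(0, 3)):
    """Keep the first row for each key; drop later rows with the same key.

    key_cols are zero-based column indexes, so (0, 3) means col 1 and col 4.
    """
    seen = set()
    kept = []
    for line in lines:
        fields = line.strip().split(sep)
        # Assumption: lines with too few columns (header, footer, blank
        # lines) are not data rows and are kept as they are.
        if len(fields) <= max(key_cols):
            kept.append(line)
            continue
        key = tuple(fields[i].strip() for i in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Sample data copied from the question.
sample = """<header>
    Jacob||Pune||ABC Corp||HR||33000||Lane-4, Opposite school
    Jacob||Montreal||Titan||Manager||63000||Lane-3, Near mall
    Reese||Nairobi||Reliance||Producer||35000||Sector-A, Behind post office
    Travis||Colombo||Warner Bros||Director||7800||Near Jantar Mantar
    Jacob||Montreal||Titan||HR||63000||Lane-3, Near mall
<footer>
""".splitlines(keepends=True)

print("".join(dedupe_rows(sample)), end="")
```

To run it against a real file, read the lines with open(path) and pass the file object to dedupe_rows in place of sample. Splitting on the full "||" separator means column 4 is index 3 here, whereas awk with a single "|" delimiter sees it as field 7.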


...