How to remove duplicate words from a string in a Bash script?

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a string containing duplicate words, for example:

abc, def, abc, def

How can I remove the duplicates? The string that I need is:

abc, def

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:27:30+0000

We have this test file:

$ cat file
abc, def, abc, def

To remove duplicate words:

$ sed -r ':a; s/([[:alnum:]]+)(.*)\1/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def

:a

This defines a label a.

s/([[:alnum:]]+)(.*)\1/\1\2/g

This looks for a run of alphanumeric characters that occurs again later on the line and removes the later occurrence. Because the pattern has no word-boundary anchors, it can also match a repeated substring inside a longer word.

ta

If the last substitution command resulted in a change, this jumps back to label a to try again.

In this way, the code keeps looking for duplicates until none remain.

s/(, )+/, /g; s/, *$//

These two substitution commands clean up any leftover comma-space combinations.
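
The steps above can also be laid out as a commented, multi-line sed script wrapped in a shell function. This is only a sketch: the function name dedupe_words is made up for illustration, and it assumes a sed that accepts -E with back-references in the pattern (GNU sed and the BSD sed on macOS both do).

```shell
#!/usr/bin/env bash

# dedupe_words: hypothetical helper name; reads lines on stdin and
# writes each line with repeated words removed.
dedupe_words() {
  sed -E '
    # label used as the loop target
    :a
    # drop a later repeat of an earlier alphanumeric run
    s/([[:alnum:]]+)(.*)\1/\1\2/g
    # if the substitution changed the line, loop back to a
    ta
    # collapse the runs of ", " left behind by removed words
    s/(, )+/, /g
    # strip a trailing comma and any spaces after it
    s/, *$//
  '
}

str='abc, def, abc, def'
result=$(dedupe_words <<< "$str")
printf '%s\n' "$result"   # prints: abc, def
```

Feeding the variable through a here-string (<<<) avoids a temporary file when the string already lives in a Bash variable.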

On macOS or another BSD system, try:

sed -E -e ':a' -e 's/([[:alnum:]]+)(.*)\1/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

sed easily handles input either from a file, as shown above, or from a shell string as shown below:

$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/([[:alnum:]]+)(.*)\1/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef
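
If the words may share substrings (for example ab and cab), a pure-Bash alternative that compares whole list items may be safer. Note that this is a different technique from the sed loop above, not part of the original answer: it splits on commas and tracks items already seen in an associative array. The function name dedupe_list is hypothetical, and the sketch assumes Bash 4 or later.

```shell
#!/usr/bin/env bash

# dedupe_list: hypothetical helper; takes one comma-separated string
# and prints it with repeated items removed, keeping first-seen order.
# Assumes Bash 4+ (associative arrays).
dedupe_list() {
  local input=$1 item out=''
  local -A seen=()
  local -a items

  # Split the input on commas into an array.
  IFS=',' read -r -a items <<< "$input"

  for item in "${items[@]}"; do
    # Trim leading and trailing whitespace from the item.
    item="${item#"${item%%[![:space:]]*}"}"
    item="${item%"${item##*[![:space:]]}"}"
    # Skip empty items such as those produced by ", ,".
    [[ -z $item ]] && continue
    # Keep the item only the first time it appears.
    if [[ -z ${seen[$item]+x} ]]; then
      seen[$item]=1
      out+="${out:+, }$item"
    fi
  done

  printf '%s\n' "$out"
}

dedupe_list 'ab, cd, cd, ab, ef'   # prints: ab, cd, ef
```

Since whole items are compared, an input such as 'ab, cab' comes back unchanged.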