regex - Replace all whitespace with a line break/paragraph mark to make a word list

Question

Welcome To Ask or Share your Answers For Others

regex - Replace all whitespace with a line break/paragraph mark to make a word list

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:58:32+0000

For reasonably modern versions of sed, edit the standard input to yield the standard output with

$ echo 'τ?χνη βιβλ?ο γη κ?πο?' | sed -E -e 's/[[:blank:]]+/
/g'
τ?χνη
βιβλ?ο
γη
κ?πο?

If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with

sed -E -e 's/[[:blank:]]+/
/g' lesson1 lesson2 > all-vocab

What it means:

The character class [[:blank:]] matches either a single space character or a single tab character.
- Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
- The + quantifier means match one or more of the previous pattern.
- So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]+/ /g'. (Note the use of + rather than simple +.)

Perl Compatible Regexes

For those familiar with Perl-compatible regexes and a PCRE-capable sed, use s+ to match runs of at least one whitespace character, as in

sed -E -e 's/s+/
/g' old > new

or

sed -e 's/s+/
/g' old > new

These commands read input from the file old and write the result to a file named new in the current directory.

Maximum portability, maximum cruftiness

Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.

$ echo 'τ?χνη βιβλ?ο γη κ?πο?' | sed -e 's/[ ][ ]*/
/g'
τ?χνη
βιβλ?ο
γη
κ?πο?

Notes:

Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ ]) followed by zero or more of them ([ ]*).
Similarly, assuming sed does not understand for newline, we have to include it on the command line verbatim.
- The and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
  - Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
- This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.

Note on backslashes and quoting

The commands above all used single quotes ('') rather than double quotes (""). Consider:

$ echo '\\' ""
\\ \

That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

Categories

regex - Replace all whitespace with a line break/paragraph mark to make a word list