Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Problem with perl multiline matching

I'm trying to use a perl one-liner to update some code that spans multiple lines and am seeing some strange behavior. Here's a simple text file that shows the problem I'm seeing:

ABCD    START
         STOP    EFGH

I expected the following to work but it doesn't end up replacing anything:

perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt

After some experimenting I found that the \s+ in the original regex will match the newline but not any of the whitespace on the second line, and adding a second \s+ doesn't work either. So for now I'm using a workaround: an intermediate regex that removes only the newline:

perl -pi -e 's/START\s+/START/s' input.txt

This creates the following intermediate file:

ABCD    START            STOP    EFGH

Then I can run the original regex (although the /s is no longer needed):

perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt

This creates the final, desired file:

ABCD    REPLACE    EFGH

It seems like the intermediate step should not be necessary. Am I missing something?


1 Reply


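perl -p processes the file one line at a time. The regex you have is correct, but it is never matched against a string that contains both lines: each line, with its trailing newline, is matched on its own. That is also why \s+ appeared to match the newline but none of the whitespace on the second line.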

A simple strategy, assuming the file will fit in memory, is to read the whole thing (do this without -p):

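$/ = undef;                          # no input record separator: <> reads the whole file at once
$file = <>;                          # slurp everything into one string
$file =~ s/START\s+STOP/REPLACE/sg;  # \s+ can now span the newline between START and STOP
print $file;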

Note, I have added the /g modifier to specify global replacement.

As a shortcut for all that extra boilerplate, you can use your existing one-liner with the -0777 option, which makes perl read the whole file as a single record: perl -0777pi -e 's/START\s+STOP/REPLACE/sg' input.txt. The /g is still needed if there may be more than one replacement to make within the file.

A hiccup that you might run into, although not with this regex: if the regex were START.+STOP (where the /s modifier is what lets . match a newline) and the file contains multiple START/STOP pairs, greedy matching of .+ will eat everything from the first START to the last STOP. You can use non-greedy matching (match as little as possible) with .+? instead.

If you want to use the ^ and $ anchors for line boundaries anywhere in the string, then you also need the /m regex modifier.

