Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - Word Wrapping with Regular Expressions

EDIT FOR CLARITY - I know there are ways to do this in multiple steps, or using LINQ or vanilla C# string manipulation. I am using a single regex call because I want practice with complex regex patterns. - END EDIT

I am trying to write a single regular expression that will perform word wrapping. It's extremely close to the desired output, but I can't quite get it to work.

Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\r\n", RegexOptions.Multiline)

This is correctly wrapping words for lines that are too long, but it's adding a line break when there already is one.

Input

"This string is really long. There are a lot of words in it.
Here's another line in the string that's also very long."

Expected Output

"This string is 
really long. There 
are a lot of words 
in it.
Here's another line 
in the string that's 
also very long."

Actual Output

"This string is 
really long. There 
are a lot of words 
in it.

Here's another line 
in the string that's 
also very long.
"

Note the double "\r\n" between sentences where the input already had a line break, and the extra "\r\n" that was added at the end.

Perhaps there's a way to conditionally apply different replacement patterns? I.e., if the match ends in a line break, use the replacement pattern "$1"; otherwise, use "$1\r\n".
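
One way to sketch this conditional replacement in C# is a MatchEvaluator callback in place of a fixed replacement string. The class and method names below are hypothetical, and the end-of-input check is an extra assumption (not stated in the idea above) added so that no line break is appended after the final chunk.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalWrap
{
    // Hypothetical helper: same pattern as in the question, but the
    // replacement is chosen per match instead of always being "$1\r\n".
    public static string Wrap(string text)
    {
        return Regex.Replace(
            text,
            @"(?<=^|\G)(.{1,20}(\s|$))",
            m =>
            {
                // The pattern always consumes at least one character,
                // so m.Value is never empty here.
                char last = m.Value[m.Length - 1];
                bool endsWithBreak = last == '\n' || last == '\r';

                // Assumption: the last chunk of the input gets no break.
                bool atEndOfInput = m.Index + m.Length == text.Length;

                return (endsWithBreak || atEndOfInput)
                    ? m.Value
                    : m.Value + "\r\n";
            },
            RegexOptions.Multiline);
    }
}
```

Calling ConditionalWrap.Wrap on the input above should leave the existing line break alone and add breaks only where a line was split. Words longer than 20 characters are not handled by this pattern.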

Here's a similar question, about wrapping a string with no white space, that I used as a starting point: "Regular expression to find unbroken text and insert space".


1 Reply

This was quick-tested in Perl.

Edit - This regex simulates the word wrap used (for better or worse) by MS-Windows Notepad.exe.

 # MS-Windows  "Notepad.exe Word Wrap" simulation
 # ( N = 16 )
 # ============================
 # Find:     @"(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))"
 # Replace:  @"$1\r\n"
 # Flags:    Global     

 # Note - Through trial and error, it appears Notepad accepts an extra whitespace character
 # (possibly in the N+1 position) to help alignment. This does not matter because its viewport hides it.
 # No whitespace is trimmed, so the wrapped buffer could be reconstituted by inserting/detecting a
 # wrap-point code that is different from a linebreak.
 # This regex works on unwrapped source, but could probably be adjusted to produce/work on wrapped buffer text.
 # To reconstitute the source, all that is needed is to remove the wrap code, which is probably just an extra
 # linebreak character.

 (?:
      # -- Words/Characters 
      (                       # (1 start)
           (?>                     # Atomic Group - Match words with valid breaks
                .{1,16}                 #  1-N characters
                                        #  Followed by one of 4 prioritized, non-linebreak whitespace
                (?:                     #  break types:
                     (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                     [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                  |  (?= \r? \n )            # 2. - Ahead a linebreak
                  |  $                       # 3. - EOS
                  |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                )
           )                       # End atomic group
        |  
           .{1,16}                 # No valid word breaks, just break on the N'th character
      )                       # (1 end)
      (?: \r? \n )?           # Optional linebreak after Words/Characters
   |  
      # -- Or, Linebreak
      (?: \r? \n | $ )        # Stand alone linebreak or at EOS
 )
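
For readers coming from the C# question, here is a minimal sketch of how the pattern above might be applied with Regex.Replace. The class name, the width parameter, the CRLF normalization, and the trailing trim are all assumptions added for illustration and are not part of the tested Perl code. The normalization is there because '.' in .NET matches '\r', which would otherwise be captured into group 1. The trim is there because the pattern also matches the empty string at the end of input, which appends an extra line break.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NotepadStyleWrap
{
    // Hypothetical helper: builds the pattern for a given wrap width N.
    public static string Wrap(string text, int width)
    {
        string n = width.ToString(CultureInfo.InvariantCulture);
        string pattern =
            @"(?:((?>.{1," + n + @"}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))"
            + @"|.{1," + n + @"})(?:\r?\n)?|(?:\r?\n|$))";

        // Assumption: work on LF-only text so '.' never consumes a '\r'.
        string normalized = text.Replace("\r\n", "\n");

        // Every match is replaced by group 1 followed by a line break.
        string wrapped = Regex.Replace(normalized, pattern, "$1\r\n");

        // Assumption: drop the line breaks added at the very end. This
        // also removes any trailing blank lines the input itself had.
        return wrapped.TrimEnd('\r', '\n');
    }
}
```

NotepadStyleWrap.Wrap(text, 20) would be the call corresponding to the question's 20-character limit. As in Notepad, whitespace at the wrap point is kept rather than trimmed, so a line can be one character longer than the width.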

Test Case: The wrap width N is 16. The output matches Notepad's at this width and at a variety of other widths.

 $/ = undef;

 $string1 = <DATA>;

 $string1 =~ s/(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))/$1\r\n/g;

 print $string1;

 __DATA__
 hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
 bbbbbbbbbbbbbbbbEDIT FOR CLARITY - I                    know there are  ways to do this in   multiple steps, or using LINQ or vanilla C#
 string manipulation. 

 The reason I am using a single regex call, is because I wanted practice. with complex
 regex patterns. - END EDIT
 pppppppppppppppppppUf

Output >>

 hhhhhhhhhhhhhhhh
 hhhhhhhhhhhhhhh
 bbbbbbbbbbbbbbbb
 EDIT FOR CLARITY 
 - I              
       know there 
 are  ways to do 
 this in   
 multiple steps, 
 or using LINQ or 
 vanilla C#
 string 
 manipulation. 

 The reason I am 
 using a single 
 regex call, is 
 because I wanted 
 practice. with 
 complex
 regex patterns. 
 - END EDIT
 pppppppppppppppp
 pppUf
