Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

C# regex to remove non-printable characters and control characters from text that mixes many different languages and Unicode letters

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I found in Ruby.

By non-printable, I mean all characters that do not show up in the output when one prints the input string; those should be deleted. Please note that I am looking for a C# regex; I do not have a problem with my code.




You may remove all control and other non-printable characters with

s = Regex.Replace(s, @"\p{C}+", string.Empty);

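The \p{C} Unicode category class matches every character in the "Other" general category (control, format, private use, surrogate and unassigned code points), including those outside the ASCII table, because in .NET, Unicode category classes are Unicode-aware by default.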
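To see what the one-liner does in practice, here is a minimal console sketch. The class and method names (OtherCategoryDemo, RemoveOther) are only illustrative and not part of any library; the pattern is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OtherCategoryDemo
{
    // Hypothetical helper name; it only wraps the one-liner from the answer.
    public static string RemoveOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);

    public static void Main()
    {
        // BEL (U+0007) is Cc and ZERO WIDTH SPACE (U+200B) is Cf: both are removed.
        Console.WriteLine(RemoveOther("a\u0007b\u200Bc"));   // abc

        // TAB and LF are Cc as well, so tabs and line breaks disappear too.
        Console.WriteLine(RemoveOther("one\ttwo\nthree"));   // onetwothree
    }
}
```

Two things to keep in mind: tabs and line breaks are control characters, so they are stripped as well, and since .NET regexes work on UTF-16 code units, the surrogate halves of astral characters (many emojis, for instance) fall under \p{Cs} and should be removed by \p{C}+ too. If that is not what you want, pick subcategories as described below.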

Breaking it down into subcategories

  • To match only the basic control characters, you may use \p{Cc}+; there are 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F\u0009-\u000D\u0085]+ regex.
  • To match only the 161 other format chars, including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F), use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
  • To match the 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
  • To match the 2,048 Other, Surrogate code points, which are used in pairs to encode astral plane characters such as some emojis, you may use \p{Cs}+, or a [\uD800-\uDFFF]+ regex.
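The subcategories can also be combined. As a rough sketch (not the only way to do it), the following removes only Cc and Cf characters and uses .NET character class subtraction to keep tab, CR and LF. The names VisibleTextCleaner and Clean are made up for this example, and the choice of which characters to keep is an assumption you may want to change.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisibleTextCleaner
{
    // Cc and Cf only, minus TAB/CR/LF (character class subtraction is .NET-specific).
    // Surrogates (Cs) and private use (Co) are left alone, so astral characters survive.
    private static readonly Regex ControlAndFormat =
        new Regex(@"[\p{Cc}\p{Cf}-[\t\r\n]]+");

    // Hypothetical helper name for this sketch.
    public static string Clean(string s) =>
        ControlAndFormat.Replace(s, string.Empty);

    public static void Main()
    {
        // BEL and ZERO WIDTH SPACE are removed, TAB and LF are kept.
        string cleaned = Clean("a\tb\u0007c\u200Bd\n");
        Console.WriteLine(cleaned == "a\tbcd\n");            // True

        // A surrogate pair (U+1F600) is not in Cc or Cf, so it is untouched.
        string emoji = "x\uD83D\uDE00y";
        Console.WriteLine(Clean(emoji) == emoji);            // True
    }
}
```

Note that Cf also contains the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on for correct rendering, so for text in many languages you may prefer to subtract \u200C and \u200D as well.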
