Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.5k views
in Technique[技术] by (71.8m points)

c# - Regular expression to remove XML tags and their content

I have the following string and I would like to remove <bpt *>*</bpt> and <ept *>*</ept> (notice the additional tag content inside them that also needs to be removed) without using a XML parser (overhead too large for tiny strings).

The big <bpt i="1" x="1" type="bold"><b></bpt>black<ept i="1"></b></ept> <bpt i="2" x="2" type="ulined"><u></bpt>cat<ept i="2"></u></ept> sleeps.

Any regex in VB.NET or C# will do.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

EDIT:

I decided to add on to my solution with a better option. The previous option would not work if there were embedded tags. This new solution should strip all <**pt*> tags, embedded or not. In addition, this solution uses a back reference to the original [be] match so that the exact matching end tag is found. This solution also creates a reusable Regex object for improved performance so that each iteration does not have to recompile the Regex:

bool FoundMatch = false;

try {
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</1pt>");
    while(regex.IsMatch(yourstring) ) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

ADDITIONAL NOTES:

In the comments a user expressed worry that the '.' pattern matcher would be cpu intensive. While this is true in the case of a standalone greedy '.', the use of the non-greedy character '?' causes the regex engine to only look ahead until it finds the first match of the next character in the pattern versus a greedy '.' which requires the engine to look ahead all the way to the end of the string. I use RegexBuddy as a regex development tool, and it includes a debugger which lets you see the relative performance of different regex patterns. It also auto comments your regexes if desired, so I decided to include those comments here to explain the regex used above:

    // <([be])pt[^>]+>.+?</1pt>
// 
// Match the character "<" literally ?<?
// Match the regular expression below and capture its match into backreference number 1 ?([be])?
//    Match a single character present in the list "be" ?[be]?
// Match the characters "pt" literally ?pt?
// Match any character that is not a ">" ?[^>]+?
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) ?+?
// Match the character ">" literally ?>?
// Match any single character that is not a line break character ?.+??
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) ?+??
// Match the characters "</" literally ?</?
// Match the same text as most recently matched by backreference number 1 ?1?
// Match the characters "pt>" literally ?pt>?

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...