Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Unicode Regex; Invalid XML characters

The set of valid XML characters is well known. The spec defines it as:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

My question is whether it's possible to write a PCRE regular expression for this (or its inverse) without hard-coding the code points, by using Unicode general categories instead. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that it improperly covers linefeeds and tabs and misses some other invalid characters.


1 Reply


I know this isn't exactly an answer to your question, but it's helpful to have it here:

Regular Expression to match valid XML Characters:

[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]

So to remove invalid chars from XML, you'd do something like

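// Requires: using System.Text.RegularExpressions;
// Matches lone (unpaired) surrogates, control characters, U+FEFF, U+FFFE and U+FFFF;
// properly formed surrogate pairs are left intact.
private static readonly Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);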

/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}
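
To see what the pattern keeps and what it strips, here is a minimal self-contained sketch. The wrapper class XmlSanitizer and the Demo method are hypothetical names added for illustration; the pattern and RemoveInvalidXMLChars are the ones from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only the pattern and RemoveInvalidXMLChars come from the answer.
public static class XmlSanitizer
{
    private static readonly Regex InvalidXmlChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    public static string RemoveInvalidXMLChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        return InvalidXmlChars.Replace(text, "");
    }

    // Illustrative check: returns true when the sample is cleaned as described.
    public static bool Demo()
    {
        // NUL, a lone high surrogate and U+FFFF are invalid;
        // the tab and the surrogate pair for U+1F600 are valid.
        string input = "a\u0000b\tc\uD800d\uD83D\uDE00e\uFFFF";
        string cleaned = RemoveInvalidXMLChars(input);
        return cleaned == "ab\tcd\uD83D\uDE00e";
    }
}
```

Note that the pattern is stricter than the production quoted in the question: it also strips U+007F-U+009F and U+FEFF, which fall inside the ranges that the production allows.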

I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
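
For comparison, the production quoted in the question can also be checked directly, one code point at a time. That covers the [#x10000-#x10FFFF] range, which .NET strings store as surrogate pairs. This is only a sketch: the names XmlCharCheck, IsValidXmlCodePoint and IsValidXmlString are hypothetical, and it hard-codes the ranges, so it does not address the general-category part of the question.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class XmlCharCheck
{
    // Char production as quoted in the question:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the UTF-16 string; a well-formed surrogate pair always encodes
    // a code point in U+10000..U+10FFFF, which the production allows.
    public static bool IsValidXmlString(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low half of the pair
                continue;
            }
            // Lone surrogates land here and fail, since D800-DFFF is excluded.
            if (!IsValidXmlCodePoint(text[i])) return false;
        }
        return true;
    }
}
```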

