I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[u0009u000au000du0020-uD7FFuE000-uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
@"(?<![uD800-uDBFF])[uDC00-uDFFF]|[uD800-uDBFF](?![uDC00-uDFFF])|[x00-x08x0Bx0Cx0E-x1Fx7F-x9FuFEFFuFFFEuFFFF]",
RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…