For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:
function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}
I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.
- There is an unnecessary group (non-capturing, marked in the comments above) which can be removed.
- Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
- Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
- There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
- On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. This can overflow the stack, causing the Apache/PHP executable to seg-fault and crash silently, with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and contains many non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit way too high, at 100000. (According to the PCRE docs, this should be set to a value equal to the stack size divided by 500.)
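The "stack size divided by 500" rule from the last point can be sketched as a small helper. This is only an illustration: the function name pcre_recursion_limit_for_stack is hypothetical (it is not part of PHP or PCRE), the stack sizes are the two quoted above, and the code assumes PHP 7 or later for intdiv() and type declarations.

```php
<?php
// Hypothetical helper: derive a pcre.recursion_limit value from the stack
// size of the executable, using the "stack size / 500" rule of thumb.
function pcre_recursion_limit_for_stack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

// 256KB stack (the Win32 Apache build mentioned above) gives 524.
$win32Limit = pcre_recursion_limit_for_stack(256 * 1024);

// 8MB stack (a typical *nix build) gives 16777.
$nixLimit = pcre_recursion_limit_for_stack(8 * 1024 * 1024);

// Apply whichever value matches the executable actually in use.
ini_set('pcre.recursion_limit', (string) $nixLimit);
```

The stack size itself has to be known in advance; PHP offers no portable way to query it, so it is passed in as a plain number here.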
Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:
// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
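The null check above only tells you that preg_replace() failed, not why. As a rough sketch, preg_last_error() can be used to tell a recursion-limit failure from other PCRE errors. The names pcre_error_name and preg_replace_checked are hypothetical helpers, the pattern in the usage line is a simplified stand-in for the full regex, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: map a preg_last_error() code to a readable name.
function pcre_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    ];
    return $names[$code] ?? "unknown PCRE error ($code)";
}

// Hypothetical wrapper: run a replacement and report why it failed, if it did.
function preg_replace_checked(string $re, string $replacement, string $text): string
{
    $result = preg_replace($re, $replacement, $text);
    if ($result === null) {
        throw new RuntimeException('PCRE error: ' . pcre_error_name(preg_last_error()));
    }
    return $result;
}

// Simplified stand-in pattern (not the full regex above): collapse whitespace runs.
echo preg_replace_checked('/\s{2,}/', ' ', "a  \n  b"), "\n"; // prints "a b"
```

Throwing an exception instead of calling exit() lets the caller decide how to recover, for example by returning the text uncollapsed.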
P.S. I am intimately familiar with this PHP/Apache seg-fault problem, having helped the Drupal community while they were wrestling with it. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.
Hope this helps.