Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - Regular expression preg_quote symbols are not detected

I have a dictionary of swear words in the database, and the following works great:

preg_match_all("/".$f."(?:ing|er|es|s)?/si",$t,$m,PREG_SET_ORDER);

$t is the input text and $f = preg_quote("punk"), where "punk" comes from the database dictionary. At this point in the loop the expression is therefore:

preg_match_all("/punk(?:ing|er|es|s)?/si",$t,$m,PREG_SET_ORDER);

preg_quote escapes symbols, e.g. # becomes \#, so that the expression stays valid. But when the dictionary word is e.g. "F@CK" or "A$$", these symbols are not detected in the input string with the above expression. I have both a$$ and f@ck in the dictionary, but they do not work. If I remove preg_quote() on the word, the regular expression is invalid because these symbols are not escaped.

Any suggestions on how I can detect "a$$"?
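
As a sanity check, here is a minimal sketch of what preg_quote() returns for these two words and whether the resulting pattern matches. The single-quoted literals stand in for the database values, which is an assumption; single quotes are used so that PHP does not try to interpolate anything after a "$".

```php
<?php
// Single-quoted literals stand in for the database values (an assumption).
$quotedA = preg_quote('a$$', '/');   // "$" is a regex metacharacter, so it gets a backslash
$quotedF = preg_quote('f@ck', '/');  // "@" is not special, so the word is unchanged

var_dump($quotedA === 'a\$\$');      // bool(true)
var_dump($quotedF === 'f@ck');       // bool(true)

$t = 'You\'re such an a$$';
$hits = preg_match_all('/' . $quotedA . '(?:ing|er|es|s)?/si', $t, $m, PREG_SET_ORDER);

var_dump($hits);                     // int(1)
var_dump($m[0][0]);                  // string(3) "a$$"
```

In isolation the escaped pattern does find "a$$", which suggests looking at how the word and the text reach the expression rather than at preg_quote() itself.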

Edit:

So I guess the expression that is not working as intended would be, e.g.:

preg_match_all("/f@ck(?:ing|er|es|s)?/si",$t,$m,PREG_SET_ORDER);

This should find f@ck in $t.

UPDATE:

This is my usage, simply put: if there are matches in $m, replace them with "*****". This whole block is inside a loop over each word in the dictionary; $f is the dictionary word and $t is the input.

$f = preg_quote($f, "/"); // also escape the "/" delimiter
preg_match_all("/$f(?:ing|er|es|s)?/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/($f(?:ing|er|es|s)?)/si","*****",$t);
}

UPDATE: Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/a$$(?:ing|er|es|s)?/si"

UPDATE: This only happens when a word ends with a symbol. I tested "a$$hole" and it's fine, but "a$$" doesn't work.

ANOTHER UPDATE: Try this simplified version, with $words as a makeshift dictionary:

// Single quotes keep PHP from interpolating "$hole" as a variable.
$words = array('a$$', 'asshole', 'a$$hole', 'f@ck', 'f#ck', 'f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $f) {
   $quoted = preg_quote($f, "/");
   $text = preg_replace("/".$quoted."(?:ing|er|es|s)?/si",
                        str_repeat("*", strlen($f)),  // length of the unescaped word
                        $text);                       // was $t, which is not defined here
}

I expect to see "Input whatever you feel like here eg. ***" as a result.
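
For comparison, a sketch of a single-pass variant of the same idea. The function name mask_words and the longest-first ordering are illustrative assumptions, not part of the code above; it needs PHP 7.4 or later for the arrow functions.

```php
<?php
// Hypothetical helper: mask every dictionary word (plus an optional suffix)
// in one pass instead of one preg_replace() call per word.
function mask_words(string $text, array $words): string
{
    // Longest words first, so "a$$hole" is tried before "a$$" in the alternation.
    usort($words, fn($a, $b) => strlen($b) <=> strlen($a));

    // Escape each word, including the "/" delimiter.
    $escaped = array_map(fn($w) => preg_quote($w, '/'), $words);

    $pattern = '/(?:' . implode('|', $escaped) . ')(?:ing|er|es|s)?/i';

    // Replace each match with one asterisk per byte of the matched text.
    return preg_replace_callback(
        $pattern,
        fn($m) => str_repeat('*', strlen($m[0])),
        $text
    );
}

$words = ['a$$', 'asshole', 'a$$hole', 'f@ck', 'f#ck', 'f*ck'];
echo mask_words('Input whatever you feel like here eg. a$$', $words), "\n";
// Input whatever you feel like here eg. ***
```

Masking by the length of the match, rather than the length of the escaped word, keeps the number of asterisks in step with what was actually found in the text.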


1 Reply


Cannot Be Done

I'm sorry, but this “problem” is truly impossible to solve. Consider these:

  • ꜰᴜᴄᴋ is U+A730.1D1C.1D04.1D0B, "\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}"
  • ᶠᵘᶜᵏ is U+1DA0.1D58.1D9C.1D4F, "\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}"
  • 𝒻𝓊𝒸𝓀 is U+1D4BB.1D4CA.1D4B8.1D4C0, "\N{MATHEMATICAL SCRIPT SMALL F}\N{MATHEMATICAL SCRIPT SMALL U}\N{MATHEMATICAL SCRIPT SMALL C}\N{MATHEMATICAL SCRIPT SMALL K}"
  • 𝖋𝖚𝖈𝖐 is U+1D58B.1D59A.1D588.1D590, "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}\N{MATHEMATICAL BOLD FRAKTUR SMALL U}\N{MATHEMATICAL BOLD FRAKTUR SMALL C}\N{MATHEMATICAL BOLD FRAKTUR SMALL K}"
  • 𝓕𝒰𝒞𝒦 is U+1D4D5.1D4B0.1D49E.1D4A6, "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}\N{MATHEMATICAL SCRIPT CAPITAL U}\N{MATHEMATICAL SCRIPT CAPITAL C}\N{MATHEMATICAL SCRIPT CAPITAL K}"
  • ⓕⓤⓒⓚ is U+24D5.24E4.24D2.24DA, "\N{CIRCLED LATIN SMALL LETTER F}\N{CIRCLED LATIN SMALL LETTER U}\N{CIRCLED LATIN SMALL LETTER C}\N{CIRCLED LATIN SMALL LETTER K}"
  • Γ̵𐌵ᏟᏦ is U+393.335.10335.13DF.13E6, "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}\N{GOTHIC LETTER QAIRTHRA}\N{CHEROKEE LETTER TLI}\N{CHEROKEE LETTER TSO}"
  • ƒμɕѤ is U+192.3BC.255.464, "\N{LATIN SMALL LETTER F WITH HOOK}\N{GREEK SMALL LETTER MU}\N{LATIN SMALL LETTER C WITH CURL}\N{CYRILLIC CAPITAL LETTER IOTIFIED E}"
  • Г̵ЦСК is U+413.335.426.421.41A, "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}\N{CYRILLIC CAPITAL LETTER TSE}\N{CYRILLIC CAPITAL LETTER ES}\N{CYRILLIC CAPITAL LETTER KA}"
  • ғᵾȼƙ is U+493.1D7E.23C.199, "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}\N{LATIN SMALL LETTER C WITH STROKE}\N{LATIN SMALL LETTER K WITH HOOK}"
  • ϜυϚΚ is U+3DC.3C5.3DA.39A, "\N{GREEK LETTER DIGAMMA}\N{GREEK SMALL LETTER UPSILON}\N{GREEK LETTER STIGMA}\N{GREEK CAPITAL LETTER KAPPA}"
  • ЖↃUᆿ is U+416.2183.55.11BF, "\N{CYRILLIC CAPITAL LETTER ZHE}\N{ROMAN NUMERAL REVERSED ONE HUNDRED}\N{LATIN CAPITAL LETTER U}\N{HANGUL JONGSEONG KHIEUKH}"
  • ʞɔnɟ is U+29E.254.6E.25F, "\N{LATIN SMALL LETTER TURNED K}\N{LATIN SMALL LETTER OPEN O}\N{LATIN SMALL LETTER N}\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}"

It Gets Worse

And if you think those are easy, just try coping with all of these:

[several hundred generated variants: look-alike letters from many scripts, random combining marks, @, ∞, oo or 00 for the U, * or ¢ for the C, and swapped or reversed letter order] ...

And that’s not all: there are at least a bazingatillion more where those came from. Do you see now why this fundamentally cannot be done?
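
To make the point concrete, here is a small sketch, not part of the original answer, that runs the question's pattern against four of the spellings listed above. The code points come from that list; the labels and the choice of these four are illustrative assumptions.

```php
<?php
// Spellings from the list above, written as code point escapes (PHP 7+).
$variants = [
    'small capitals'    => "\u{A730}\u{1D1C}\u{1D04}\u{1D0B}",
    'modifier letters'  => "\u{1DA0}\u{1D58}\u{1D9C}\u{1D4F}",
    'circled letters'   => "\u{24D5}\u{24E4}\u{24D2}\u{24DA}",
    'Greek look-alikes' => "\u{3DC}\u{3C5}\u{3DA}\u{39A}",
];

// Same shape as the question's expression, with the "u" flag for UTF-8 input.
$pattern = '/' . preg_quote('fuck', '/') . '(?:ing|er|es|s)?/siu';

$missed = [];
foreach ($variants as $label => $spelling) {
    if (preg_match($pattern, $spelling) !== 1) {
        $missed[] = $label;  // the dictionary word did not match this spelling
    }
}

echo count($missed), ' of ', count($variants), " variants slipped through\n";
// 4 of 4 variants slipped through
```

Case-insensitive matching only folds upper and lower case, so none of these spellings is ever compared equal to the plain ASCII dictionary word.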

Full Disclosure

Because I don't believe in security through obscurity, here's the program that generates all those:

#!/usr/bin/env perl
#
# unifuck - print infinite permutations of fuck in unicode aliases
#
# Tom Christiansen <[email protected]>
# Mon May 23 09:37:27 MDT 2011

use strict;
use warnings;
use charnames ":full";

use Unicode::Normalize;

binmode(STDOUT, ":utf8");

our(@diddle, @fuck, %fuck); # initted down below
while (my($f,$u,$c,$k) = splice(@fuck, 0, 4)) {
    $fuck{F}{$f}++;
    $fuck{U}{$u}++;
    $fuck{C}{$c}++;
    $fuck{K}{$k}++;
} 

my @F = keys %{ $fuck{F} };
my @U = keys %{ $fuck{U} };
my @C = keys %{ $fuck{C} };
my @K = keys %{ $fuck{K} };

while (1) { 
    my $f = $F[rand @F];
    my $u = $U[rand @U];
    my $c = $C[rand @C];
    my $k = $K[rand @K];

    for ($f,$u,$c,$k) {  
        next if length > 1;
        next if /\p{EA=W}/;
        next if /\pM/;
        next if /\p{InEnclosedAlphanumerics}/;
        s/$/$diddle[rand @diddle]/          if rand(100) < 15;
        s/$/\N{COMBINING ENCLOSING KEYCAP}/ if rand(100) <  1;
    }

    if    (             0) {                                       }
    elsif (rand(100) <  5) {     $u        = q(@)                  } 
    elsif (rand(100) <  5) {        $c     = q(*)                  } 
    elsif (rand(100) < 10) {       ($c,$k) = ($k,$c)               } 
    elsif (rand(100) < 15) { ($f,$u,$c,$k) = reverse ($f,$u,$c,$k) }

    print NFC("$f $u $c $k\n");
}

BEGIN {

    # ok to have repeats in each position, since they'll be counted only once
    # per unique strings
    @fuck = (

        "\N{LATIN CAPITAL LETTER F}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{LATIN CAPITAL LETTER C}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER U}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{INFINITY}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER O}\N{LATIN SMALL LETTER O}",
        "\N{LATIN SMALL LETTER C}",
        "\N{KELVIN SIGN}",

        "\N{LATIN SMALL LETTER F}",
        "\N{DIGIT ZERO}\N{DIGIT ZERO}",
        "\N{CENT SIGN}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN LETTER SMALL CAPITAL F}",
        "\N{LATIN LETTER SMALL CAPITAL U}",
        "\N{LATIN LETTER SMALL CAPITAL C}",
        "\N{LATIN LETTER SMALL CAPITAL K}",

        "\N{MODIFIER LETTER SMALL F}",
        "\N{MODIFIER LETTER SMALL U}",
        "\N{MODIFIER LETTER SMALL C}",
        "\N{MODIFIER LETTER SMALL K}",

        "\N{MATHEMATICAL SCRIPT SMALL F}",
        "\N{MATHEMATICAL SCRIPT SMALL U}",
        "\N{MATHEMATICAL SCRIPT SMALL C}",
        "\N{MATHEMATICAL SCRIPT SMALL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL K}",

        "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}",
        "\N{MATHEMATICAL SCRIPT CAPITAL U}",
        "\N{MATHEMATICAL SCRIPT CAPITAL C}",
        "\N{MATHEMATICAL SCRIPT CAPITAL K}",

        "\N{CIRCLED LATIN SMALL LETTER F}",
        "\N{CIRCLED LATIN SMALL LETTER U}",
        "\N{CIRCLED LATIN SMALL LETTER C}",
        "\N{CIRCLED LATIN SMALL LETTER K}",

        "\N{PARENTHESIZED LATIN SMALL LETTER F}",
        "\N{PARENTHESIZED LATIN SMALL LETTER U}",
        "\N{PARENTHESIZED LATIN SMALL LETTER C}",
        "\N{PARENTHESIZED LATIN SMALL LETTER K}",

        "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{GOTHIC LETTER QAIRTHRA}",
        "\N{CHEROKEE LETTER TLI}",
        "\N{CHEROKEE LETTER TSO}",

        "\N{LATIN SMALL LETTER F WITH HOOK}",
        "\N{GREEK SMALL LETTER MU}",
        "\N{LATIN SMALL LETTER C WITH CURL}",
        "\N{CYRILLIC CAPITAL LETTER IOTIFIED E}",

        "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{CYRILLIC CAPITAL LETTER TSE}",
        "\N{CYRILLIC CAPITAL LETTER ES}",
        "\N{CYRILLIC CAPITAL LETTER KA}",

        "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}",
        "\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}",
        "\N{LATIN SMALL LETTER C WITH STROKE}",
        "\N{LATIN SMALL LETTER K WITH HOOK}",

        "\N{GREEK LETTER DIGAMMA}",
        "\N{GREEK SMALL LETTER UPSILON}",
        "\N{GREEK LETTER STIGMA}",
        "\N{GREEK CAPITAL LETTER KAPPA}",

        "\N{HANGUL JONGSEONG KHIEUKH}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{ROMAN NUMERAL REVERSED ONE HUNDRED}",
        "\N{CYRILLIC CAPITAL LETTER ZHE}",

        "\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}",
        "\N{LATIN SMALL LETTER N}",
        "\N{LATIN SMALL LETTER OPEN O}",
        "\N{LATIN SMALL LETTER TURNED K}",

        "\N{FULLWIDTH LATIN CAPITAL LETTER F}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER U}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER C}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER K}",

    );

    @diddle = (
        "\N{COMBINING GRAVE ACCENT}",
        "\N{COMBINING ACUTE ACCENT}",
        "\N{COMBINING CIRCUMFLEX ACCENT}",
        "\N{COMBINING TILDE}",
        "\N{COMBINING BREVE}",
        "\N{COMBINING DOT ABOVE}",
        "\N{COMBINING DIAERESIS}",
        "\N{COMBINING CARON}",
        "\N{COMBINING CANDRABINDU}",
        "\N{COMBINING INVERTED BREVE}",
        "\N{COMBINING GRAVE TONE MARK}",
        "\N{COMBINING ACUTE TONE MARK}",
        "\N{COMBINING GREEK PERISPOMENI}",
        "\N{COMBINING FERMATA}",
        "\N{COMBINING SUSPENSION MARK}",
    );

}
