php - Extracting urls from @font-face by searching within @font-face for replacement

I have a web service that rewrites urls in css files so that they can be served via a CDN.

The css files can contain urls to images or fonts.

I currently have the following regex to match ALL urls within the css file:

(url\(\s*(['"]?+))((?!(https?:|data:|\.\./|/))\S+)((\2)\s*\))

However, I now want to introduce support for custom fonts and need to target the urls within @font-face:

@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url("fonts/fontawesome-webfont.eot?#iefix&v=4.0.3") format("embedded-opentype"), url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"), url("fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"), url("fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular") format("svg");
  font-weight: normal;
  font-style: normal;
}

I then came up with the following:

@font-face\s*\{.*(url\(\s*(['"]?+))((?!(https?:|data:|\.\./|/))\S+)((\2)\s*\))\s*\}

The problem is that this matches everything and not just the urls inside. I thought I could use a lookbehind like so:

(?<=@font-face\s*\{.*)(url\(\s*(['"]?+))((?!(https?:|data:|\.\./|/))\S+)((\2)\s*\))(?<=-\s*\})

Unfortunately, PCRE (which PHP uses) does not support variable repetition within a lookbehind, so I am stuck.

I do not wish to check for fonts by their extension as some fonts have the .svg extension which can conflict with images with the .svg extension.

In addition, I would also like to modify my original regex to match all other urls that are NOT within an @font-face:

.someclass {
  background: url('images/someimage.png') no-repeat;
}

Since I am unable to use lookbehinds, how can I extract the urls from those within a @font-face and those that are not within a @font-face?


Disclaimer: You may be better off using a library, because this is tougher than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume that the braces {} are balanced within @font-face {}.
Note: I'm going to use "~" as delimiters instead of "/", which will relieve me from escaping later on in my expressions. Also note that I will be posting online demos from regex101.com; on that site I'll be using the g modifier. You should remove the g modifier and just use preg_match_all().
Let's use some regex fu!

Part 1 : matching url's that are not within @font-face {}

1.1 Matching @font-face {}

Oh yes, this might sound "weird" but you will notice later on why :)
We'll need some recursive regex here:

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1

demo
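
As a quick check of the recursive pattern above, here is a minimal sketch that runs it in compact form (without the x modifier) against a small stylesheet. The sample CSS and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample stylesheet, not taken from the question.
$css = <<<'CSS'
.a { background: url('images/a.png'); }
@font-face {
  font-family: 'Demo';
  src: url("fonts/demo.woff") format("woff");
}
.b { color: red; }
CSS;

// Group 1 is the balanced {...} block; (?1) recurses into it for nested braces.
$fontface_regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

preg_match_all($fontface_regex, $css, $blocks);

// $blocks[0] holds each complete @font-face block, braces included.
print_r($blocks[0]);
```

Only the @font-face block should be reported; the .a and .b rules are not part of any match.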

1.2 Escaping @font-face {}

We'll append (*SKIP)(*FAIL) right after the previous regex: once an @font-face {} block has been matched, it is discarded and the engine resumes searching after it. See this answer to get an idea how it works.

demo
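
To see the effect in isolation, here is a small sketch that skips @font-face blocks and matches every url(...) elsewhere. The one-line CSS and the deliberately simplified url\s*\([^)]*\) alternative are assumptions for the demo; the real url() pattern is built in the next step.

```php
<?php
// Hypothetical one-line stylesheet used only for this demonstration.
$css = '.a{background:url(a.png)} @font-face{src:url(f.woff)} .b{background:url(b.png)}';

// Left alternative: consume a whole @font-face block, then (*SKIP)(*FAIL)
// throws it away and prevents the engine from retrying inside it.
// Right alternative: a simplified url(...) matcher (assumption, demo only).
$regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)|url\s*\([^)]*\)~';

preg_match_all($regex, $css, $m);

// Only the url() calls outside @font-face remain.
print_r($m[0]);
```

The url(f.woff) inside the @font-face block never shows up, because the position right after the skipped block is where matching resumes.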

1.3 Matching url()

We'll use something like this:

url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2               # Match what was matched in group 2
\s*              # Match optionally some whitespaces
\)               # Match )

Note that I'm using \2 because I've appended this to the previous regex which has group 1.
Here's another use of ("|')(?:[^\\]|\\.)*?\1.

demo
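
A minimal sketch of the quote backreference on its own, with the value captured in an extra group. Used standalone, the quote is group 1, so the backreference is \1 rather than \2. The sample declarations and the extra capturing group are assumptions added for the demo.

```php
<?php
// Hypothetical declarations covering the three quoting styles CSS allows.
$css = 'a{background:url("x.png")} b{background:url(\'y.png\')} c{background:url( z.png )}';

// Group 1 captures the opening quote (or nothing); \1 requires the same closer.
// Group 2 (added for this demo) captures the value itself.
// In a single-quoted PHP string, \\\\ becomes \\ in the pattern: one literal backslash.
$regex = '~url\s*\(\s*("|\'|)((?:[^\\\\]|\\\\.)*?)\1\s*\)~s';

preg_match_all($regex, $css, $m);

// $m[2] lists the values without quotes or surrounding whitespace.
print_r($m[2]);
```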

1.4 Matching the value inside url()

You might have guessed we need to use some lookaround-fu; the problem is with a lookbehind, since it needs to be fixed length. I've got a workaround for that: I'll introduce you to the \K escape sequence. It resets the beginning of the reported match to the current position in the subject. more-info
Well, let's drop \K somewhere in our expression and use a lookahead; our final regex will be:

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)

demo
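
To make the effect of \K concrete, this short sketch compares the reported match with and without it on a single rule. The sample rule and variable names are hypothetical; the url() pattern is used standalone here, so the quote is group 1.

```php
<?php
// Hypothetical one-line rule used only to show the effect of \K.
$css = '.logo { background: url("images/logo.png") no-repeat; }';

// Without \K the overall match includes the url( wrapper and the quotes.
preg_match('~url\s*\(\s*("|\'|)(?:[^\\\\]|\\\\.)*?\1\s*\)~s', $css, $with_wrapper);

// With \K everything matched so far is dropped from the reported match,
// and the lookahead checks the closing quote and ) without consuming them.
preg_match('~url\s*\(\s*("|\'|)\K(?:[^\\\\]|\\\\.)*?(?=\1\s*\))~s', $css, $value_only);

echo $with_wrapper[0], "\n";
echo $value_only[0], "\n";
```

The first line printed is the whole url("...") token, the second only the path inside it, which is what a rewrite step needs.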

1.5 Using the pattern in PHP

We'll need to escape some things like quotes and backslashes (\\\\ in a single-quoted PHP string becomes \\ in the pattern, which matches one literal \), and use the right function and the right modifiers:

$regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match URLs with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';

demo
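
Since the goal in the question is rewriting rather than just listing, the same pattern can be handed to preg_replace_callback(). This is only a sketch: the CDN base URL, the sample CSS and the simple prefixing rule are assumptions, not something the question specifies.

```php
<?php
// Hypothetical CDN base URL and sample CSS; substitute your own values.
$cdn = 'https://cdn.example.com/assets/';
$css = <<<'CSS'
@font-face { font-family: 'Demo'; src: url("fonts/demo.woff") format("woff"); }
.a { background: url('images/a.png') no-repeat; }
.b { background: url(http://other.example.com/b.png); }
CSS;

$regex = '~
@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)   # Skip whole @font-face blocks
|
url\s*\(\s*("|\'|)\K                                  # Reset the match after the opening quote
(?!["\']?(?:https?://|ftp://))                        # Leave absolute URLs alone
(?:[^\\\\]|\\\\.)*?                                   # The URL value, ungreedy
(?=\2\s*\))                                           # Closing quote and ) are not consumed
~xs';

// Because of \K, $m[0] is only the URL value, so only that part is replaced.
$rewritten = preg_replace_callback($regex, function ($m) use ($cdn) {
    return $cdn . $m[0];
}, $css);

echo $rewritten;
```

Only the relative image URL in .a gets the prefix; the font URL and the absolute URL in .b are left as they were.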

Part 2 : matching url's that are within @font-face {}

2.1 Different approach

I want to do this part in 2 regexes because it will be a pain to match URL's that are within @font-face {} while taking care of the state of braces {} in a recursive regex.

And since we already have the pieces we need, we'll only need to apply them in some code:

  1. Match all @font-face {} instances
  2. Loop through these and match all url()'s

2.2 Putting it into code

$results = array(); // Just an empty array;
$fontface_regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
~xs';

$url_regex = '~
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match URLs with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \1            # Match what was matched in group 1
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);

preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
    foreach($fontfaces[0] as $fontface){ // Foreach instance
        preg_match_all($url_regex, $fontface, $r); // Let's match the url's
        if(isset($r[0])){ // If there is a hit
            $results[] = $r[0]; // Then add it to the results array
        }
    }
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results

demo
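
If the font URLs should be rewritten in place rather than collected, the two steps can be nested: an outer preg_replace_callback() over the @font-face blocks and an inner one over the url() values. A rough sketch, assuming a hypothetical font CDN prefix and sample CSS:

```php
<?php
// Hypothetical font CDN base URL and sample CSS; substitute your own values.
$font_cdn = 'https://fonts.cdn.example.com/';
$css = <<<'CSS'
@font-face {
  font-family: 'Demo';
  src: url("fonts/demo.eot?#iefix") format("embedded-opentype"), url(fonts/demo.woff) format("woff");
}
.a { background: url('images/a.png') no-repeat; }
CSS;

// Compact forms of the two patterns used above (no x modifier, no comments).
$fontface_regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';
$url_regex = '~url\s*\(\s*("|\'|)\K(?!["\']?(?:https?://|ftp://))(?:[^\\\\]|\\\\.)*?(?=\1\s*\))~s';

// Outer callback receives one @font-face block; inner callback rewrites each URL in it.
$rewritten = preg_replace_callback($fontface_regex, function ($block) use ($url_regex, $font_cdn) {
    return preg_replace_callback($url_regex, function ($url) use ($font_cdn) {
        return $font_cdn . $url[0];
    }, $block[0]);
}, $css);

echo $rewritten;
```

URLs outside @font-face, such as the background image in .a, are never seen by the inner callback, so fonts and images can be sent to different hosts without looking at file extensions.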
