I don't know of a regex engine that can capture a quantifier. However, with PCRE or Perl you can use some tricks to check whether you have the same number of characters. With your example:
@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
you can check whether @, =, - and / are balanced with this pattern, which uses the famous Qtax trick (are you ready?), the "possessive-optional self-referencing group":
~(?<!@)((?:@(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!@)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~
pattern details:
~ # pattern delimiter
(?<!@) # negative lookbehind used as an @ boundary
( # first capturing group for the @
(?:
@ # one @
(?= # checks that each @ is followed by the same number
# of = - /
[^=]* # all that is not an =
(\2?+=) # The possessive optional self-referencing group:
# capture group 2: backreference to itself + one =
[^-]*(\3?+-) # the same for -
[^/]*(\4?+/) # the same for /
) # close the lookahead
)+ # close the non-capturing group and repeat
) # close the first capturing group
(?!@) # negative lookahead used as an @ boundary too.
# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~
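To make the goal of the pattern concrete, here is a minimal plain-Python sketch of the condition it verifies. It is not a regex engine and not a translation of the pattern: the function names `first_run` and `runs_balanced` are made up for this illustration, and the sketch only looks at the first run of each symbol, in the order @, =, -, /.

```python
def first_run(text, char, start):
    """Return (index, length) of the first run of `char` at or after `start`."""
    index = text.find(char, start)
    if index == -1:
        return -1, 0
    end = index
    while end < len(text) and text[end] == char:
        end += 1
    return index, end - index


def runs_balanced(text, symbols="@=-/"):
    """True if the first run of each symbol, taken in order, has the same length."""
    position = 0
    expected = None
    for char in symbols:
        index, length = first_run(text, char, position)
        if length == 0:
            return False              # the symbol never appears after the previous run
        if expected is None:
            expected = length         # length of the @ run sets the target
        elif length != expected:
            return False              # too few or too many characters in this run
        position = index + length     # continue searching after this run
    return True


example = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'
print(runs_balanced(example))
```

Like the pattern, this sketch rejects a run that is too long as well as one that is too short, which is the role of the boundary lookaheads such as (?!=).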
The main idea
The non-capturing group contains only one @. Each time this group is repeated, a new character is added to capture groups 2, 3 and 4.
The possessive-optional self-referencing group
How does it work?
( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )
At the first occurrence of the @ character, capture group 2 is not yet defined, so you cannot write something like (\2 =), which would make the pattern fail. To avoid this problem, make the backreference optional: \2?
The second aspect of this group is that the number of = characters matched is incremented at each repetition of the non-capturing group, since one = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first, before a new = character is added.
Note that this group can be read like this: if group 2 exists, match it, then match the next =
( (?(2)\2) = )
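The way the capture grows can be simulated outside of any regex engine. The following is only a sketch of the idea, assuming a hypothetical helper named `grow_group`: on each repetition it re-matches the previous capture and then one more character, starting at the first occurrence of that character, which is what the possessive optional backreference enforces.

```python
def grow_group(text, char, repeats):
    """Model a group like (\\2?+=): previous capture first, then one more `char`.

    Returns the successive values the capture group would hold.
    """
    captures = []
    previous = ""                  # group unset on the first pass: the optional backreference matches nothing
    first = text.find(char)        # where [^=]* stops, i.e. the first `char`
    for _ in range(repeats):
        candidate = previous + char
        if first == -1 or not text.startswith(candidate, first):
            break                  # the lookahead fails, so the repetition stops here
        captures.append(candidate)
        previous = candidate       # the next pass must match this capture again
    return captures


print(grow_group('@@@@ "Star Wars" ==== "1977"', "=", 4))
```

If the text holds fewer = characters than there are repetitions, the returned list is shorter than `repeats`, which corresponds to the pattern failing on that @.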
The recursive way
~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~
You need overlapping matches here, since the @ part is used several times. That is why the whole pattern is inside lookarounds.
pattern details:
(?<!@) # left @ boundary
(?= # open a lookahead (to allow overlapped matches)
( # open a capturing group
@
(?> # open an atomic group
[^@=]+ # all that is not an @ or an =, one or more times
| # OR
(?-1) # recursion: the last opened capturing group (here, the current one)
)* # repeat the atomic group zero or more times
= # the closing =
) # close the capture group
(?!=) # checks the = boundary
) # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-)) # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/)) # the same for /
The main difference from the previous pattern is that this one doesn't care about the order of the =, - and / groups. (However, you can easily change the first pattern to deal with that, using character classes and negative lookaheads.)
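The recursive pattern treats @ as an opening symbol and =, - or / as the matching closing symbol, much like nested parentheses. As a rough illustration only (plain Python with a depth counter, not recursion in a regex engine), the same check can be sketched as follows. The names `nested_balanced` and `balanced_any_order` are hypothetical.

```python
def nested_balanced(text, start, opener, closer):
    """Model (@(?>[^@=]+|(?-1))*=)(?!=) for one opener/closer pair."""
    if start >= len(text) or text[start] != opener:
        return False
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == opener:
            depth += 1                      # entering one more nesting level
        elif ch == closer:
            depth -= 1                      # leaving one level
            if depth == 0:
                # outermost level closed: the closer run must end here
                return text[i + 1:i + 2] != closer
    return False                            # ran out of text before closing every opener


def balanced_any_order(text, start=0):
    """Apply the check to =, - and / independently, each time from the same @."""
    if start > 0 and text[start - 1] == "@":
        return False                        # left @ boundary, like (?<!@)
    return all(nested_balanced(text, start, "@", closer) for closer in "=-/")


example = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'
print(balanced_any_order(example))
print(balanced_any_order('@@ // -- =='))
```

Because each closing symbol is checked separately from the same starting @, the order in which the =, - and / runs appear does not matter in this sketch.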
Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful points out).