This is an excellent question, and it took me a while to see the point of the lazy `??` quantifier myself.
? - Optional (greedy) quantifier
The usefulness of `?` is easy enough to understand. If you wanted to find both `http` and `https`, you could use a pattern like this:

`https?`

This pattern will match both inputs, because it makes the `s` optional.
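
As a quick check, here is a minimal sketch using Python's `re` module (my choice of language; any backtracking engine should behave the same way). The `httpx` input is an extra negative case I added for contrast.

```python
import re

# Greedy optional: the "s" is matched when present and skipped when absent.
pattern = re.compile(r"https?")

# fullmatch requires the whole input to satisfy the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in ("http", "https", "httpx")}
print(results)  # {'http': True, 'https': True, 'httpx': False}
```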
?? - Optional (lazy) quantifier
`??` is more subtle. It usually does the same thing `?` does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on `?` vs. `??` (or `*` vs. `*?`, or `+` vs. `+?`).
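
The smallest case I can think of that shows this is a single `a` that either of two optional groups could claim. This is a hypothetical toy pattern, not one from the question, sketched in Python's `re` module:

```python
import re

text = "a"

# Both patterns accept the input; they only disagree on who gets the "a".
greedy = re.fullmatch(r"(a?)(a?)", text)
lazy = re.fullmatch(r"(a??)(a?)", text)

print(bool(greedy), greedy.groups())  # True ('a', '')
print(bool(lazy), lazy.groups())      # True ('', 'a')
```

The pass/fail answer is identical; only the grouping changes.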
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:

| Pass/Fail | Group 1 | Group 2 |
|-----------|---------|---------|
| Pass | `http` | `123` |
| Pass | `https` | `456` |
| Pass | `http` | `something` |
You try the first thing that comes to mind, which is this:

`^(http)([a-z\d]+)$`

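| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|-----------|---------|---------|--------------------|
| Pass | `http` | `123` | Yes |
| Pass | `http` | `s456` | No |
| Pass | `http` | `something` | Yes |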

They all pass, but you can't use the second set of results because you only wanted `456` in Group 2.
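
To reproduce that first attempt, here is a rough sketch in Python's `re` module. It assumes the character class is `[a-z\d]` (lowercase letters and digits); the variable names are mine.

```python
import re

inputs = ["http123", "https456", "httpsomething"]

# Group 1 is fixed to "http", so any "s" always lands in Group 2.
first_attempt = re.compile(r"^(http)([a-z\d]+)$")

first_groups = {s: first_attempt.match(s).groups() for s in inputs}
for s, groups in first_groups.items():
    print(s, groups)
# http123 ('http', '123')
# https456 ('http', 's456')
# httpsomething ('http', 'something')
```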
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:

`^(https?)([a-z]+|\d+)$`

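| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|-----------|---------|---------|--------------------|
| Pass | `http` | `123` | Yes |
| Pass | `https` | `456` | Yes |
| Pass | `https` | `omething` | No |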

Now the second input is fine, but the third one is grouped wrong because `?` is greedy by default (the `+` is too, but the `?` came first). When deciding whether the `s` is part of `https?` or `[a-z]+|\d+`, if the result is a pass either way, the regex engine gives it to the one on the left, because the greedy `?` gets first claim. So Group 2 loses `s` because Group 1 sucked it up.
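
The greedy behavior can be seen with a short Python sketch. I'm assuming the digit class is `\d` and that the pattern is anchored with `^` and `$` like the first attempt; the names are illustrative.

```python
import re

inputs = ["http123", "https456", "httpsomething"]

# Greedy "s?": Group 1 takes the "s" whenever the overall match still succeeds.
second_attempt = re.compile(r"^(https?)([a-z]+|\d+)$")

second_groups = {s: second_attempt.match(s).groups() for s in inputs}
for s, groups in second_groups.items():
    print(s, groups)
# http123 ('http', '123')
# https456 ('https', '456')
# httpsomething ('https', 'omething')
```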
To fix this, you make one tiny change:

`^(https??)([a-z]+|\d+)$`

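| Pass/Fail | Group 1 | Group 2 | Grouped correctly? |
|-----------|---------|---------|--------------------|
| Pass | `http` | `123` | Yes |
| Pass | `https` | `456` | Yes |
| Pass | `http` | `something` | Yes |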

Essentially, this means: "Match `https` if you have to, but see if this still passes when Group 1 is just `http`." The engine realizes that the `s` could work as part of `[a-z]+|\d+`, so it prefers to put it into Group 2.
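
Here is a sketch of the lazy version in Python's `re` module, again assuming `\d` for digits and a leading `^`. The last two lines are my own addition: they show that the trailing `$` is what creates the "if you have to" case for `https456`.

```python
import re

inputs = ["http123", "https456", "httpsomething"]

# Lazy "s??": Group 1 tries to stop at "http" and only takes the "s"
# when the rest of the pattern cannot match otherwise.
lazy = re.compile(r"^(https??)([a-z]+|\d+)$")

lazy_groups = {s: lazy.match(s).groups() for s in inputs}
for s, groups in lazy_groups.items():
    print(s, groups)
# http123 ('http', '123')
# https456 ('https', '456')
# httpsomething ('http', 'something')

# Without the trailing $, nothing forces Group 1 to take the "s".
unanchored = re.match(r"^(https??)([a-z]+|\d+)", "https456").groups()
print(unanchored)  # ('http', 's')
```

For `https456`, leaving the `s` to Group 2 would match only `s` and then fail at `$`, so the engine backtracks and lets Group 1 take it.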