Regex101 uses the PCRE library (by default), so its debugger is tied to PCRE's behavior. The PCRE library supports an auto-callout option through its PCRE_AUTO_CALLOUT
flag, which invokes a callback function at every step of matching. I'm 99.99% sure that's how regex101's debugger works. Each callout invocation registers as a step in the debugger.
Testing with regex101
Now, take a look at the different steps involved, first, without the u
option:
Notice how the text cursor jumps from the Start Game
part directly to the Stake
part.
What happens when you add u
?
Notice the jump doesn't occur anymore, and this is where the additional steps come from.
What does the u
option do? It sets PCRE_UTF8 | PCRE_UCP
- yes, both of those at once, and this PCRE_UCP
flag is important here.
Here's what the pcreunicode docs say:
Case-insensitive matching applies only to characters whose values are less than 128, unless PCRE is built with Unicode property support. A few Unicode characters such as Greek sigma have more than two codepoints that are case-equivalent. Up to and including PCRE release 8.31, only one-to-one case mappings were supported, but later releases (with Unicode property support) do treat as case-equivalent all versions of characters such as Greek sigma.
Because of the additional complexity in handling case insensitivity in Unicode mode, the engine cannot just skip all of the text. To verify this is the culprit, let's try this with the gmu
flags (that is, with u
but without i
):
The optimization is applied even with u
, and this pretty much confirms the hypothesis.
Observations
The slow path you're seeing looks just like if PCRE_NO_START_OPTIMIZE
had been used (and this is probably part of what regex101's disable internal engine optimizations does, along with PCRE_NO_AUTO_POSSESS
, which is not relevant here).
More about it from the docs:
There are a number of optimizations that pcre_exec()
uses at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific character, it searches the subject for that character, and fails immediately if it cannot find it, without actually running the main matching function.
For some reason (which will become apparent later), PCRE fails to register an s
or k
letter as a required starting character, and cannot use the anchoring optimization. All the other letters of the latin alphabet work just fine in this regard.
This is why Stake
requires more steps than winnings
: the engine just doesn't skip the checks.
Testing with PCRE directly
Here's a test program I used with PCRE 8.38, which is the latest PCRE1 version available as of this post.
#include <stdio.h>
#include <string.h>
#include <pcre.h>
static int calloutCount;
static int callout_handler(pcre_callout_block *c) {
++calloutCount;
return 0;
}
static void test_run(const char* pattern, pcre* re, pcre_extra* extra) {
int rc, startOffset;
int ovector[3 * 3];
pcre_callout = callout_handler;
calloutCount = 0;
startOffset = 0;
const char *subject = "Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...";
for (;;) {
rc = pcre_exec(re, extra, subject, strlen(subject), startOffset, 0, ovector, sizeof(ovector) / sizeof(int));
if (rc < 0)
break;
startOffset = ovector[1];
}
printf("%-30s %s => %i
", pattern, extra ? "(studied) " : "(not studied)", calloutCount);
}
static void test(const char* pattern) {
pcre *re;
const char *error;
int erroffset;
pcre_extra *extra;
re = pcre_compile(pattern, PCRE_AUTO_CALLOUT | PCRE_CASELESS | PCRE_MULTILINE | PCRE_UTF8 | PCRE_UCP, &error, &erroffset, 0);
if (re == 0)
return;
extra = pcre_study(re, 0, &error);
test_run(pattern, re, extra);
if (extra)
test_run(pattern, re, NULL);
pcre_free_study(extra);
pcre_free(re);
}
int main(int argc, char **argv) {
printf("PCRE version: %s
", pcre_version());
test("(Stake: £)(\d+(?:\.\d+)?)");
test("(winnings: £)(\d+(?:\.\d+)?)");
return 0;
}
The output I got is the following:
PCRE version: 8.38 2015-11-23
(Stake: £)(d+(?:.d+)?) (studied) => 40
(Stake: £)(d+(?:.d+)?) (not studied) => 302
(winnings: £)(d+(?:.d+)?) (studied) => 21
(winnings: £)(d+(?:.d+)?) (not studied) => 21
Here, we can see that studying the pattern makes a difference in the first case, but not in the second one.
Studying a pattern means the following:
Studying a pattern does two things: first, a lower bound for the length of subject string that is needed to match the pattern is computed. This does not mean that there are any strings of that length that match, but it does guarantee that no shorter strings match. The value is used to avoid wasting time by trying to match strings that are shorter than the lower bound. You can find out the value in a calling program via the pcre_fullinfo()
function.
Studying a pattern is also useful for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting bytes is created. This speeds up finding a position in the subject at which to start matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256. In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
From the results and the docs description, you can conclude that PCRE considers the S
character is not an anchoring character, and in caseless Unicode mode, it requires a bitmap to be created. The bitmap allows the optimization to be applied.
Now, here's the PCRE2 version, compiled against PCRE2 v10.21, which is the latest release as of this post. The results will be unsurprising as PCRE2 always studies the patterns you supply it, no questions asked.
#include <stdio.h>
#include <string.h>
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
static int callout_handler(pcre2_callout_block *c, void *data) {
++*((int*)data);
return 0;
}
static void test(const char* pattern) {
pcre2_code *re;
int error;
PCRE2_SIZE erroffset;
pcre2_match_context *match_context;
pcre2_match_data *match_data;
int rc, startOffset = 0;
int calloutCount = 0;
PCRE2_SIZE *ovector;
const PCRE2_SPTR subject = (PCRE2_SPTR)"Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...";
re = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED, PCRE2_AUTO_CALLOUT | PCRE2_CASELESS | PCRE2_MULTILINE | PCRE2_UTF | PCRE2_UCP, &error, &erroffset, 0);
if (re == 0)
return;
match_context = pcre2_match_context_create(0);
pcre2_set_callout(match_context, callout_handler, &calloutCount);
match_data = pcre2_match_data_create_from_pattern(re, 0);
ovector = pcre2_get_ovector_pointer(match_data);
for (;;) {
rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, startOffset, 0, match_data, match_context);
if (rc < 0)
break;
startOffset = ovector[1];
}
printf("%-30s => %i
", pattern, calloutCount);
pcre2_match_context_free(match_context);
pcre2_match_data_free(match_data);
pcre2_code_free(re);
}
int main(int argc, char **argv) {
char version[256];
pcre2_config(PCRE2_CONFIG_VERSION, &version);
printf("PCRE version: %s
", version);
test("(Stake: £)(\d+(?:\.\d+)?)");
test("(winnings: £)(\d+(?:\.\d+)?)");
return 0;
}
And here's the result:
PCRE version: 10.21 2016-01-12
(Stake: £)(d+(?:.d+)?) => 40
(winnings: £)(d+(?:.d+)?) => 21
Yup. Same results as in PCRE1 when studying was used.
Implementation details
Let's take a look at the implementation details, shall we? PCRE compiles a pattern to opcodes, which we can read with the pcretest
program.
Here's the input file:
/(Stake: £)(d+(?:.d+)?)/iDW8
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...
/(winnings: £)(d+(?:.d+)?)/iDW8
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...
Its format is simple: A pattern in delimiters followed by options, and one or more input strings.
The options are:
i
- PCRE_CASELESS
D
- turn on debug mode: show the opcodes and pattern information
W
- PCRE_UCP
8
- PCRE_UTF8
And the result is...
PCRE version 8.38 2015-11-23
/(Stake: £)(d+(?:.d+)?)/iDW8
------------------------------------------------------------------
0 55 Bra
3 24 CBra 1
8 clist 0053 0073 017f
11 /i ta
15 clist 004b 006b 212a
18 /i e: x{a3}
27 24 Ket
30 22 CBra 2
35 prop Nd ++
39 Brazero
40 9 Bra
43 /i .
45 prop Nd ++
49 9 Ket
52 22 Ket
55 55 Ket
58 End
------------------------------------------------------------------
Capturing subpattern count = 2
Options: caseless utf ucp
No first char
Need char = ' '
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...
0: Stake: x{a3}2.00
1: Stake: x{a3}
2: 2.00
/(winnings: £)(d+(?:.d+)?)/iDW8
------------------------------------------------------------------
0 60 Bra
3 29 CBra 1
8 /i winning
22 clist 0053 0073 017f
25 /i : x{a3}
32 29 Ket
35 22 CBra 2
40 prop Nd +