Imagine, if you can, a language that lacked an else
statement, but in which you wanted to emulate one. Instead of writing
if (condition) { yes part }
else { no part }
You would have to write
if (condition) { yes part }
if (!condition) { no part }
Well, that's what you have to do here, but in the pattern. Because Java has no conditionals, you repeat the condition, negated, in the ELSE branch, which is actually an OR branch.
So for example, instead of writing this, as you could in a language like Perl that supports conditionals in the pattern:
# definition of \b using a conditional in the pattern like Perl
#
(?(?<= \w)      # if there is a word character to the left
      (?! \w)   #     then there must be no word character to the right
  |   (?= \w)   # else there must be a word character to the right
)
You must in Java write:
# definition of \b using a duplicated condition like Java
#
(?:  (?<= \w)   # if there is a word character to the left
     (?!  \w)   #     then there must be no word character to the right
  |             # ...otherwise...
     (?<! \w)   # if there is no word character to the left
     (?=  \w)   #     then there must be a word character to the right
)
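For what it's worth, here is a minimal sketch of how that duplicated-condition pattern might be dropped into Java as written. The class name `WordBoundaryDemo` and the helper `positions` are made up for illustration; the only assumption is that `Pattern.COMMENTS` is used so the whitespace and `#` comments can stay in the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // \b spelled out with the condition repeated and negated in the second
    // branch, because Java has no (?(condition)then|else) construct.
    public static final Pattern BOUNDARY = Pattern.compile(
        "(?:  (?<= \\w)   # if there is a word character to the left\n" +
        "     (?!  \\w)   #     then there must be no word character to the right\n" +
        "  |              # ...otherwise...\n" +
        "     (?<! \\w)   # if there is no word character to the left\n" +
        "     (?=  \\w)   #     then there must be a word character to the right\n" +
        ")",
        Pattern.COMMENTS);

    // Hypothetical helper: collect every offset where a zero-width pattern matches.
    public static List<Integer> positions(Pattern p, String s) {
        List<Integer> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // Boundaries fall before 'a', after 'b', before 'c', and after 'd'.
        System.out.println(positions(BOUNDARY, "ab cd"));  // prints [0, 2, 3, 5]
    }
}
```

Note that the backslashes have to be doubled inside a Java string literal, which is one more reason to keep the pattern laid out one piece per line.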
You may recognize that as being the definition of \b. Here, similarly, is the definition of \B, first using conditionals:
# definition of \B using a conditional in the pattern like Perl
#
(?(?<= \w)      # if there is a word character to the left
      (?= \w)   #     then there must be a word character to the right
  |   (?! \w)   # else there must be no word character to the right
)
And now by repeating the (now negated) condition in the OR branch:
# definition of \B using a duplicated condition like Java
#
(?:  (?<= \w)   # if there is a word character to the left
     (?=  \w)   #     then there must be a word character to the right
  |             # ...otherwise...
     (?<! \w)   # if there is no word character to the left
     (?!  \w)   #     then there must be no word character to the right
)
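As a rough sanity check, the sketch below classifies every offset of a string as either a boundary or a non-boundary using the two duplicated-condition patterns, and complains if an offset is somehow both or neither. The names `NonBoundaryDemo`, `matchesAt`, and `classify` are invented for this example, and the patterns are the compact forms of the commented ones above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonBoundaryDemo {
    // \b: word character on exactly one side.
    public static final Pattern BOUNDARY = Pattern.compile(
        "(?: (?<=\\w)(?!\\w) | (?<!\\w)(?=\\w) )", Pattern.COMMENTS);

    // \B: word characters on both sides, or on neither.
    public static final Pattern NON_BOUNDARY = Pattern.compile(
        "(?: (?<=\\w)(?=\\w) | (?<!\\w)(?!\\w) )", Pattern.COMMENTS);

    // True if the zero-width pattern p matches exactly at offset i of s.
    public static boolean matchesAt(Pattern p, String s, int i) {
        Matcher m = p.matcher(s);
        return m.find(i) && m.start() == i;
    }

    // Returns one letter per offset 0..s.length(): 'b' for boundary, 'B' for non-boundary.
    public static String classify(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= s.length(); i++) {
            boolean b = matchesAt(BOUNDARY, s, i);
            boolean nb = matchesAt(NON_BOUNDARY, s, i);
            if (b == nb) {
                throw new IllegalStateException("offset " + i + " is both or neither");
            }
            sb.append(b ? 'b' : 'B');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(classify("ab cd"));  // prints bBbbBb
    }
}
```

Every offset lands in exactly one of the two sets, which is what you would expect of a pair of patterns that differ only in which lookaheads are negated.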
Notice how, no matter how you roll them, the respective definitions of \b and \B alike rest solely on the definition of \w, never on \W, let alone on \s.
Being able to use conditionals not only saves typing, it also reduces the chance of doing it wrong. There may also be occasions where you would rather not evaluate the condition twice.
Here I make use of that to define several regex subroutines that provide me with a Greeklish atom and boundaries for the same:
(?(DEFINE)
    (?<greeklish>            [\p{Greek}\p{Inherited}]   )
    (?<ungreeklish>          [^\p{Greek}\p{Inherited}]  )
    (?<greek_boundary>
        (?(?<=      (?&greeklish))
              (?!   (?&greeklish))
          |   (?=   (?&greeklish))
        )
    )
    (?<greek_nonboundary>
        (?(?<=      (?&greeklish))
              (?=   (?&greeklish))
          |   (?!   (?&greeklish))
        )
    )
)
Notice how the boundary and nonboundary use only (?&greeklish), never (?&ungreeklish)? You don't ever need the non-whatever just to do boundaries. You put the not into your lookarounds instead, just as \b and \B both do.
Although in Perl it's probably easier (albeit less general) just to define a new, custom property, \p{IsGreeklish} (and hence its complement \P{IsGreeklish}):
sub IsGreeklish {
return <<'END';
+utf8::IsGreek
+utf8::IsInherited
END
}
You won't be able to translate either of those into Java, though. That is not so much because Java lacks conditionals as because its pattern language doesn't allow (DEFINE) blocks or regex subroutine calls like (?&greeklish). Indeed, your patterns cannot even recurse in Java. Nor can you define custom properties like \p{IsGreeklish} in Java.
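The closest approximation I can think of in Java is to give up on subroutines inside the pattern and glue the pieces together with ordinary string concatenation before compiling. The sketch below does that; `GREEKLISH`, `boundaryFor`, and `nonBoundaryFor` are hypothetical names of my own, and it assumes Java 7 or later for the `\p{script=...}` syntax. It is only textual substitution, so it gives you nothing like recursion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreeklishBoundary {
    // Stand-in for the (?<greeklish> ...) subroutine: just a string holding the class.
    public static final String GREEKLISH = "[\\p{script=Greek}\\p{script=Inherited}]";

    // Hypothetical helper: boundary for any single-character atom, with the
    // negated condition repeated in the second branch.
    public static String boundaryFor(String atom) {
        return "(?:(?<=" + atom + ")(?!" + atom + ")"
             + "|(?<!" + atom + ")(?=" + atom + "))";
    }

    // Hypothetical helper: the matching non-boundary.
    public static String nonBoundaryFor(String atom) {
        return "(?:(?<=" + atom + ")(?=" + atom + ")"
             + "|(?<!" + atom + ")(?!" + atom + "))";
    }

    // Collect every offset where the zero-width pattern matches.
    public static List<Integer> positions(String regex, String s) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(s);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // alpha, combining acute accent (script Inherited), beta, space, "xy"
        String text = "\u03B1\u0301\u03B2 xy";
        System.out.println(positions(boundaryFor(GREEKLISH), text));     // prints [0, 3]
        System.out.println(positions(nonBoundaryFor(GREEKLISH), text));  // prints [1, 2, 4, 5, 6]
    }
}
```

As in the Perl version, only the positive class is ever needed; the negation lives entirely in the lookarounds.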
And of course conditionals in Perl regexes can be more than lookarounds: they can even be code blocks to execute, which is why you certainly don't want to be forced to evaluate the same condition twice, lest it have side effects. That doesn't apply to Java, because it can't do that. You can't intermix pattern and code, which limits you more than you might think until you get in the habit of doing so.
There are a great many things you can do with the Perl regex engine that you can do in no other language, and this is just a sample. It's no wonder that the greatly expanded Regexes chapter in the new 4th edition of Programming Perl, together with the completely rewritten Unicode chapter that now immediately follows it (having been promoted into part of the inner core), has a combined page count of something like 130 pages, double the length of the old chapter on pattern matching from the 3rd edition.
What you’ve just seen above is part of the new 4th edition, which should be in print next month or so.