What follows is a perl function I wrote years ago. It is a smart tokenizer that recognizes some instances of things being stuck together that maybe shouldn't be. For example, given the input on the left, it divides the string as shown on the right:
abc123 -> abc|123
abcABC -> abc|ABC
ABC123 -> ABC|123
123abc -> 123|abc
123ABC -> 123|ABC
AbcDef -> Abc|Def (e.g. CamelCase)
ABCDef -> ABC|Def
1stabc -> 1st|abc (recognize valid ordinals)
1ndabc -> 1|ndabc (but not invalid ordinals)
11thabc -> 11th|abc (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc
I'm now doing some machine learning experimentation, and I'd like to do some experiments that use this tokenizer. But first, I will need to port it from Perl to Python. The key to this code is the loop that uses the G anchor, something which I hear does not exist in python. I've tried googling for how this is done in Python, but I am not sure what exactly to search for, so I'm having trouble finding an answer.
How would you write this function in Python?
sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters,
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
my $value = shift;
my @list = ();
my $word;
while ( $value ne '' && $value =~ m/
G # start where previous left off
([^a-zA-Z0-9]*) # capture non-alpha-numeric characters, if any
([a-zA-Z0-9]*?) # capture everything up to a token boundary
(?: # identify the token boundary
(?=[^a-zA-Z0-9]) # next character is not a word character
| (?=[A-Z][a-z]) # Next two characters are upper lower
| (?<=[a-z])(?=[A-Z]) # lower followed by upper
| (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
# ordinal boundaries
| (?<=^1(?i:st)) # first
| (?<=[^1][1](?i:st)) # first but not 11th
| (?<=^2(?i:nd)) # second
| (?<=[^1]2(?i:nd)) # second but not 12th
| (?<=^3(?i:rd)) # third
| (?<=[^1]3(?i:rd)) # third but not 13th
| (?<=1[123](?i:th)) # 11th - 13th
| (?<=[04-9](?i:th)) # other ordinals
# non-ordinal digit-letter boundaries
| (?<=^1)(?=[a-zA-Z])(?!(?i)st) # digit-letter but not first
| (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st) # digit-letter but not 11th
| (?<=^2)(?=[a-zA-Z])(?!(?i)nd) # digit-letter but not first
| (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd) # digit-letter but not 12th
| (?<=^3)(?=[a-zA-Z])(?!(?i)rd) # digit-letter but not first
| (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd) # digit-letter but not 13th
| (?<=1[123])(?=[a-zA-Z])(?!(?i)th) # digit-letter but not 11th - 13th
| (?<=[04-9])(?=[a-zA-Z])(?!(?i)th) # digit-letter but not ordinal
| (?=$) # end of string
)
/xg )
{
push @list, $1 if $1 ne '';
push @list, $2 if $2 ne '';
}
return @list;
}
I did try using re.split() with a variation on the above. However, split() refuses to split on a zero-width match (an ability that should be possible if one really knows what one is doing).
I did come up with a solution to this specific problem, but not to the general problem of "how do I use G based parsing" - I have some sample code that does regexes within loops that are anchored using G and then in the body it uses another match anchored at G to see which way to proceed with the parse. So I'm still looking for an answer.
That said, here is my final working code for translating the above to Python:
import re
IsA = lambda s: '[' + s + ']'
IsNotA = lambda s: '[^' + s + ']'
Upper = IsA( 'A-Z' )
Lower = IsA( 'a-z' )
Letter = IsA( 'a-zA-Z' )
Digit = IsA( '0-9' )
AlphaNumeric = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric = IsNotA( 'a-zA-Z0-9' )
EndOfString = '$'
OR = '|'
ZeroOrMore = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore = lambda s: s + '+'
OneOrMoreNonGreedy = lambda s: s + '+?'
StartsWith = lambda s: '^' + s
Capture = lambda s: '(' + s + ')'
PreceededBy = lambda s: '(?<=' + s + ')'
FollowedBy = lambda s: '(?=' + s + ')'
NotFollowedBy = lambda s: '(?!' + s + ')'
StopWhen = lambda s: s
CaseInsensitive = lambda s: '(?i:' + s + ')'
ST = '(?:st|ST)'
ND = '(?:nd|ND)'
RD = '(?:rd|RD)'
TH = '(?:th|TH)'
def OneOf( *args ):
return '(?:' + '|'.join( args ) + ')'
pattern = '(.+?)' +
OneOf(
# ABC | !!! - break at whitespace or non-alpha-numeric boundary
PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),
# ABC | Abc - break at what looks like the start of a word or sentence
FollowedBy( Upper + Lower ),
# abc | ABC - break when a lower-case letter is followed by an upper case
PreceededBy( Lower ) + FollowedBy( Upper ),
# abc | 123 - break between words and digits
PreceededBy( Letter ) + FollowedBy( Digit ),
# 1st | oak - recognize when the string starts with an ordinal
PreceededBy( StartsWith( '1' + ST ) ),
PreceededBy( StartsWith( '2' + ND ) ),
PreceededBy( StartsWith( '3' + RD ) ),
# 1st | abc - contains an ordinal
PreceededBy( IsNotA( '1' ) + '1' + ST ),
PreceededBy( IsNotA( '1' ) + '2' + ND ),
PreceededBy( IsNotA( '1' ) + '3' + RD ),
PreceededBy( '1' + IsA( '123' ) + TH ),
PreceededBy( IsA( '04-9' ) + TH ),
# 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
PreceededBy( '1' + IsA( '123' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),
PreceededBy( IsA( '04-9' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),
# abcde | $ - end of the string
FollowedBy( EndOfString )
)
matcher = re.compile( pattern )
def tokenize( s ):
return matcher.findall( s )
See Question&Answers more detail:
os