regex - Should we consider using range [a-z] as a bug?


posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)


In my locale (et_EE) [a-z] means:

abcdefghijklmnopqrsšz

So 6 ASCII chars (tuvwxy) and one from the Estonian alphabet (ž) are not included. I see a lot of modules that still use regexes like

/\A[0-9A-Z_a-z]+\z/

To me this seems the wrong way to define the range of ASCII alphanumeric chars, and I think it should be replaced with:

/\A\p{PosixAlnum}+\z/

Is the first one still considered the idiomatic way? Or an accepted solution? Or a bug?

Or does the second one have some caveats?


1 Reply

深蓝 · Answer 1 · 2021-10-23T17:50:07+0000

Back in the old Perl 3.0 days, everything was ASCII, and Perl reflected that. \w meant the same thing as [0-9A-Z_a-z]. And we liked it!

However, Perl is no longer bound to ASCII. I stopped using [a-z] a while ago because I got yelled at when programs I wrote didn't work with languages other than English. You can imagine my surprise, as an American, on discovering that there are at least several thousand people in this world who don't speak English.

Perl has better ways of handling [0-9A-Z_a-z] anyway. You can use the [[:alnum:]] set (which, note, does not include the underscore) or simply use \w, which should do the right thing. If you need only lowercase characters, you can use [[:lower:]] instead of [a-z], which assumes an English-like alphabet. (Perl goes to some lengths to make [a-z] mean just the 26 characters a, b, c, ... z, even on EBCDIC platforms.)
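To make the difference concrete, here is a minimal sketch. The sample character and the helper name full_match are my own choices for illustration, not anything from the question. It assumes a decoded character string, without 'use locale' in scope:

```perl
use strict;
use warnings;

# Hypothetical sample: "š" (U+0161), written as an escape so that the
# string is a decoded character string rather than raw bytes.
my $s_caron = "\x{161}";

# Returns 1 if the whole string matches the pattern, 0 otherwise.
sub full_match {
    my ($string, $pattern) = @_;
    return $string =~ /\A(?:$pattern)\z/ ? 1 : 0;
}

my %result = (
    range => full_match($s_caron, qr/[a-z]/),        # code points a..z only
    lower => full_match($s_caron, qr/[[:lower:]]/),  # any lowercase letter
    alnum => full_match($s_caron, qr/[[:alnum:]]/),  # any letter or digit
    word  => full_match($s_caron, qr/\w/),           # word characters, incl. "_"
);

printf "%-5s => %d\n", $_, $result{$_} for sort keys %result;
```

Only the range entry should come out as 0. The same helper also shows the underscore difference: full_match("_", qr/\w/) is true, while full_match("_", qr/[[:alnum:]]/) is not.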

If you need ASCII only, you can add the /a modifier. If you mean locale-specific matching, compile the regular expression within the lexical scope of a 'use locale'. (Avoid the /l modifier, as it applies only to the regular expression pattern and nothing else. For example, in 's/[[:lower:]]/\U$&/lg' the pattern is compiled using the locale, but the \U is not. This should probably be considered a bug in Perl, but it is the way things currently work. The /l modifier is really only intended for internal bookkeeping and should not be typed in directly.) Better still, translate your locale data into Unicode on input, use Unicode internally, and translate it back on output. If your locale is one of the newer UTF-8 ones, the 'use locale ":not_characters"' feature added in 5.16 lets the other portions of your locale work seamlessly in Perl.

$word =~ /^[[:alnum:]]+$/;    # $word contains only Posix alphanumeric characters.
$word =~ /^[[:alnum:]]+$/a;   # $word contains only ASCII alphanumeric characters.
{
    use locale;
    $word =~ /^[[:alnum:]]+$/;    # $word contains only alphanum characters for your locale
}
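The advice about using Unicode internally could look roughly like the sketch below. It assumes the input arrives as UTF-8 bytes and uses the core Encode module; the sample word and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "šokk", as they might arrive
# from a file or a socket.
my $input_bytes = "\xC5\xA1okk";

# Decode at the boundary: 5 bytes become 4 characters.
my $text = decode('UTF-8', $input_bytes);

# Work on characters internally; [[:lower:]] and \U both follow Unicode rules here.
(my $shouted = $text) =~ s/([[:lower:]])/\U$1/g;

# Encode again at the output boundary.
my $output_bytes = encode('UTF-8', $shouted);

printf "%d bytes in, %d characters inside, %d bytes out\n",
    length $input_bytes, length $text, length $output_bytes;
```

Because the string is decoded before the substitution runs, the pattern and the case conversion agree on what a lowercase letter is, which is exactly what the /l example does not guarantee.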

Now, is this a bug? If the program doesn't work as intended, it is a bug, plain and simple. If you really want the ASCII sequence [a-z], you should use [[:lower:]] with the /a modifier. If you want all possible lowercase characters, including those in other languages, simply use [[:lower:]].
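As a rough illustration of those two choices, plus the \p{PosixAlnum} form from the question, here is a small sketch. The sub names are made up for this example, and /a requires Perl 5.14 or later:

```perl
use strict;
use warnings;
use 5.014;   # the /a modifier needs Perl 5.14 or later

# ASCII lowercase only: /a restricts the POSIX class to the ASCII range.
sub is_ascii_lower { return $_[0] =~ /\A[[:lower:]]+\z/a ? 1 : 0 }

# Any lowercase letter, including non-English ones in a decoded string.
sub is_any_lower { return $_[0] =~ /\A[[:lower:]]+\z/ ? 1 : 0 }

# \p{PosixAlnum} covers only ASCII letters and digits, with or without /a.
sub is_posix_alnum { return $_[0] =~ /\A\p{PosixAlnum}+\z/ ? 1 : 0 }

# Hypothetical test words: "abc" and "šabc" (U+0161 followed by "abc").
for my $word ("abc", "\x{161}abc") {
    printf "%d chars: ascii_lower=%d any_lower=%d posix_alnum=%d\n",
        length $word, is_ascii_lower($word), is_any_lower($word),
        is_posix_alnum($word);
}
```

Which one is right depends on intent: the first and third express "ASCII only" explicitly, so a reader does not have to guess whether [a-z] was meant literally.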
