Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - preg_match and (non-English) Latin characters?

I have an XHTML form where I ask people to enter their full name. I then match that with preg_match() using this pattern: /^[\p{L}\s]+$/

On my local server running PHP 5.2.13 (PCRE 7.9 2009-04-11) this works fine. On the webhost running PHP 5.2.10 (PCRE 7.3 2007-08-28) it doesn't match when the entered string contains the Danish Latin character ø ( http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%F8&mode=char ).

Is this a bug? Is there a workaround?

Thank you in advance!


1 Reply


So, the problem is as presumed: you are not using the /u modifier. Without it, PCRE treats the pattern and the subject as sequences of single bytes and does not decode UTF-8 characters.

In any case, this is how it should be done:

var_dump(preg_match('/^[\p{L}\s]+$/u', "ø"));

This works on all my versions. There might be a bug in others, but that's not likely here.

Your problem is that this also works:

var_dump(preg_match('/^[\p{L}\s]+$/', utf8_decode("ø")));

Notice that this uses ISO-8859-1 instead of UTF-8, and leaves out the /u modifier. The result is int(1). Evidently PCRE interprets the Latin-1 ø as matching \p{L} when not in /u (Unicode) mode. (Most of the single bytes \xA0-\xFF are letter symbols in Latin-1, and the 8-bit code point is the same as in Unicode, so that's actually ok.)
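
To check which encoding a submitted value really has, it can help to look at its raw bytes. The following is only a rough sketch: describe_bytes() is a hypothetical helper name, and the sample strings are written as byte escapes so the result does not depend on how the source file itself is saved. It assumes a reasonably current PHP version, where preg_match() returns false for a subject that is not valid UTF-8.

```php
<?php
// Hypothetical diagnostic helper (not part of PHP): returns the hex bytes of
// a value plus whether PCRE accepts it as UTF-8.
function describe_bytes($value)
{
    // An empty pattern with /u only matches if the subject is valid UTF-8.
    $isUtf8 = preg_match('//u', $value) === 1;
    return bin2hex($value) . ' ' . ($isUtf8 ? 'valid UTF-8' : 'not valid UTF-8');
}

echo describe_bytes("\xC3\xB8"), "\n"; // ø encoded as UTF-8 (two bytes)
echo describe_bytes("\xF8"), "\n";     // ø encoded as ISO-8859-1 (one byte)

// With /u, a subject that is not valid UTF-8 makes preg_match() fail with an
// error instead of reporting "no match"; preg_last_error() gives the reason.
$result = preg_match('/^[\p{L}\s]+$/u', "\xF8");
var_dump($result === false);
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

If the form data shows up as the single byte f8, the page or form is being submitted as ISO-8859-1 rather than UTF-8.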

Conclusion: Your input is actually ISO-8859-1. That's why it accidentally worked for you without the /u. Change that, and be exact about input charsets.
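
One way to be exact about the charset is to normalize the input to UTF-8 before matching with /u. This is a minimal sketch under some assumptions, not a definitive fix: is_valid_name() is a hypothetical helper, the iconv extension is assumed to be available, and any input that is not valid UTF-8 is assumed to be ISO-8859-1 (a guess that fits the situation here but not every form).

```php
<?php
// Hypothetical helper: normalize a submitted name to UTF-8, then validate it.
// Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
function is_valid_name($input)
{
    // An empty pattern with /u only matches if the subject is valid UTF-8.
    if (preg_match('//u', $input) !== 1) {
        $input = iconv('ISO-8859-1', 'UTF-8', $input);
        if ($input === false) {
            return false; // conversion failed, reject the input
        }
    }
    // Letters (\p{L}) and whitespace (\s) only, interpreted as UTF-8.
    return preg_match('/^[\p{L}\s]+$/u', $input) === 1;
}

var_dump(is_valid_name("S\xC3\xB8ren")); // UTF-8 bytes for "Søren"
var_dump(is_valid_name("S\xF8ren"));     // ISO-8859-1 bytes for "Søren"
var_dump(is_valid_name("R2D2"));         // contains digits, which are not letters
```

Declaring the charset on the form and the page (so the browser submits UTF-8 in the first place) is still the cleaner route; the fallback conversion only covers input that arrives in the other encoding anyway.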

