Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - In what JS engines, specifically, are toLowerCase & toUpperCase locale-sensitive?

In the code of some libraries (e.g. AngularJS, the link leads to the specific lines in the code), I can see that custom case-conversion functions are used instead of the standard ones. It's justified by an assumption that in browsers with Turkish locale, the standard functions don't work as expected:

console.log("SCRIPT".toLowerCase()); // "scrıpt"
console.log("script".toUpperCase()); // "SCRİPT"

But is it really or was it ever the case? Do the browsers really behave this way? If so, which of them do? What about node.js? Other JS engines?

The existence of the toLocaleLowerCase and toLocaleUpperCase methods implies that toLowerCase and toUpperCase are locale-invariant, doesn't it?

For what browsers, specifically, does the Angular team retain this check in the code: if ('i' !== 'I'.toLowerCase())...?


If your browser (device) uses the Turkish or Azerbaijani locale, please run this snippet and let me know if you find that the issue really exists.

if ('i' !== 'I'.toLowerCase()) {
  document.write('Ooops! toLowerCase is locale-sensitive in your browser. ' +
    'Please write your user-agent in the comments to this question: ' +
    navigator.userAgent);
} else {
  // The apostrophe is escaped so the string literal is not cut short.
  document.write('toLowerCase isn\'t locale-sensitive in your browser. ' +
    'Everything works as expected!');
}
<html lang="tr">


1 Reply


Note: I couldn't test any of this myself!


As per ECMAScript specification:

String.prototype.toLowerCase ( )

[...]

For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping.

The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).

[...]

String.prototype.toLocaleLowerCase ( )

This function works exactly the same as toLowerCase except that its result is intended to yield the correct result for the host environment’s current locale, rather than a locale-independent result. There will only be a difference in the few cases (such as Turkish) where the rules for that language conflict with the regular Unicode case mappings.

[...]
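To make the distinction in those two spec excerpts concrete, here is a minimal sketch. It assumes an engine with ECMA-402 (Intl) support and full locale data, so that toLocaleLowerCase and toLocaleUpperCase accept an explicit locale tag. Passing "tr" explicitly is my own addition so that the result does not depend on the machine's settings; it is not part of the quoted text.

```javascript
// Locale-independent conversions: no locale is consulted here.
const invariantLower = "SCRIPT".toLowerCase();
const invariantUpper = "script".toUpperCase();

// Assumption: the optional `locales` argument (ECMA-402) is supported.
const turkishLower = "SCRIPT".toLocaleLowerCase("tr");
const turkishUpper = "script".toLocaleUpperCase("tr");

console.log(invariantLower, invariantUpper); // script SCRIPT
console.log(turkishLower, turkishUpper);     // scrıpt SCRİPT
```

Only the locale-aware calls produce the dotless ı (U+0131) and the dotted İ (U+0130).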

And as per Unicode Character Database Special Casing:

[...]

Format

The entries in this file are in the following machine-readable format:

<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>

Unconditional mappings

[...]

Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
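As an illustration of that unconditional entry, the sketch below lowercases U+0130 with the plain toLowerCase() and prints the resulting code points. It assumes an engine that applies the full mappings from SpecialCasing.txt, as the spec excerpt above requires.

```javascript
// U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
const lowered = "\u0130".toLowerCase();

// List the code points of the result as four-digit uppercase hex strings.
const codePoints = Array.from(lowered, (ch) =>
  ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(lowered.length); // 2
console.log(codePoints);     // [ '0069', '0307' ]
```

So even the locale-independent method can change the length of a string: one code unit in, two out.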

[...]

Language-Sensitive Mappings

These are characters whose full case mappings depend on language and perhaps also context (which characters come before or after). For more information see the header of this file and the Unicode Standard.

Lithuanian

Lithuanian retains the dot in a lowercase i when followed by accents.

Remove DOT ABOVE after "i" with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

Introduce an explicit dot above when lowercasing capital I's and J's whenever there are more accents above. (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I

004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J

012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK

00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
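The Lithuanian entries can be tried out in the same way. This is only a sketch: it assumes the engine ships locale data for "lt" and accepts a locale tag in toLocaleLowerCase. The helper name hexCodePoints is hypothetical, introduced here just for readable output.

```javascript
// Hypothetical helper: render a string as space-separated hex code points.
function hexCodePoints(str) {
  return Array.from(str, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

const iGrave = "\u00CC"; // LATIN CAPITAL LETTER I WITH GRAVE

// Locale-independent: the simple mapping to U+00EC.
console.log(hexCodePoints(iGrave.toLowerCase()));           // 00EC

// Lithuanian: i, an explicit dot above, then the grave accent.
console.log(hexCodePoints(iGrave.toLocaleLowerCase("lt"))); // 0069 0307 0300
```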

Turkish and Azeri

I and i-dotless; I-dot and i are case pairs in Turkish and Azeri. The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE

0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i. This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE

0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I

0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I

0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

Note: the following case is already in the UnicodeData.txt file.

0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

EOF
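Each of the Turkish rules above can be checked one by one. The sketch assumes an engine whose toLocaleLowerCase and toLocaleUpperCase accept the "tr" tag. The table of cases is mine, built from the entries quoted above.

```javascript
// Each row: [description, actual result, expected result per the rules above].
const turkishCases = [
  ["I lowercases to dotless i",  "I".toLocaleLowerCase("tr"),       "\u0131"],
  ["I-dot lowercases to i",      "\u0130".toLocaleLowerCase("tr"),  "i"],
  ["I + dot_above loses dot",    "I\u0307".toLocaleLowerCase("tr"), "i"],
  ["i uppercases to dotted I",   "i".toLocaleUpperCase("tr"),       "\u0130"],
  // Already in UnicodeData.txt, so no locale is needed for this one.
  ["dotless i uppercases to I",  "\u0131".toUpperCase(),            "I"],
];

for (const [description, actual, expected] of turkishCases) {
  console.log(`${description}: ${actual === expected ? "ok" : "MISMATCH"}`);
}
```

If a row prints MISMATCH, the engine is probably missing Turkish locale data rather than disagreeing with the rules.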

Also, as per JavaScript for Absolute Beginners (by Terry McNavage):

> "I".toLowerCase() // "i"
> "i".toUpperCase() // "I"
> "I".toLocaleLowerCase() // "<dotless-i>"
> "i".toLocaleUpperCase() // "<dotted-I>"

Note: toLocaleLowerCase() and toLocaleUpperCase() convert case based on your OS settings. You'd have to change those settings to Turkish for the previous sample to work. Or just take my word for it!

And as per bobince's comment on the question "Convert JavaScript String to be all lower case?":

Accept-Language and navigator.language are two completely separate settings. Accept-Language reflects the user's chosen preferences for what languages they want to receive in web pages (and this setting is unfortunately inaccessible to JS). navigator.language merely reflects which localisation of the web browser was installed, and should generally not be used for anything. Both of these values are unrelated to the system locale, which is the bit that decides what toLocaleLowerCase() will do; that's an OS-level setting out of scope of the browser's prefs.


So setting lang="tr-TR" on the html element doesn't make for a real test case: the special casing example depends on an OS-level locale setting, which the page itself cannot change.

I think that only lowercasing dotted I or uppercasing dotless i could be locale-specific when using toLowerCase() or toUpperCase().

As per those credible/official sources, I think you're right: 'i' !== 'I'.toLowerCase() would always evaluate to false.

But, as I said, I couldn't test it here.
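For reference, here is a rough sketch of the kind of guard the question describes. It is not the library's actual code: asciiLowercase is a hypothetical helper that only touches A-Z, and the guard falls back to it only if the built-in method ever maps "I" to something other than "i".

```javascript
// Hypothetical helper: lowercases ASCII A-Z only and leaves all other
// characters untouched, so no locale can influence it.
function asciiLowercase(s) {
  return s.replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) | 32)
  );
}

// The check from the question: per the spec this condition should be false,
// in which case the built-in method is used.
const lowercase = "i" !== "I".toLowerCase()
  ? asciiLowercase
  : (s) => s.toLowerCase();

console.log(lowercase("SCRIPT"));      // script
console.log(asciiLowercase("SCRIPT")); // script
```

Note that the fallback deliberately ignores non-ASCII letters, which is fine for things like tag or attribute names but not for general text.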



...