Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - What is a surrogate pair?

I came across this code in a javascript open source project.

validator.isLength = function (str, min, max) {
    // match surrogate pairs in string or declare an empty array if none found in string
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // subtract the surrogate pairs string length from main string length
    var len = str.length - surrogatePairs.length;
    // now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
    return len >= min && (typeof max === 'undefined' || len <= max);
};

As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:

  1. Is my understanding of the code correct?

  2. What are surrogate pairs?

I have thus far only figured out that this is related to encoding.


1 Reply

  1. Yes, your understanding is correct. The function measures the length of the string in Unicode code points, counting each surrogate pair as one, and checks that length against min and max.

  2. JavaScript strings are sequences of UTF-16 code units, each 16 bits (two bytes) wide. A code point up to U+FFFF fits in a single code unit.

    Some characters in Unicode (most emoji, for example) have code points above U+FFFF, which do not fit in 16 bits. Each of these is encoded as two UTF-16 code units (4 bytes): a high surrogate in the range D800-DBFF followed by a low surrogate in the range DC00-DFFF. Such a pair of code units is called a surrogate pair.
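To make the encoding concrete, here is a minimal sketch of how a code point above U+FFFF is split into its two surrogates. The helper name toSurrogatePair is hypothetical (it is not part of the project quoted in the question or of any library), and U+1F600 is just an example emoji code point.

```javascript
// Hypothetical helper for illustration: split a code point above U+FFFF
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be in the range U+10000..U+10FFFF');
  }
  var offset = codePoint - 0x10000;     // leaves a 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

var pair = toSurrogatePair(0x1F600);    // [0xD83D, 0xDE00]
var s = String.fromCharCode(pair[0], pair[1]);

console.log(pair[0].toString(16), pair[1].toString(16)); // d83d de00
console.log(s.length);                                   // 2
```

The two values always land in the ranges the regular expression in the question looks for, which is why matching a high surrogate followed by a low surrogate finds exactly these characters.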

Try this

var len = "😀".length // There is an emoji in the string (if you don’t see it)

vs

var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;

In the first example, len will be 2 because the emoji consists of two UTF-16 code units. In the second example, len will be 1.
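As a cross-check, the same count can be obtained without a regular expression, because string iteration in ES2015 and later walks a string by code point rather than by code unit. This is only a sketch under the assumption that an ES2015+ runtime is available; it is not what the project in the question does. The emoji is written as the escape \uD83D\uDE00 (U+1F600) so the example does not depend on how the page is encoded.

```javascript
var str = 'a\uD83D\uDE00b'; // 'a', one emoji (U+1F600), 'b'

// .length counts UTF-16 code units, so the emoji counts twice.
var codeUnits = str.length;

// Array.from uses the string iterator, which yields whole code points.
var codePoints = Array.from(str).length;

// The approach from the question: subtract one per surrogate pair.
var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var viaRegex = str.length - pairs.length;

console.log(codeUnits);  // 4
console.log(codePoints); // 3
console.log(viaRegex);   // 3
```

Both approaches count code points, not what a reader perceives as characters: an emoji built from several code points (for example, one with a skin-tone modifier) still counts as more than one.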

You might want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

