Strings and Unicode

The ECMAScript specification requires String.charCodeAt() to return a 16-bit value. Unicode code points that do not fit in 16 bits (mostly emoji, but also some historical alphabets and rare Chinese/Japanese/Korean ideograms) are represented as surrogate pairs of two 16-bit code units. Consequently, String.length returns the number of UTF-16 code units in the string, not the number of code points.
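For example, the Statue of Liberty emoji (U+1F5FD) lies outside the 16-bit range, so in standard JavaScript/TypeScript (not DeviceScript) it occupies two code units:

```typescript
// Standard ECMAScript semantics: one emoji, two UTF-16 code units.
const liberty = "🗽"; // U+1F5FD
console.log(liberty.length);                     // 2
console.log(liberty.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(liberty.charCodeAt(1).toString(16)); // "ddfd" (low surrogate)
```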

If ECMAScript were designed today, charCodeAt() would probably return values of up to 21 bits, or possibly use yet another abstraction, since even with full 21-bit Unicode code points, several code points can still combine into a single glyph (a character displayed on the screen).

In DeviceScript, the method String.charCodeAt() returns a Unicode code point (up to 21 bits), not a UTF-16 code unit. Similarly, String.length returns the number of 21-bit code points. Thus, "🗽".length === 1 and "🗽".charCodeAt(0) === 0x1f5fd. Note that "\uD83D\uDDFD".length === 1 as well, since "\uD83D\uDDFD" === "🗽", which may be confusing.
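The same per-code-point results can be reproduced in standard JavaScript/TypeScript using codePointAt() and the string iterator, which operate on whole code points; this is only a stand-in for DeviceScript's semantics, not DeviceScript itself:

```typescript
// Standard JS/TS approximation of DeviceScript's code-point semantics:
// [...s] iterates by code point, and codePointAt() decodes surrogate pairs.
const statue = "🗽";
console.log([...statue].length);                // 1 code point
console.log(statue.codePointAt(0) === 0x1f5fd); // true
console.log("\uD83D\uDDFD" === "🗽");           // true — same string
```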

Also, building a string by repeated concatenation is quadratic in the length of the result; you can use String.join() instead, which is linear in the size of the output.
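A minimal sketch of the two approaches, written with the standard Array.prototype.join(); the DeviceScript String.join() mentioned above is assumed to behave like the usual join:

```typescript
const parts: string[] = [];
for (let i = 0; i < 1000; i++) parts.push("x");

// Quadratic: each += copies the entire accumulated string again.
let slow = "";
for (const p of parts) slow += p;

// Linear in the output size: one pass over the parts.
const fast = parts.join("");

console.log(slow === fast); // true
```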

See also discussion.