On string encoding in JS

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

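According to the ECMA standard, no matter how an engine stores strings internally, a JS string must behave as a UTF-16 encoded string, and each element of the string represents one 16-bit UTF-16 code unit.
For example:
> "中".length
1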
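(I don't know what this special character is. Posting it to WordPress caused problems, so I pasted it as an image. How did I type it in, then? See the escape sequences discussed later.)
"中" encodes in UTF-16 as a single 16-bit code unit, 0x4E2D, so its length is 1.
The character that looks like a box (similar to 口) takes 4 bytes in UTF-16, 0xD950 0xDF21. That is two 16-bit code units, so its length is 2.
(Below, "口" stands in for this special character.)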
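The two values 0xD950 and 0xDF21 form a UTF-16 surrogate pair. As a rough sketch, the standard UTF-16 decoding arithmetic combines them into one code point. The helper name `decodeSurrogatePair` is made up for illustration, and the resulting code point is computed here rather than stated in the original post.

```javascript
// Combine a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
// into a single code point, using the standard UTF-16 decoding formula.
// decodeSurrogatePair is a hypothetical helper name.
function decodeSurrogatePair(hi, lo) {
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

const cp = decodeSurrogatePair(0xd950, 0xdf21);
console.log(cp.toString(16)); // "64321"

// Cross-check against the built-in: the code point maps back to the same pair.
console.log(String.fromCodePoint(cp) === "\ud950\udf21"); // true
```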
> "口"[0]
"" (a lone high surrogate, which has no printable form of its own)
> "口".charCodeAt(0).toString(16)
"d950"
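So the length of a string is clearly the number of 16-bit code units it occupies, not the number of actual characters, as I had naively assumed.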
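To make the difference concrete, here is a small sketch that counts both quantities. The helper names `codeUnitCount` and `codePointCount` are invented for this example, and the special character is written as the pair 0xD950 0xDF21 quoted earlier.

```javascript
// Code units vs. code points for a BMP character and a surrogate pair.
const bmp = "中";               // one code unit: 0x4E2D
const astral = "\ud950\udf21";  // two code units: 0xD950 0xDF21

// .length counts 16-bit code units.
function codeUnitCount(s) {
  return s.length;
}

// The string iterator walks code points, so Array.from yields one entry per code point.
function codePointCount(s) {
  return Array.from(s).length;
}

console.log(codeUnitCount(bmp), codePointCount(bmp));       // 1 1
console.log(codeUnitCount(astral), codePointCount(astral)); // 2 1
```

Note that a code point is still not always what a reader would call a character (combining marks, for instance), so this only narrows the gap.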

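An escape of the form \uXXXX (four hex digits) likewise denotes one 16-bit code unit, not a character. Written with escapes, "口" is therefore "\ud950\udf21".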
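As an illustration, the same two-unit string can be produced in several equivalent ways in JavaScript. This is a sketch only: the variable names are arbitrary, and the code point U+64321 is derived from the pair 0xD950 0xDF21 rather than given in the original post.

```javascript
// Four ways to build the same string from the pair 0xD950 0xDF21.
const fromEscapes = "\ud950\udf21";                       // one \uXXXX escape per code unit
const fromCodePoint = "\u{64321}";                        // ES2015 code point escape
const fromCharCodes = String.fromCharCode(0xd950, 0xdf21); // built from raw code units
const fromJson = JSON.parse('"\\ud950\\udf21"');          // JSON uses the same \uXXXX form

console.log(fromEscapes === fromCodePoint); // true
console.log(fromEscapes === fromCharCodes); // true
console.log(fromEscapes === fromJson);      // true
console.log(fromEscapes.length);            // 2
```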

The conclusion I was after:

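For a string held as UTF-16 code units, such as an array of unichar in Objective-C, formatting each code unit with @"\\u%04x" yields the JSON escape sequence directly.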
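The same idea can be sketched in JavaScript, since its strings expose UTF-16 code units directly. The function name `toJsonEscapes` is hypothetical, and this is an illustration of the per-code-unit formatting rather than the Objective-C code itself.

```javascript
// Format every UTF-16 code unit as a JSON-style \uXXXX escape.
// toJsonEscapes is a made-up name for this sketch.
function toJsonEscapes(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    // charCodeAt returns the 16-bit code unit; pad to four hex digits like %04x.
    out += "\\u" + s.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out;
}

const original = "中\ud950\udf21";
const escaped = toJsonEscapes(original);
console.log(escaped); // \u4e2d\ud950\udf21

// Wrapping the escapes in quotes gives valid JSON that parses back to the original.
console.log(JSON.parse('"' + escaped + '"') === original); // true
```

Escaping per code unit works because JSON represents characters outside the Basic Multilingual Plane as a pair of \uXXXX escapes, one per surrogate.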

