Resolving string types and character encoding support in JavaScript

  • 2021-06-29 10:13:15
  • OfStack

Definition
A string is zero or more characters placed together inside single or double quotation marks.


'abc'
"abc"

Double quotation marks can be used inside a single-quoted string, and single quotation marks can be used inside a double-quoted string.

'key = "value"'
"It's a long journey"

Both of the above are legal strings.

To use single quotation marks inside a single-quoted string (or double quotation marks inside a double-quoted string), precede each internal quotation mark with a backslash to escape it.


'Did she say \'Hello\'?'
// "Did she say 'Hello'?"

"Did she say \"Hello\"?"
// "Did she say "Hello"?"

Since HTML attribute values use double quotation marks, many projects agree that JavaScript strings should use only single quotation marks, and this tutorial follows that convention. Using only double quotation marks is also fine. What matters is sticking to one style and not mixing them.

By default a string must be written on a single line; splitting it across multiple lines raises an error.


'a
b
c'
// SyntaxError: Unexpected token ILLEGAL

The code above splits a string across three lines, so JavaScript reports an error.

If a long string must be split across multiple lines, a backslash can be placed at the end of each line.


var longString = "Long \
long \
long \
string";

longString
// "Long long long string"

The code above shows that, with a trailing backslash, a string that would otherwise be written on one line can be split across several lines. The output is still a single line, exactly as if the string had been written on one line. Note that the backslash must be immediately followed by the line break; any other character after it (such as a space) causes an error.

The concatenation operator (+) can also join several single-line strings, so a long string can be written across multiple lines while the output remains a single line.


var longString = 'Long '
 + 'long '
 + 'long '
 + 'string';

If you want to output a multiline string, there is a workaround that uses multiline comments.


(function () { /*
line 1
line 2
line 3
*/}).toString().split('\n').slice(1, -1).join('\n')
// "line 1
// line 2
// line 3"

In the example above, the output string is multiple lines.

Escape
The backslash (\) has a special meaning inside a string: it is used to represent certain special characters, so it is also called the escape character.

Special characters that need to be escaped with backslashes include the following:

  • \0 null (\u0000)
  • \b backspace (\u0008)
  • \f form feed (\u000C)
  • \n line feed (\u000A)
  • \r carriage return (\u000D)
  • \t tab (\u0009)
  • \v vertical tab (\u000B)
  • \' single quote (\u0027)
  • \" double quote (\u0022)
  • \\ backslash (\u005C)

These characters are preceded by backslashes to give special meaning.


console.log('1\n2')
// 1
// 2

In the code above, \n means line break and the output is split into two lines.

The backslash has three other special uses.

(1) \HHH

A backslash followed by three octal digits (000 to 377) represents one character. HHH is the character's Unicode code point in octal, for example \251 for the copyright symbol. This form can express only 256 characters, and it is not allowed in strict mode.

(2) \xHH

\x followed by two hexadecimal digits (00 to FF) represents one character. HH is the character's Unicode code point, for example \xA9 for the copyright symbol. This form can also express only 256 characters.

(3) \uXXXX

\u followed by four hexadecimal digits (0000 to FFFF) represents one character. XXXX is the character's Unicode code point, for example \u00A9 for the copyright symbol.

Below are examples of these three special characters.


'\251' // "©"
'\xA9' // "©"
'\u00A9' // "©"

'\172' === 'z' // true
'\x7A' === 'z' // true
'\u007A' === 'z' // true

If a backslash is used in front of a non-special character, the backslash is omitted.


'\a'
// "a"

In the code above, a is a normal character, and the backslash before it has no special meaning. The backslash is automatically omitted.

If a backslash is required in the normal contents of a string, it needs to be preceded by another backslash to escape itself.
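
A minimal sketch of this rule. The string content and the variable name prefix are hypothetical, chosen only for illustration:

```javascript
// Two backslashes in the source code produce one backslash in the string.
var prefix = 'Prefix is \\d';

console.log(prefix);        // Prefix is \d
console.log(prefix.length); // 12
```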



Strings and Arrays
Strings can be treated as character arrays, so you can use the array square-bracket operator to return the character at a given position (positions are numbered from 0).
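
For illustration, a short sketch of bracket access on a string (the variable name greeting is an assumption, not from the original listing):

```javascript
var greeting = 'hello';

console.log(greeting[0]); // h
console.log(greeting[1]); // e
console.log(greeting[4]); // o

// Bracket access also works directly on a string literal.
console.log('hello'[1]); // e
```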



The result is undefined if the number in square brackets is beyond the end of the string, or if the value in brackets is not a number at all.
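
A small sketch of these cases, using a hypothetical three-character string:

```javascript
var abc = 'abc';

var pastEnd = abc[3];     // index equal to the length
var negative = abc[-1];   // negative index
var notNumber = abc['x']; // not a number at all

console.log(pastEnd);   // undefined
console.log(negative);  // undefined
console.log(notNumber); // undefined
```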



However, the similarity between strings and arrays ends there. In practice, you cannot change an individual character in a string.


var s = 'hello';

delete s[0];
s // "hello"

s[1] = 'a';
s // "hello"

s[5] = '!';
s // "hello"

The code above shows that an individual character in a string cannot be changed, added, or deleted; these operations fail silently (in strict mode they throw a TypeError instead).

Strings behave like character arrays only because a string is automatically converted to a string object when the square-bracket operator is applied to it.

The length Property
The length property returns the length of the string, which cannot be changed.
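
A minimal sketch of this behavior, assuming non-strict code. The try/catch is an addition for safety: in strict mode the same assignment throws a TypeError instead of being ignored.

```javascript
var word = 'hello';
console.log(word.length); // 5

try {
  // Ignored silently in non-strict code; throws a TypeError in strict mode.
  word.length = 3;
} catch (e) {
  // Strict mode reports the failed assignment instead of ignoring it.
}

console.log(word.length); // 5
```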



Assigning to the length property of a string has no effect, and no error is reported (outside strict mode).

Character Set
JavaScript uses the Unicode character set; that is, inside JavaScript every character is represented in Unicode.

JavaScript not only stores characters internally as Unicode; Unicode can also be used directly in a program. A character can be written in the form \uxxxx, where xxxx is the character's Unicode code point in hexadecimal. For example, \u00A9 represents the copyright symbol.
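
A short sketch of the \uxxxx form. The variable names are hypothetical; the second half shows that the same escape is also accepted inside an identifier.

```javascript
var copyright = '\u00A9';
console.log(copyright);         // ©
console.log(copyright === '©'); // true

// The escape \u006F stands for the letter o, so this declares the variable foo.
var f\u006F\u006F = 'abc';
console.log(foo); // abc
```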



Inside JavaScript, each character is stored in 16-bit (2-byte) UTF-16 format. That is, the unit of character length in JavaScript is fixed at 16 bits, or 2 bytes.

However, UTF-16 has two lengths. Characters between U+0000 and U+FFFF take 16 bits (2 bytes). Characters between U+10000 and U+10FFFF take 32 bits (4 bytes), where the first two bytes lie between 0xD800 and 0xDBFF and the last two bytes lie between 0xDC00 and 0xDFFF. For example, the character for U+1D306 is 𝌆, which is written in UTF-16 as 0xD834 0xDF06. A browser correctly recognizes these four bytes as one character, but because the character unit inside JavaScript is fixed at 16 bits, JavaScript treats the four bytes as two characters.
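
The effect can be observed with U+1D306 itself. This is a sketch; the variable name tetragram is an assumption.

```javascript
var tetragram = '\uD834\uDF06'; // U+1D306 written as its two 16-bit halves

console.log(tetragram);                 // 𝌆
console.log(tetragram.length);          // 2
console.log(/^.$/.test(tetragram));     // false
console.log(tetragram.charCodeAt(0));   // 55348 (0xD834)
console.log(tetragram.charCodeAt(1));   // 57094 (0xDF06)

// charAt returns only one half of the pair, never the whole character.
console.log(tetragram.charAt(0) === tetragram); // false
```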



For a character between U+10000 and U+10FFFF, JavaScript always sees two characters: the length property of the string is 2, a regular expression meant to match a single character fails (JavaScript considers there to be more than one character), the charAt method cannot return the whole character, and the charCodeAt method returns the decimal value of each two-byte half separately.

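This must be taken into account when processing such characters. For a 4-byte Unicode character, let C be the character's code point, H the first two bytes, and L the last two bytes. The conversion between them is as follows.


// Convert a character above U+FFFF from Unicode to UTF-16
H = Math.floor((C - 0x10000) / 0x400) + 0xD800
L = (C - 0x10000) % 0x400 + 0xDC00

// Convert a character above U+FFFF from UTF-16 to Unicode
C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000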
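
As a worked example, the standard UTF-16 conversion between C, H, and L can be wrapped in two small functions and applied to U+1D306. The function names toSurrogatePair and fromSurrogatePair are hypothetical, introduced only for this sketch.

```javascript
// Split a code point above U+FFFF into its first (H) and last (L) two bytes.
function toSurrogatePair(c) {
  var h = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  var l = (c - 0x10000) % 0x400 + 0xDC00;
  return [h, l];
}

// Combine H and L back into the code point C.
function fromSurrogatePair(h, l) {
  return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
}

var pair = toSurrogatePair(0x1D306);
console.log(pair[0].toString(16)); // d834
console.log(pair[1].toString(16)); // df06
console.log(fromSurrogatePair(0xD834, 0xDF06).toString(16)); // 1d306
```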

The following regular expression recognizes all UTF-16 characters.


([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
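
One possible use of this pattern, sketched under the assumption that the input contains no unpaired halves: with the global flag added, match splits a string into whole characters. The names charPattern and chars are hypothetical.

```javascript
// Same pattern as above, with the g flag so that match returns every character.
var charPattern = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

var chars = 'a\uD834\uDF06b'.match(charPattern);

console.log(chars.length);                // 3
console.log(chars[1] === '\uD834\uDF06'); // true
```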

Since the JavaScript engine (strictly speaking, the ES5 specification) does not automatically recognize Unicode characters on the supplementary planes (code points greater than 0xFFFF), every string processing function produces incorrect results when it encounters such characters. To handle strings correctly, you must check whether a 16-bit unit falls between 0xD800 and 0xDFFF.

String traversal, for instance, has to treat such a pair of units as a single character.
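
A minimal sketch of such a traversal. The function name getSymbols is hypothetical, and this version is one possible implementation rather than the original listing: it pairs a unit in 0xD800 to 0xDBFF with a following unit in 0xDC00 to 0xDFFF and keeps everything else as a single character.

```javascript
function getSymbols(str) {
  var output = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    var next = str.charCodeAt(i + 1); // NaN past the end of the string
    var isPair = code >= 0xD800 && code <= 0xDBFF &&
                 next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) {
      output.push(str.charAt(i) + str.charAt(i + 1));
      i++; // skip the second half that was just consumed
    } else {
      output.push(str.charAt(i));
    }
  }
  return output;
}

var symbols = getSymbols('a\uD834\uDF06b');
console.log(symbols.length);    // 3
console.log(symbols[1].length); // 2
```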



Other string operations, such as replacement (String.prototype.replace) and substring extraction (String.prototype.substring, String.prototype.slice), must be handled similarly.

Base64 Transcoding
Base64 is an encoding method that converts arbitrary characters into printable characters. It is used mainly not for encryption but to avoid special characters and so simplify program processing.

JavaScript provides two native Base64-related methods.

  • btoa(): converts a string or binary value to Base64 encoding
  • atob(): converts Base64 encoding back to the original encoding

var string = 'Hello World!';

btoa(string) // "SGVsbG8gV29ybGQh"
atob('SGVsbG8gV29ybGQh') // "Hello World!"

These two methods cannot handle characters outside the Latin1 range (code points above 0xFF), which covers most non-ASCII text; such input raises an error.

btoa('你好')
// Uncaught DOMException: The string to be encoded contains characters outside of the Latin1 range.

To convert non-ASCII characters to Base64, a transcoding step must be inserted in between before using these two methods.

function b64Encode(str) {
 return btoa(encodeURIComponent(str));
}

function b64Decode(str) {
 return decodeURIComponent(atob(str));
}

b64Encode('你好') // "JUU0JUJEJUEwJUU1JUE1JUJE"
b64Decode('JUU0JUJEJUEwJUU1JUE1JUJE') // "你好"
