A detailed introduction to the encoding of PHP strings

  • 2020-06-01 08:41:29
  • OfStack

As you know, different character encodings use a different number of bytes per character. For example, an ASCII character takes 1 byte, a Chinese character usually takes 3 bytes in UTF-8, and the same character takes 2 bytes in GBK.
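
The byte counts above can be checked with strlen, which counts bytes, and mb_strlen, which counts characters. This is a minimal sketch that assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$ascii = 'A';
$utf8  = '中';                                        // one Chinese character in UTF-8
$gbk   = mb_convert_encoding($utf8, 'GBK', 'UTF-8');  // the same character as GBK bytes

echo strlen($ascii), "\n";            // 1
echo strlen($utf8), "\n";             // 3
echo strlen($gbk), "\n";              // 2
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 1 (characters, not bytes)
```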

PHP comes with several substring functions, of which substr and mb_substr are the most commonly used.

Using substr on Chinese text produces garbled output because substr cuts by byte. In UTF-8, where a Chinese character takes 3 bytes, cutting a single byte returns only 1/3 of the character, which cannot be displayed correctly.
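
The following sketch contrasts a byte-based cut with a character-based cut. It assumes a UTF-8 source file and the mbstring extension; the sample string is illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$str = '中国人';

$byBytes = substr($str, 0, 4);              // 4 bytes: all of "中" plus 1/3 of "国"
$byChars = mb_substr($str, 0, 2, 'UTF-8');  // 2 characters: "中国"

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false): ends in a truncated sequence
var_dump($byChars === '中国');                   // bool(true)
```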

The $encoding parameter of mb_substr(string $str, int $start [, int $length [, string $encoding]]) specifies the character encoding; if it is omitted, the internal character encoding is used.

If you do not know the encoding format of the string, you can use mb_detect_encoding to check:

$encoding = mb_detect_encoding($string, array('ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'));

And then pass the detected encoding to mb_substr:

mb_substr($string, $start, $length, $encoding);
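
The two steps can be combined into one helper. This is only a sketch: substr_detected is a hypothetical name, not a PHP function, and the fallback to the internal encoding is an assumption. Detection is heuristic, so a short string may be valid in several of the candidate encodings.

```php
<?php
// Hypothetical helper (not part of PHP): detect the encoding, then cut by characters.
function substr_detected(string $str, int $start, int $length): string
{
    $candidates = ['ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'];
    // Strict mode (third argument) rejects encodings in which $str is not valid.
    $encoding = mb_detect_encoding($str, $candidates, true);
    if ($encoding === false) {
        // Assumption: fall back to the internal encoding when detection fails.
        $encoding = mb_internal_encoding();
    }
    return mb_substr($str, $start, $length, $encoding);
}

echo substr_detected('Hello world', 0, 5), "\n"; // Hello
```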

You can implement mb_substr yourself in PHP code, but it is less efficient than the built-in function.

Encoding-related PHP functions

ord(substr($str, $i, 1)) > 0xa0

ord($string) returns the value of the first byte of the string only (0 to 255). In UTF-8 a Chinese character takes 3 bytes, and each of those bytes is 0x80 or greater, so a byte value above 0x7f indicates part of a multibyte character. The check against 0xa0 above is the corresponding test for GB2312, where the bytes of a Chinese character start at 0xa1.
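
As a sketch of how ord can be used to walk a UTF-8 string byte by byte, the helpers below read each lead byte to decide how many bytes the character occupies. Both function names are hypothetical, and the code assumes well-formed UTF-8 input and a UTF-8 source file.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 character whose first byte is $byte.
function utf8_char_bytes(int $byte): int
{
    if ($byte < 0x80) { return 1; }            // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx: most Chinese characters
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 1;                                  // stray continuation or invalid byte: step over it
}

// Hypothetical helper: split a UTF-8 string into characters by walking its bytes.
function utf8_split(string $str): array
{
    $chars = [];
    $len = strlen($str);
    for ($i = 0; $i < $len; $i += $n) {
        $n = utf8_char_bytes(ord($str[$i]));
        $chars[] = substr($str, $i, $n);
    }
    return $chars;
}

print_r(utf8_split('a中b')); // three elements: a, 中, b
```

In practice mb_substr or mb_str_split does the same job faster, which is the efficiency point made above.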


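Regular expressions:

Matching one character at a time in a double-byte encoding such as GB2312 (an optional high byte followed by any byte): preg_match_all('/[\x80-\xff]?./', $string, $match);

Matching runs of ASCII characters: preg_match_all('/[\x01-\x7f]+/', $string, $match);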
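
For UTF-8 text, a Unicode-aware pattern is an alternative to the byte ranges. The sketch below is not from the original article: it uses the "u" modifier and the \p{Han} script property, and assumes a UTF-8 source file and a PCRE build with Unicode support.

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = 'PHP中文abc编码';

// Byte-oriented: runs of ASCII bytes.
preg_match_all('/[\x01-\x7f]+/', $string, $ascii);

// Unicode-aware: runs of Han characters; "u" treats pattern and subject as UTF-8.
preg_match_all('/\p{Han}+/u', $string, $han);

print_r($ascii[0]); // PHP, abc
print_r($han[0]);   // 中文, 编码
```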


Encoding conversion

iconv ( string $in_charset , string $out_charset , string $str )

For example, GB2312 to UTF-8: iconv("GB2312", "UTF-8", $text)
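
A round trip makes the conversion visible at the byte level. This sketch assumes a UTF-8 source file and the iconv extension; note that iconv returns false when the conversion fails, so real code should check the result.

```php
<?php
// Assumption: this file is saved as UTF-8 and the iconv extension is available.
$utf8 = '中国';
$gb   = iconv('UTF-8', 'GB2312', $utf8);  // convert to GB2312 bytes
$back = iconv('GB2312', 'UTF-8', $gb);    // and back to UTF-8

echo bin2hex($utf8), "\n"; // e4b8ade59bbd (3 bytes per character)
echo bin2hex($gb), "\n";   // d6d0b9fa     (2 bytes per character)
var_dump($back === $utf8); // bool(true)
```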

URL encoding: urlencode

urlencode returns a string in which all non-alphanumeric characters except -_. are replaced with a percent sign (%) followed by two hexadecimal digits, and spaces are encoded as plus signs (+). This is the same encoding used for WWW form POST data, that is, the application/x-www-form-urlencoded media type.

Note, however, that only part of a URL should be encoded (for example, a query parameter value); otherwise the colons and slashes in the URL are escaped as well.
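
The sketch below shows the difference between encoding a whole URL and encoding only a parameter value. The base URL and the parameter name q are hypothetical, chosen only for illustration; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical base URL and parameter name, for illustration only.
$base  = 'http://example.com/search';
$value = 'a b&c=中国';

// Encoding the whole URL escapes the scheme and path separators too.
$whole = urlencode($base . '?q=' . $value);

// Encoding only the parameter value keeps the URL structure intact.
$partial = $base . '?q=' . urlencode($value);

// http_build_query() applies the same encoding to each value of an array.
$built = $base . '?' . http_build_query(['q' => $value]);

echo $whole, "\n";   // starts with http%3A%2F%2F, no longer a usable URL
echo $partial, "\n"; // http://example.com/search?q=a+b%26c%3D%E4%B8%AD%E5%9B%BD
echo $built, "\n";   // same as $partial
```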

The result of urlencode depends on the encoding of the input string. There are two common forms: the traditional one based on GB2312, and the other based on UTF-8. For example:

$url = '中国'; // "China"
echo urlencode($url);
// UTF-8:  %E4%B8%AD%E5%9B%BD
// GB2312: %D6%D0%B9%FA
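
Both results can be reproduced from a single file by converting the string before encoding it. This is a sketch that assumes a UTF-8 source file and the iconv extension.

```php
<?php
// Assumption: this file is saved as UTF-8 and the iconv extension is available.
$utf8 = '中国';
$gb   = iconv('UTF-8', 'GB2312', $utf8); // the same text as GB2312 bytes

echo urlencode($utf8), "\n"; // %E4%B8%AD%E5%9B%BD
echo urlencode($gb), "\n";   // %D6%D0%B9%FA
```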

For example, open Baidu in a browser and search for "中国" ("China"). The address bar shows: http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8&rsv_sug3=16&rsv_sug=0&rsv_sug4=302&rsv_sug1=11&inputT=22928

That is, the browser automatically encoded "中国" as %E4%B8%AD%E5%9B%BD.

The difference between urlencode and rawurlencode: urlencode encodes a space as a plus sign ("+"), while rawurlencode encodes a space as "%20".
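
A short sketch of the difference, using an illustrative input that contains both a space and a literal plus sign:

```php
<?php
$s = 'a b+c';

echo urlencode($s), "\n";    // a+b%2Bc
echo rawurlencode($s), "\n"; // a%20b%2Bc
```

In both cases the literal plus sign becomes %2B, so it cannot be confused with an encoded space.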

URL decoding: urldecode and rawurldecode

1. Use urldecode() or rawurldecode() to decode. rawurldecode() does not decode a plus sign ('+') into a space, while urldecode() does.
2. Neither function converts the character encoding: the decoded bytes are in whatever encoding the URL was encoded with, usually UTF-8. If the URL contains Chinese text in a non-UTF-8 encoding, the decoded string must be converted.

For example, save the PHP file as GB2312 and view the output as UTF-8. The first part is garbled and the second part is normal.

$url = '中国';
echo $a = urldecode(urlencode($url)), ' ';
echo iconv('gb2312', 'utf-8', $a);

Output: �й� 中国
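
Both points can be checked without changing the file encoding by starting from already-encoded input. This is a sketch: the input strings are illustrative, and the second part assumes the iconv extension and a UTF-8 source file.

```php
<?php
// Point 1: only urldecode() turns '+' into a space.
echo urldecode('a+b%20c'), "\n";    // a b c
echo rawurldecode('a+b%20c'), "\n"; // a+b c

// Point 2: decoding yields raw bytes; here they are GB2312, so convert them to UTF-8.
$decoded = urldecode('%D6%D0%B9%FA');
echo iconv('GB2312', 'UTF-8', $decoded), "\n"; // 中国
```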

