PHP string length calculation for gb2312 and utf-8: implementation code

  • 2020-05-12 06:22:45
  • OfStack

Handling Chinese strings has always been a problem for programmers who are new to PHP development. The following is a brief analysis of how PHP measures the length of Chinese strings:

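PHP's built-in strlen() function measures a string by counting the bytes it occupies, and one English (ASCII) character takes one byte. mb_strlen() counts characters rather than bytes, but only if the mbstring extension is available and the function is given the string's actual encoding. For example: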

$enStr = 'Hello,China!';
echo strlen ($enStr); // output: 12

For Chinese, however, websites generally use one of two encodings: gbk/gb2312 or utf-8. utf-8 is popular with many webmasters because it covers far more characters. The two encode Chinese characters differently, so the same Chinese text occupies a different number of bytes under gbk and under utf-8.

Under gbk encoding, each Chinese character takes 2 bytes, for example:

$zhStr = '你好,中国!'; // 6 full-width characters, script saved as gbk
echo strlen ($zhStr); // output: 12

Under utf-8 encoding, each Chinese character takes 3 bytes, for example:

$zhStr = '你好,中国!'; // 6 full-width characters, script saved as utf-8
echo strlen ($zhStr); // output: 18
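
The two byte counts can be reproduced from a single script. This is a minimal sketch, assuming the file is saved as utf-8 and the mbstring extension is loaded; the variable names `$utf8` and `$gbk` are illustrative only.

```php
<?php
// Assumption: this file is saved as utf-8 and mbstring is available.
$utf8 = '你好,中国!'; // 6 full-width characters

// Re-encode the same text as gbk so both byte lengths can be compared.
$gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

echo strlen($utf8), "\n"; // 18 (3 bytes per character)
echo strlen($gbk), "\n";  // 12 (2 bytes per character)
```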

So how do you count the characters in a Chinese string? One might suggest dividing the byte length by 2 for gbk and by 3 for utf-8. But real strings are rarely that well-behaved: most of the time they mix English and Chinese.
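
A mixed string shows why dividing the byte length does not work. The sketch below assumes a utf-8 source file and the mbstring extension; `$mixed` is an illustrative name.

```php
<?php
// Assumption: this file is saved as utf-8 and mbstring is available.
$mixed = 'Hello,中国!'; // 5 ASCII letters + 4 full-width characters

echo strlen($mixed), "\n";             // 17 bytes: 5 * 1 + 4 * 3
echo mb_strlen($mixed, 'UTF-8'), "\n"; // 9 characters
```

Seventeen bytes divided by 3 gives neither an integer nor the real character count of 9, so the length has to be computed character by character.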

The following code comes from WordPress. The idea is to use a regular expression to split the string into individual characters and then count them; that count is the length of the string. The code (which only works for utf-8 encoded strings) is as follows:
 
$zhStr = '你好,中国!';
$str = 'Hello,中国!';

// Count the characters in a utf-8 string
function utf8_strlen($string = '') {
    // Split the string into characters: /u treats the subject as utf-8, /s lets . match newlines too
    preg_match_all("/./us", $string, $match);
    // Return the number of characters
    return count($match[0]);
}
echo utf8_strlen($zhStr); // Output: 6
echo utf8_strlen($str); // Output: 9
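
With the /u modifier, preg_match_all() returns false when the subject is not valid utf-8, and it otherwise returns the number of matches. A guarded variant can use that return value directly. This is only a sketch; `utf8_strlen_safe` is a hypothetical name, not part of WordPress or PHP.

```php
<?php
// Hypothetical helper: character count of a utf-8 string,
// or null when the input is not valid utf-8.
function utf8_strlen_safe(string $string): ?int {
    // preg_match_all() returns the number of full matches, or false on failure.
    $n = preg_match_all('/./us', $string, $match);
    return $n === false ? null : $n;
}

var_dump(utf8_strlen_safe('Hello,中国!')); // int(9)
var_dump(utf8_strlen_safe("\xFF"));        // NULL (0xFF never occurs in utf-8)
```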

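A second implementation, also for utf-8 strings only, walks the bytes directly. Note that it does not return the number of characters: every multi-byte character (such as a Chinese character) counts as 2 and every ASCII character as 1. It reuses the name utf8_strlen, so it cannot be loaded in the same script as the version above.
 
/*
 * For programs that use utf-8 encoding.
 * Returns the length of the string, counting each multi-byte character
 * (such as a Chinese character) as 2 and each ASCII character as 1.
 * itlearner annotation
 */
function utf8_strlen($str) {
    $count = 0;
    $len = strlen($str); // byte length, computed once
    for ($i = 0; $i < $len; $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // Multi-byte character: add one extra unit, then skip its continuation bytes
            $count++;
            if ($value >= 192 && $value <= 223) $i++;            // 2-byte sequence
            elseif ($value >= 224 && $value <= 239) $i = $i + 2; // 3-byte sequence
            elseif ($value >= 240 && $value <= 247) $i = $i + 3; // 4-byte sequence
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}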
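
The byte-walking loop above weights each multi-byte character as 2. If a plain character count is wanted from the same kind of loop, one option is to count every byte that is not a utf-8 continuation byte. This is a sketch under the assumption that the input is already valid utf-8; `utf8_char_count` is a hypothetical name.

```php
<?php
// Hypothetical helper: counts characters in a string assumed to be valid utf-8.
function utf8_char_count(string $str): int {
    $count = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        // Continuation bytes look like 10xxxxxx (0x80 to 0xBF);
        // every other byte starts a new character.
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo utf8_char_count('Hello,中国!'), "\n"; // 9
```

Unlike the version above, this one does not validate its input, so malformed byte sequences produce a count without any error.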
