A PHP Method for Getting the Length of Mixed Chinese-English Strings

  • 2021-06-28 12:02:37
  • OfStack

Tonight, while writing the form validation class for a framework, I naturally reached for PHP's strlen function to check whether a string's length falls within a specified range.


$str = 'Hello world!';
echo strlen($str); //  output 12

However, PHP's strlen measures length in bytes, not characters, and mb_strlen only counts characters correctly when it is given the string's actual encoding. The number of bytes a Chinese character occupies depends on the encoding: 2 bytes under GBK/GB2312 and 3 bytes under UTF-8.

$str = '你好，世界！';
echo strlen($str); // outputs 12 under GBK or GB2312, 18 under UTF-8
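To see both byte counts side by side, the following sketch re-encodes the same text as GBK with iconv and compares the strlen results. It assumes the script file is saved as UTF-8 and that the iconv extension supports GBK on your system; the variable names are only illustrative.

```php
<?php
// Assumes this source file is saved as UTF-8.
$utf8 = '你好，世界！';                  // 6 characters, including full-width punctuation
$gbk  = iconv('UTF-8', 'GBK', $utf8);  // the same text re-encoded as GBK

echo strlen($utf8), "\n"; // 18: 3 bytes per character under UTF-8
echo strlen($gbk), "\n";  // 12: 2 bytes per character under GBK
```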

We often need to judge the length of a string by the number of characters, not bytes, such as this PHP code under UTF-8:

$name = '张廷昌'; // a three-character Chinese name (Zhang Tingchang)
$len = strlen($name);
// outputs FALSE, because under UTF-8 three Chinese characters occupy 9 bytes
if($len >= 3 && $len <= 8){
 echo 'TRUE';
}else{
 echo 'FALSE';
}

So what is a convenient and practical way to get the length of a string that contains Chinese characters? One option is to use a regular expression to pick out the Chinese characters, divide their byte count by 2 under GBK/GB2312 or by 3 under UTF-8, and then add the length of the non-Chinese part. That is too cumbersome.
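For illustration only, here is a rough sketch of that divide-by-byte-width idea for UTF-8 input. The function name utf8_length_by_division is hypothetical, and the sketch assumes every non-ASCII character occupies exactly 3 bytes, which holds for common Chinese characters and full-width punctuation but not for every Unicode character. It also assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical helper: character count of a UTF-8 string, assuming every
// non-ASCII character occupies exactly 3 bytes.
function utf8_length_by_division(string $str): int
{
    // Count the bytes that belong to multi-byte characters (0x80-0xFF).
    $multiByteBytes = preg_match_all('/[\x80-\xff]/', $str);
    // Everything else is a single-byte ASCII character.
    $asciiBytes = strlen($str) - $multiByteBytes;

    return $asciiBytes + intdiv($multiByteBytes, 3);
}

echo utf8_length_by_division('Hello，世界！'); // 9
```

The fixed divisor is what makes this approach fragile: it has to change per encoding, and it is wrong for any character whose width differs from the assumed one.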

WordPress contains a piece of code like this:


$str = 'Hello，世界！';
preg_match_all('/./us', $str, $match);
echo count($match[0]); // outputs 9

The idea is to use a regular expression to split the string into individual characters and then count the matches directly with count, which gives exactly the character count we want.
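The same idea can be wrapped in a small function. This is a minimal sketch, not WordPress's actual code: the name utf8_char_count is hypothetical, and returning null for input that is not valid UTF-8 is a design choice made here for illustration. It relies on preg_match_all returning false when a subject string is invalid under the u modifier.

```php
<?php
// Hypothetical wrapper around the preg_match_all approach.
// Returns null when the input is not valid UTF-8.
function utf8_char_count(string $str): ?int
{
    // u: treat pattern and subject as UTF-8; s: let "." match newlines too.
    $count = preg_match_all('/./us', $str);

    return $count === false ? null : $count;
}

var_dump(utf8_char_count('Hello，世界！')); // int(9)
var_dump(utf8_char_count("\xC4\xE3"));     // NULL: these are GBK bytes, not valid UTF-8
```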

However, the code above only works for UTF-8 input. A GBK/GB2312 string is generally not valid UTF-8, so the match with the u modifier fails; without the u modifier, each Chinese character is matched as two separate bytes and the count doubles. So I came up with this method:


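// Try to treat the input as GBK and convert it to UTF-8 (@ suppresses the notice raised on failure)
$tmp = @iconv('gbk', 'utf-8', $str);
// Keep the converted string only if the conversion produced a result
if(!empty($tmp)){
 $str = $tmp;
}
preg_match_all('/./us', $str, $match);
echo count($match[0]);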

This works with both GBK/GB2312 and UTF-8 and passes tests on a small amount of data, but I am not yet sure it is completely correct. Pointers from more experienced developers are welcome.
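One possible weakness is that some UTF-8 byte sequences may also decode as GBK without error, in which case the conversion would change a string that was already UTF-8. A variation, offered only as a sketch, checks whether the string is valid UTF-8 first and converts from GBK only when it is not. The function name mixed_char_count is hypothetical, and the approach assumes the input is always one of these two encodings; a GBK string that happens to be valid UTF-8 would still be miscounted.

```php
<?php
// Hypothetical helper: count characters in a string that is either
// UTF-8 or GBK/GB2312.
function mixed_char_count(string $str): int
{
    // An empty pattern with the u modifier matches only if $str is valid UTF-8.
    if (preg_match('//u', $str) !== 1) {
        // Not valid UTF-8, so assume GBK and convert.
        $converted = @iconv('GBK', 'UTF-8', $str);
        if ($converted === false) {
            throw new InvalidArgumentException('String is neither valid UTF-8 nor GBK');
        }
        $str = $converted;
    }

    return (int) preg_match_all('/./us', $str);
}

$utf8 = '你好';               // assumes this source file is saved as UTF-8
$gbk  = "\xC4\xE3\xBA\xC3";   // the same two characters encoded as GBK
echo mixed_char_count($utf8), ' ', mixed_char_count($gbk); // 2 2
```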

I wrote this to make the framework compatible with multiple encodings. In day-to-day development, though, a project usually settles on a single encoding, so the following function can be used to get the string length easily:


int iconv_strlen ( string $str [, string $charset = ini_get("iconv.internal_encoding") ] )
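A short usage sketch, assuming the script file is saved as UTF-8 and that iconv supports GBK on your system. Passing the charset explicitly avoids depending on the configured default.

```php
<?php
// Assumes this source file is saved as UTF-8.
$utf8 = 'Hello，世界！';
$gbk  = iconv('UTF-8', 'GBK', $utf8); // the same text re-encoded as GBK

echo iconv_strlen($utf8, 'UTF-8'), "\n"; // 9
echo iconv_strlen($gbk, 'GBK'), "\n";    // 9
```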

