How to Truncate Chinese Strings in PHP Without Garbled Characters

  • 2021-10-13 06:58:37
  • OfStack

In PHP, using substr() to cut a string that contains Chinese may produce garbled characters. This is because Chinese and Western characters occupy different numbers of bytes, while the start and length parameters of substr() are counted in bytes. In GB2312 encoding, a Chinese character occupies 2 bytes and an English letter occupies 1 byte. In UTF-8 encoding, a Chinese character usually occupies 3 bytes (rare ones take 4), and an English letter or half-width punctuation mark occupies 1 byte.

Cutting Chinese text directly with substr() can produce garbled characters, mainly because substr() may "saw" a Chinese character in half. Solutions:

1. Use mb_substr() from the mbstring extension; the result will not be garbled.

2. Write the truncation function yourself, although it is less efficient than the mbstring extension.

3. If you only need to output the truncated string, you can append a null byte: substr($str, 0, 30) . chr(0).

=============================

substr() can cut text, but it often runs into problems when the text contains Chinese characters. In that case, use mb_substr() or mb_strcut(). Their usage is similar to substr(), except that an extra parameter can be passed at the end to specify the string encoding. Note that the mbstring extension must be enabled on the server, for example by enabling php_mbstring.dll in php.ini on Windows.

For example:


<?php
// Save this file as UTF-8. mb_substr() takes 7 characters.
echo mb_substr('这样一来我的字符串就不会有乱码^_^', 0, 7, 'utf-8');
?>

Output: 这样一来我的字

<?php
// mb_strcut() takes at most 7 bytes and stops before a character it cannot fit whole.
echo mb_strcut('这样一来我的字符串就不会有乱码^_^', 0, 7, 'utf-8');
?>

Output: 这样

As the example shows, mb_substr() counts in characters while mb_strcut() counts in bytes, but neither leaves half a character behind.
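The difference matters when the limit is a byte budget (for example, a fixed-size storage field) rather than a character count. The sketch below is only an illustration: it assumes a UTF-8 source file and the mbstring extension, and the variable names are made up for this example. It tries every byte budget and checks that mb_strcut() never exceeds it and never returns a broken character.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$text = '这样一来我的字符串就不会有乱码^_^';
$total = strlen($text);

$allValid = true;      // every cut is valid UTF-8
$withinBudget = true;  // every cut fits in the requested number of bytes

for ($budget = 1; $budget <= $total; $budget++) {
    $piece = mb_strcut($text, 0, $budget, 'UTF-8');
    $allValid = $allValid && mb_check_encoding($piece, 'UTF-8');
    $withinBudget = $withinBudget && strlen($piece) <= $budget;
}

// Same numeric limit, different units.
$sevenChars = mb_substr($text, 0, 7, 'UTF-8'); // 7 characters
$sevenBytes = mb_strcut($text, 0, 7, 'UTF-8'); // at most 7 bytes

var_dump($allValid, $withinBudget);
echo strlen($sevenChars), ' bytes vs ', strlen($sevenBytes), " bytes\n";
```

Use mb_substr() when the limit is what a reader sees (characters) and mb_strcut() when the limit is what storage or a protocol allows (bytes).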

=============================

Truncating Chinese Strings Without Garbled Characters in PHP (for GB2312)


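// Truncate a GB2312-encoded string by bytes without splitting a double-byte character.
// $start must fall on a character boundary. The result can be one byte longer than
// $length when a double-byte character begins on the last counted byte.
function GBsubstr($string, $start, $length) {
    if (strlen($string) <= $length) {
        return $string;
    }
    $str = '';
    // Never read past the end of the string.
    $len = min($start + $length, strlen($string));
    for ($i = $start; $i < $len; $i++) {
        if (ord($string[$i]) > 0xa0) {
            // Lead byte of a double-byte character: copy both bytes together.
            $str .= substr($string, $i, 2);
            $i++;
        } else {
            // Single-byte (ASCII) character.
            $str .= $string[$i];
        }
    }
    return $str . '...';
}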
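For comparison, when the mbstring extension is available the same GB2312 data can be cut without a hand-written loop. This is a sketch rather than a replacement for the function above: it assumes mbstring is loaded and uses 'EUC-CN', the mbstring name for GB2312. The sample bytes are 中文 in GB2312 followed by "abc".

```php
<?php
// Assumption: mbstring is available. "\xD6\xD0\xCE\xC4" is 中文 encoded in GB2312.
$gb = "\xD6\xD0\xCE\xC4abc";

// Byte budget of 3: only the first 2-byte character fits whole.
$byBytes = mb_strcut($gb, 0, 3, 'EUC-CN');

// Character count of 3: two Chinese characters and one ASCII letter.
$byChars = mb_substr($gb, 0, 3, 'EUC-CN');

// Convert to UTF-8 only for display.
$byCharsUtf8 = mb_convert_encoding($byChars, 'UTF-8', 'EUC-CN');

echo bin2hex($byBytes), "\n";
echo bin2hex($byChars), "\n";
echo $byCharsUtf8, "\n";
```

Printing with bin2hex() makes it easy to confirm that no lead byte was left without its trail byte.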

Truncating Chinese Strings Without Garbled Characters (Works with UTF-8)


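// Cut $length characters starting at character $start.
// Tries mbstring first, then iconv, then falls back to a per-encoding regex.
// $start has no default: an optional parameter before a required one is deprecated in PHP 8.
function substr_text($str, $start, $length, $charset = "utf-8", $suffix = "")
{
    if (function_exists("mb_substr")) {
        return mb_substr($str, $start, $length, $charset) . $suffix;
    }
    if (function_exists("iconv_substr")) {
        return iconv_substr($str, $start, $length, $charset) . $suffix;
    }
    // Fallback: each pattern matches one complete character in the given encoding.
    $re['utf-8']  = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
    $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
    $re['gbk']    = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
    // The second alternative was missing its opening bracket in the original pattern.
    $re['big5']   = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
    // Lowercase the charset so that "UTF-8" and "utf-8" select the same pattern.
    preg_match_all($re[strtolower($charset)], $str, $match);
    $slice = implode("", array_slice($match[0], $start, $length));
    return $slice . $suffix;
}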
