Comparing the PHP string truncation functions in discuz and ecshop

2020-05-19 04:24:49
OfStack

Below is the source code of the two functions, each with a simple test, followed by a more practical string truncation function. Note that the truncation issues discussed here all concern UTF-8 encoded Chinese strings.
discuz version

 
/**
 * [discuz] Truncates a string without relying on PHP extensions such as mb_substr;
 * each Chinese character counts as 2 characters.
 * @param $string The string to truncate
 * @param $length The number of characters to keep
 * @param $dot    The string that replaces the truncated tail
 * @return The truncated string
 */
function cutstr($string, $length, $dot = '...') {
    // If the string is no longer than the requested length, return it as is.
    // Using strlen to get the length has a major drawback: to truncate a string such as
    // "新年快乐" to 4 Chinese characters, you have to know how many bytes those 4 characters
    // occupy, otherwise the returned string may be "新年快乐...".
    if (strlen($string) <= $length) {
        return $string;
    }
    // Convert htmlspecialchars entities back to the original characters, wrapped in marker bytes
    $pre = chr(1);
    $end = chr(1);
    $string = str_replace(array('&amp;', '&quot;', '&lt;', '&gt;'), array($pre . '&' . $end, $pre . '"' . $end, $pre . '<' . $end, $pre . '>' . $end), $string);
    $strcut = ''; // Initialize the return value
    // If the encoding is utf-8 (this check is incomplete: the constant might be written "utf8")
    if (strtolower(CHARSET) == 'utf-8') {
        // Loop pointer $n, byte length of the last character $tn, number of characters counted $noc
        $n = $tn = $noc = 0;
        while ($n < strlen($string)) {
            $t = ord($string[$n]);
            if ($t == 9 || $t == 10 || (32 <= $t && $t <= 126)) {
                // Half-width ASCII character: advance $n by 1; the last character is 1 byte long
                $tn = 1;
                $n++;
                $noc++;
            } elseif (194 <= $t && $t <= 223) {
                // 2-byte character: advance $n by 2; the last character is 2 bytes long
                $tn = 2;
                $n += 2;
                $noc += 2;
            } elseif (224 <= $t && $t <= 239) {
                // 3-byte character (typically a Chinese character): advance $n by 3; the last character is 3 bytes long
                $tn = 3;
                $n += 3;
                $noc += 2;
            } elseif (240 <= $t && $t <= 247) {
                $tn = 4;
                $n += 4;
                $noc += 2;
            } elseif (248 <= $t && $t <= 251) {
                $tn = 5;
                $n += 5;
                $noc += 2;
            } elseif ($t == 252 || $t == 253) {
                $tn = 6;
                $n += 6;
                $noc += 2;
            } else {
                $n++;
            }
            // Stop looping once the requested count is reached or exceeded
            if ($noc >= $length) {
                break;
            }
        }
        // If the last character overshot the limit, drop it to make room for $dot
        if ($noc > $length) {
            $n -= $tn;
        }
        $strcut = substr($string, 0, $n);
    } else {
        // For non-UTF-8 encodings, a full-width character advances 2 bytes
        for ($i = 0; $i < $length; $i++) {
            $strcut .= ord($string[$i]) > 127 ? $string[$i] . $string[++$i] : $string[$i];
        }
    }
    // Convert the characters back to their htmlspecialchars entities
    $strcut = str_replace(array($pre . '&' . $end, $pre . '"' . $end, $pre . '<' . $end, $pre . '>' . $end), array('&amp;', '&quot;', '&lt;', '&gt;'), $strcut);
    $pos = strrpos($strcut, chr(1));
    if ($pos !== false) {
        $strcut = substr($strcut, 0, $pos);
    }
    return $strcut . $dot; // Append $dot to the truncated string and return it
}
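
The lead-byte ranges that cutstr tests can be isolated into a small helper, which may make the loop easier to follow. This is only a sketch: utf8_seq_len is a hypothetical name, it treats all of ASCII as single-byte (cutstr only counts tab, newline and the printable range), and it leaves out the 5- and 6-byte forms, which RFC 3629 no longer allows in UTF-8.

```php
<?php
// Hypothetical helper, not part of discuz: number of bytes in the UTF-8
// sequence whose first byte has the value $t (0-255).
function utf8_seq_len($t) {
    if ($t <= 127) {
        return 1; // ASCII
    }
    if (194 <= $t && $t <= 223) {
        return 2;
    }
    if (224 <= $t && $t <= 239) {
        return 3; // most Chinese characters
    }
    if (240 <= $t && $t <= 247) {
        return 4;
    }
    return 1; // continuation or invalid byte: step over it one byte at a time
}

echo utf8_seq_len(ord('a')), "\n";      // 1
echo utf8_seq_len(ord("白"[0])), "\n";  // 3 (this file is assumed to be saved as UTF-8)
```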

The biggest drawback of the discuz version is that it uses strlen to get the length of the original string (in bytes) and compares that with the requested truncation length. Because the number of bytes in a UTF-8 encoded character is not fixed, callers face a dilemma: to keep 4 Chinese characters, what length should they pass, 8 or 12? This cannot be known in advance, and it is precisely because of this problem that discuz's cutstr has a bug, as the test results below show:

 
$str1 = "欲穷千里目";
echo my_cutstr($str1, 10, "...")."\n"; // Output: 欲穷千里目... [This is a bug. What causes it?]
echo my_cutstr($str1, 15, "...")."\n"; // Output: 欲穷千里目
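
The gap between byte count and character count behind this result can be checked directly. The following is a minimal sketch that assumes the script file is saved as UTF-8; it counts characters with preg_match_all rather than mb_strlen so that it runs without the mbstring extension.

```php
<?php
$str = "欲穷千里目"; // 5 Chinese characters

// strlen counts bytes; each of these characters takes 3 bytes in UTF-8.
echo strlen($str), "\n"; // 15

// With the /u flag, "." matches one whole UTF-8 character (and /s lets it
// match newlines too), so the number of matches is the character count.
echo preg_match_all('/./us', $str), "\n"; // 5
```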

The bug arises because cutstr counts 1 Chinese character as 2 characters when truncating, so 5 Chinese characters are 10 characters, while the original string is 15 bytes long. cutstr therefore believes it has "successfully" truncated 10 characters out of a 15-character string and appends the tail. To fix the bug, it is enough to check whether the returned substring is the same as the original string and, if so, not append the tail.
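
That check can be sketched as a tiny helper applied to the truncated text before the tail is added. The name append_tail_if_cut is hypothetical and the helper is not part of discuz; it only illustrates the comparison described above.

```php
<?php
// Hypothetical helper, not part of discuz: append the tail only when the
// truncated text actually differs from the original string.
function append_tail_if_cut($original, $truncated, $dot = '...') {
    return $truncated === $original ? $truncated : $truncated . $dot;
}

$str = "欲穷千里目"; // 5 Chinese characters, 15 bytes in UTF-8

// Nothing was cut, so no tail is added.
echo append_tail_if_cut($str, $str), "\n";               // 欲穷千里目

// The first 6 bytes are the first 2 characters, so the tail is added.
echo append_tail_if_cut($str, substr($str, 0, 6)), "\n"; // 欲穷...
```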
ecshop version

 
/**
 * [ecshop] Truncates a string using PHP's mb_substr or iconv_substr extension;
 * each Chinese character counts as 1 character.
 * This function only applies to UTF-8 encoded Chinese strings.
 *
 * @param $str    The original string
 * @param $length The number of characters to keep
 * @param $append The string that replaces the truncated tail
 * @return The truncated string
 */
function sub_str($str, $length = 0, $append = '...') {
    $str = trim($str);
    $strlength = strlen($str);
    if ($length == 0 || $length >= $strlength) {
        return $str;
    } elseif ($length < 0) {
        $length = $strlength + $length;
        if ($length < 0) {
            $length = $strlength;
        }
    }
    if (function_exists('mb_substr')) {
        $newstr = mb_substr($str, 0, $length, 'utf-8');
    } elseif (function_exists('iconv_substr')) {
        $newstr = iconv_substr($str, 0, $length, 'utf-8');
    } else {
        //$newstr = trim_right(substr($str, 0, $length));
        $newstr = substr($str, 0, $length);
    }
    // Append the tail only if the string was actually shortened
    if ($append && $str != $newstr) {
        $newstr .= $append;
    }
    return $newstr;
}

The distinguishing feature of the ecshop version, and also its drawback, is that it counts a Chinese character as one character. If the original string contains no Chinese characters, for example abcd1234, and the intent is to keep 4 Chinese characters or 8 English characters, the ecshop version does not give the desired result: it returns abcd. Here are some simple test results:

 
$str1 = "白日依山尽,黄河入海流";
echo $str1."\n";
echo my_sub_str($str1, 4, "...")."\n"; // Output: 白日依山...
$str2 = "白1日2依3山4";
echo $str2."\n";
echo my_sub_str($str2, 4, "...")."\n"; // Output: 白1日2...
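
The effect of counting every character as one can be reproduced without ecshop. In the sketch below, first_chars is a hypothetical stand-in for mb_substr($str, 0, $n, 'utf-8'), built on preg_split so that it runs without the mbstring extension; it assumes valid UTF-8 input.

```php
<?php
// Hypothetical stand-in for mb_substr($str, 0, $n, 'utf-8'):
// keep the first $n characters, whatever their byte length.
function first_chars($str, $n) {
    // With the /u flag the empty pattern splits between whole UTF-8 characters.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_slice($chars, 0, $n));
}

echo first_chars("白日依山尽", 4), "\n"; // 白日依山
echo first_chars("abcd1234", 4), "\n";   // abcd
```

Both results contain 4 characters, but the English one is only half as long as intended when a Chinese character is meant to weigh the same as 2 English characters.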

Optimized version
Most use cases for truncating Chinese strings share one requirement: the original string may mix Chinese, English letters and digits; a Chinese character counts as 2 characters, and an English letter or digit counts as 1 character. The following is an implementation for this requirement:

 
/**
 * Truncates a string, counting each Chinese character as 2 characters;
 * supports both GBK and UTF-8 encoding.
 * @param $string The string to truncate
 * @param $length The number of characters to keep
 * @param $append The tail to add after the substring
 * @return The truncated string
 */
function substring($string, $length, $append = false) {
    if ($length <= 0) {
        return '';
    }
    // Keep the input so the result can be compared with it in its own encoding
    $original = $string;
    // Detect whether the original string is UTF-8 by round-tripping it through GBK
    $is_utf8 = false;
    $str1 = @iconv("UTF-8", "GBK", $string);
    $str2 = @iconv("GBK", "UTF-8", $str1);
    if ($string == $str2) {
        $is_utf8 = true;
        // If it is UTF-8, work on the GBK-encoded copy
        $string = $str1;
    }
    $newstr = '';
    // Never read past the end of the string
    $limit = min($length, strlen($string));
    for ($i = 0; $i < $limit; $i++) {
        // A byte above 127 starts a 2-byte GBK character, so copy both bytes
        $newstr .= ord($string[$i]) > 127 ? $string[$i] . $string[++$i] : $string[$i];
    }
    if ($is_utf8) {
        $newstr = @iconv("GBK", "UTF-8", $newstr);
    }
    // Append the tail only if something was actually cut off
    if ($append && $newstr != $original) {
        $newstr .= $append;
    }
    return $newstr;
}

The test results are as follows (the results are the same for GBK and UTF-8):

 
$str1 = "白日依山尽,黄河入海流";
echo substring($str1, 4, "...")."\n"; // Output: 白日...
echo substring($str1, 5, "...")."\n"; // Output: 白日依...
$str2 = "12白34日56依78山";
echo substring($str2, 4, "...")."\n"; // Output: 12白...
echo substring($str2, 5, "...")."\n"; // Output: 12白3...
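
For input that is known to be UTF-8, the same counting rule can also be sketched without the GBK round trip. The function below is an illustration only: width_substr is a hypothetical name, it assumes valid UTF-8, and it treats every multi-byte character as 2 characters wide, which is only an approximation for non-Chinese text. Like substring, it lets a wide character that starts below the limit run past it.

```php
<?php
// Hypothetical helper, not from discuz or ecshop: truncate a UTF-8 string by
// width, counting multi-byte characters as 2 and single-byte characters as 1.
function width_substr($str, $width, $append = '') {
    if ($width <= 0) {
        return '';
    }
    // Split into whole UTF-8 characters; the /u flag makes PCRE UTF-8 aware.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // not valid UTF-8: return the input untouched
    }
    $out = '';
    $used = 0;
    foreach ($chars as $ch) {
        if ($used >= $width) {
            break; // the requested width has been reached
        }
        $out .= $ch;
        $used += strlen($ch) > 1 ? 2 : 1;
    }
    // Append the tail only if something was actually cut off
    return ($append !== '' && $out !== $str) ? $out . $append : $out;
}

echo width_substr("白日依山尽,黄河入海流", 4, "..."), "\n"; // 白日...
echo width_substr("12白34日56依78山", 5, "..."), "\n";       // 12白3...
echo width_substr("欲穷千里目", 10, "..."), "\n";           // 欲穷千里目
```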

Author: edwardlost's blog