How to find the most frequent substring in a Chinese string

  • 2020-08-22 21:56:02
  • OfStack

The substring length is chosen by the caller (for example, 4 or 5 consecutive characters).


$str =' I am Chinese, I am a foreigner, I am Korean, I am American, I am Chinese, I am British, I am Chinese, I am a foreigner ';
Count_string($str,5);
function Count_string($sstr,$length)
{
 $cnt_tmp = 0;
 $cnt = 0;
 $str = '';
 $str_tmp = array();
 $str_arr = array();
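 // The internal encoding must match the encoding in which $sstr is stored
 // (GB2312 here; use "UTF-8" if the source is saved as UTF-8)
 mb_internal_encoding("gb2312");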
 $max_length = (mb_strlen($sstr)-$length);

 // Collect every substring of $length characters
 for($i=0;$i<=$max_length;$i++)
 {
  $str_tmp[] =  mb_substr($sstr, $i, $length);
 }
 // Remove duplicate substrings
 $str_tmp = array_unique($str_tmp);

 // Count how often each distinct substring occurs and track the maximum
 foreach($str_tmp as $key=>$value)
 {
  $cnt_tmp = mb_substr_count($sstr,$value);
  if($cnt_tmp>=$cnt) 
  {
   $cnt = $cnt_tmp;
   $str_arr[$value] = $cnt;   
  }
 }

 // Several substrings may tie for the maximum, so collect all of them
 foreach($str_arr as $key=>$value)
 {
  if($value == $cnt)
  {
   $str .= $key."<br>";
  }
 }
 echo 'Most frequent substring(s):<br>'.$str.'<br>Occurrences: '.$cnt;
}
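
The same idea can also be sketched as a single pass that tallies each window in an associative array, instead of rescanning the whole string for every candidate. This is only an illustrative variant, not the original author's method: the function name most_frequent_substrings, the UTF-8 default encoding, and the sample string are assumptions. It counts every starting position, so its counts may differ from the listing above when a substring overlaps itself, and it returns the result instead of echoing it.

```php
<?php
// Sketch: find the most frequent substring(s) of a fixed length in one pass.
// Hypothetical helper; the name and the UTF-8 default are assumptions.
function most_frequent_substrings(string $text, int $length, string $encoding = 'UTF-8'): array
{
    $total = mb_strlen($text, $encoding);
    if ($length < 1 || $length > $total) {
        return ['substrings' => [], 'count' => 0];
    }

    // Tally every window of $length characters (overlapping windows included).
    $counts = [];
    for ($i = 0; $i <= $total - $length; $i++) {
        $piece = mb_substr($text, $i, $length, $encoding);
        $counts[$piece] = ($counts[$piece] ?? 0) + 1;
    }

    $max = max($counts);

    // Keep every substring that ties for the maximum, in order of first appearance.
    // Numeric-looking keys are stored as integers by PHP, so cast them back to strings.
    $winners = array_map('strval', array_keys($counts, $max, true));

    return ['substrings' => $winners, 'count' => $max];
}

// Usage with an assumed UTF-8 sample string.
$result = most_frequent_substrings('我是中国人，我是外国人，我是中国人', 5);
echo implode("\n", $result['substrings']), "\n", $result['count'], "\n";
```

Passing the encoding to each mb_* call avoids relying on the global mb_internal_encoding() setting, which is one design choice among several.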

