Detailed explanation of automatic character set detection and transcoding in PHP

  • 2020-06-22 23:59:41
  • OfStack

The principle is simple: in GB2312/GBK a Chinese character occupies two bytes, each within a known range of values, while in UTF-8 a Chinese character occupies three bytes, each also within a known range. English (ASCII) characters are below 128 in either encoding and take only one byte (full-width forms excepted).
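To make the byte layout concrete, here is a small sketch that prints the bytes of the character "中" in both encodings. The strings are written as byte escapes so the result does not depend on how the script file itself is saved; `byteList` is a hypothetical helper introduced only for this illustration.

```php
<?php
// "中" (U+4E2D) written as raw bytes in each encoding.
$utf8 = "\xE4\xB8\xAD"; // UTF-8: three bytes
$gbk  = "\xD6\xD0";     // GB2312/GBK: two bytes

// Hypothetical helper: format the bytes of a string as upper-case hex pairs.
function byteList(string $s): string
{
    $out = [];
    for ($i = 0; $i < strlen($s); $i++) {
        $out[] = sprintf('%02X', ord($s[$i]));
    }
    return implode(' ', $out);
}

echo strlen($utf8), ' bytes: ', byteList($utf8), "\n"; // 3 bytes: E4 B8 AD
echo strlen($gbk), ' bytes: ', byteList($gbk), "\n";   // 2 bytes: D6 D0
echo strlen('A'), ' byte: ', byteList('A'), "\n";      // 1 byte: 41
```

Every byte of the Chinese character is 128 or above in both encodings, while the ASCII letter stays below 128. That difference is what the detection relies on.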
If you are checking the encoding of a file, you can also check directly for the UTF-8 BOM. Without further ado, here is the function, which detects the encoding of a string and transcodes it.

<?php
// Guess whether $string is UTF-8 or GB2312/GBK, then convert it to $outEncoding.
function safeEncoding($string, $outEncoding = 'UTF-8')
{
    $encoding = "UTF-8"; // default: pure ASCII is already valid UTF-8
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($string[$i]); // the $string{$i} syntax was removed in PHP 8
        if ($byte < 128) {
            continue; // a single-byte ASCII character tells us nothing
        }

        // 1110xxxx followed by two 10xxxxxx bytes: a three-byte UTF-8 character
        if (($byte & 0xF0) == 0xE0
            && $i + 2 < $length
            && (ord($string[$i + 1]) & 0xC0) == 0x80
            && (ord($string[$i + 2]) & 0xC0) == 0x80) {
            $encoding = "UTF-8";
            break;
        }

        // Otherwise, two consecutive high bytes are taken as a GB2312/GBK character
        if ($i + 1 < $length && (ord($string[$i + 1]) & 0x80) == 0x80) {
            $encoding = "GB2312";
            break;
        }
    }

    if (strtoupper($encoding) == strtoupper($outEncoding)) {
        return $string;
    }
    return iconv($encoding, $outEncoding, $string);
}
?>
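
For the file case mentioned earlier, the BOM check can be sketched as follows. The UTF-8 BOM is the three bytes EF BB BF at the very start of the data; `UTF8_BOM`, `hasUtf8Bom` and `stripUtf8Bom` are hypothetical names chosen for this example. A missing BOM does not prove that the file is not UTF-8, since many UTF-8 files are saved without one.

```php
<?php
// UTF-8 byte order mark: the bytes EF BB BF at the start of the data.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if $data starts with the UTF-8 BOM.
function hasUtf8Bom(string $data): bool
{
    return strncmp($data, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: return $data without a leading BOM, if there is one.
function stripUtf8Bom(string $data): string
{
    return hasUtf8Bom($data) ? substr($data, 3) : $data;
}

$withBom = UTF8_BOM . "hello";
var_dump(hasUtf8Bom($withBom));    // bool(true)
var_dump(hasUtf8Bom("hello"));     // bool(false)
echo stripUtf8Bom($withBom), "\n"; // hello
```

In practice `$data` would come from something like `file_get_contents()`; the literal string here only keeps the sketch self-contained.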

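The function above decides from the first byte pattern that matches, so it can be fooled by unusual input. As a possible alternative, and not the article's own method, the sketch below validates the whole string as UTF-8 and only falls back to a conversion when that fails. It assumes the non-UTF-8 input is GBK; `toUtf8` and its `$fallback` parameter are hypothetical names. It relies on `preg_match` with the `u` modifier, which does not match when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper: return $string as UTF-8, assuming that anything
// that is not valid UTF-8 is encoded in $fallback (GBK by default).
function toUtf8(string $string, string $fallback = 'GBK'): string
{
    // With the /u modifier, preg_match returns 1 for this empty pattern
    // only if the whole subject is well-formed UTF-8.
    if (preg_match('//u', $string) === 1) {
        return $string;
    }

    $converted = iconv($fallback, 'UTF-8', $string);
    if ($converted === false) {
        throw new RuntimeException("Input is neither UTF-8 nor $fallback");
    }
    return $converted;
}

$gbk = "\xD6\xD0\xCE\xC4";        // "中文" as GBK bytes
echo bin2hex(toUtf8($gbk)), "\n"; // e4b8ade69687
```

Checking the whole string costs a full pass over the input, but a single misleading character can no longer decide the result.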
