Determining whether a string is UTF-8 or GBK encoded in C++

  • 2020-06-01 10:23:14
  • OfStack

This article shows, by example, how to determine in C++ whether a string is encoded as UTF-8 or GBK. It is shared here for reference.

When handling external data, the encoding often differs from what the program expects, which can produce garbled text and even cause some programs to fail. Since UTF-8 is the encoding most systems use, it is important to be able to tell whether a piece of data is UTF-8.

Here is a function that checks whether a string is UTF-8:


bool is_str_utf8(const char* str)
{
  // Number of continuation bytes still expected for the current character.
  // The original UTF-8 design allowed 1 to 6 bytes per character (ASCII uses 1).
  // RFC 3629 later limited this to 4, but the longer forms are still accepted here.
  unsigned int nBytes = 0;
  for (unsigned int i = 0; str[i] != '\0'; ++i){
    unsigned char chr = static_cast<unsigned char>(str[i]);
    if (nBytes == 0) {
      // ASCII is a 7-bit code whose highest bit is 0 (0xxxxxxx).
      // Anything else must be the lead byte of a multi-byte character,
      // so work out how many bytes the character occupies.
      if (chr >= 0x80) {
        if (chr >= 0xFE){
          return false;  // 0xFE and 0xFF never appear in UTF-8
        }
        else if (chr >= 0xFC){
          nBytes = 6;
        }
        else if (chr >= 0xF8){
          nBytes = 5;
        }
        else if (chr >= 0xF0){
          nBytes = 4;
        }
        else if (chr >= 0xE0){
          nBytes = 3;
        }
        else if (chr >= 0xC0){
          nBytes = 2;
        }
        else{
          return false;  // 10xxxxxx cannot start a character
        }
        nBytes--;  // The lead byte itself has been consumed
      }
    }
    else{
      // Every non-lead byte of a multi-byte character must be 10xxxxxx
      if ((chr & 0xC0) != 0x80){
        return false;
      }
      // Count down to zero
      nBytes--;
    }
  }
  // The string ended in the middle of a character, which violates the UTF-8 encoding rules
  if (nBytes != 0) {
    return false;
  }
  // A string that is entirely ASCII is also valid UTF-8
  return true;
}
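
The function above checks only the lead-byte and continuation-byte structure. As a minimal sketch of a stricter check, assuming the 4-byte limit of RFC 3629 is wanted, the variant below also rejects overlong forms, surrogate code points, and values above U+10FFFF. The names is_valid_utf8_strict and run_utf8_examples are hypothetical and do not come from the original article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the article): strict UTF-8 check following RFC 3629.
bool is_valid_utf8_strict(const char* str)
{
  const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
  const std::size_t len = std::strlen(str);
  std::size_t i = 0;
  while (i < len) {
    const unsigned char c = s[i];
    if (c < 0x80) { ++i; continue; }  // ASCII, one byte

    std::size_t extra = 0;     // continuation bytes expected after the lead byte
    unsigned long min_cp = 0;  // smallest code point allowed for this length
    unsigned long cp = 0;      // code point being decoded
    if (c >= 0xC2 && c <= 0xDF)      { extra = 1; min_cp = 0x80;    cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { extra = 2; min_cp = 0x800;   cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { extra = 3; min_cp = 0x10000; cp = c & 0x07; }
    else return false;  // stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF

    if (len - i <= extra) return false;  // string ends inside the character
    for (std::size_t k = 1; k <= extra; ++k) {
      if ((s[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min_cp) return false;                    // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate range
    if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
    i += extra + 1;
  }
  return true;
}

// Hypothetical usage examples.
void run_utf8_examples()
{
  assert(is_valid_utf8_strict("plain ASCII"));
  assert(is_valid_utf8_strict("\xE4\xB8\xAD"));            // U+4E2D in 3 bytes
  assert(!is_valid_utf8_strict("\xC0\xAF"));               // overlong form of '/'
  assert(!is_valid_utf8_strict("\xED\xA0\x80"));           // surrogate U+D800
  assert(!is_valid_utf8_strict("\xF8\x88\x80\x80\x80"));   // obsolete 5-byte form
}
```

Whether the stricter rules are appropriate depends on the data source: the 5- and 6-byte forms accepted by the structural check do not occur in data produced by a conforming modern encoder.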

For a general introduction to UTF-8 and its binary format, see Baidu Baike. GBK can be checked with the same approach. The code is as follows:


bool is_str_gbk(const char* str)
{
  // Number of trail bytes still expected for the current character.
  // GBK uses 1 or 2 bytes per character: 2 for a Chinese character, 1 for ASCII.
  unsigned int nBytes = 0;
  for (unsigned int i = 0; str[i] != '\0'; ++i){
    unsigned char chr = static_cast<unsigned char>(str[i]);
    if (nBytes == 0) {
      if (chr >= 0x80) {
        // Not ASCII, so it must be a GBK lead byte (0x81-0xFE)
        if (chr >= 0x81 && chr <= 0xFE){
          nBytes = 2;
        }
        else{
          return false;
        }
        nBytes--;  // The lead byte itself has been consumed
      }
    }
    else{
      // A GBK trail byte lies in 0x40-0xFE, excluding 0x7F
      if (chr < 0x40 || chr > 0xFE || chr == 0x7F){
        return false;
      }
      nBytes--;
    }
  }
  if (nBytes != 0) {  // The string ended right after a lead byte, which violates the GBK rules
    return false;
  }
  // A string that is entirely ASCII is also valid GBK
  return true;
}

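The functions above follow the encoding rules correctly. However, a Chinese character in UTF-8 generally takes 3 bytes, and the byte ranges of UTF-8 and GBK overlap, so running is_str_gbk on a UTF-8 string leads to an awkward problem. For text made up only of Chinese characters, an odd number of characters gives the correct result: the byte count is odd, the last byte is left without a partner, and the string is rejected. With an even number of characters, the bytes pair up into sequences that are also valid GBK, and the two encodings cannot be told apart.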
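
One common way to reduce this ambiguity is to test for UTF-8 first and fall back to GBK only when that test fails. The sketch below illustrates the idea and reproduces the odd/even behaviour described above. It is a heuristic, not a guarantee, since some GBK byte sequences also happen to be valid UTF-8. The names Encoding, looks_like_utf8, looks_like_gbk, guess_encoding, and run_guess_examples are hypothetical and do not come from the original article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum class Encoding { Ascii, Utf8, Gbk, Unknown };

// Structural UTF-8 check: a lead byte followed by the right number of 10xxxxxx bytes.
static bool looks_like_utf8(const unsigned char* s, std::size_t n)
{
  for (std::size_t i = 0; i < n; ) {
    std::size_t extra = 0;  // continuation bytes expected
    if (s[i] < 0x80) extra = 0;
    else if (s[i] >= 0xC2 && s[i] <= 0xDF) extra = 1;
    else if (s[i] >= 0xE0 && s[i] <= 0xEF) extra = 2;
    else if (s[i] >= 0xF0 && s[i] <= 0xF4) extra = 3;
    else return false;
    if (n - i <= extra) return false;  // truncated character
    for (std::size_t k = 1; k <= extra; ++k) {
      if ((s[i + k] & 0xC0) != 0x80) return false;
    }
    i += extra + 1;
  }
  return true;
}

// Structural GBK check: lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F.
static bool looks_like_gbk(const unsigned char* s, std::size_t n)
{
  for (std::size_t i = 0; i < n; ) {
    if (s[i] < 0x80) { ++i; continue; }             // ASCII
    if (s[i] < 0x81 || s[i] > 0xFE) return false;   // invalid lead byte
    if (i + 1 >= n) return false;                   // lead byte without a trail byte
    const unsigned char t = s[i + 1];
    if (t < 0x40 || t > 0xFE || t == 0x7F) return false;
    i += 2;
  }
  return true;
}

// Heuristic: pure ASCII first, then UTF-8, then GBK.
Encoding guess_encoding(const char* str)
{
  const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
  const std::size_t n = std::strlen(str);
  bool allAscii = true;
  for (std::size_t i = 0; i < n; ++i) {
    if (s[i] >= 0x80) { allAscii = false; break; }
  }
  if (allAscii) return Encoding::Ascii;
  if (looks_like_utf8(s, n)) return Encoding::Utf8;
  if (looks_like_gbk(s, n)) return Encoding::Gbk;
  return Encoding::Unknown;
}

// Hypothetical usage examples.
void run_guess_examples()
{
  const char* utf8_two = "\xE4\xB8\xAD\xE6\x96\x87";  // two Chinese characters in UTF-8
  const char* utf8_one = "\xE4\xB8\xAD";              // one Chinese character in UTF-8
  const char* gbk_two  = "\xD6\xD0\xCE\xC4";          // the same two characters in GBK

  // Even count: the UTF-8 bytes also pass the GBK check, so the order of the tests matters.
  assert(looks_like_gbk(reinterpret_cast<const unsigned char*>(utf8_two), 6));
  // Odd count: the GBK check rejects the string on its own.
  assert(!looks_like_gbk(reinterpret_cast<const unsigned char*>(utf8_one), 3));

  assert(guess_encoding(utf8_two) == Encoding::Utf8);
  assert(guess_encoding(gbk_two) == Encoding::Gbk);
  assert(guess_encoding("hello") == Encoding::Ascii);
}
```

Testing UTF-8 first works in many practical cases because the continuation-byte rule is restrictive, so typical GBK text fails it quickly. Short strings remain the weakest case for any check of this kind.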

Finally, if anyone knows a better way to determine whether a string is GBK encoded, please let me know.

I hope this article is helpful for your C++ programming.

