Determining in C whether a char* string is UTF-8 encoded

  • 2020-05-24 05:55:09
  • OfStack

This article shows how to determine in C whether a char* buffer holds UTF-8 encoded text.

I modified the function slightly so that a pure ASCII string also returns true, because UTF-8 is backward compatible with ASCII.
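
To see why ASCII passes, note that UTF-8 encodes every ASCII character as a single byte of the form 0xxx xxxx, while the first byte of a longer sequence announces its length with leading 1 bits. The following is a minimal sketch of that rule only. The helper name utf8_lead_length and the demo function are assumptions made for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not from the original article): returns the sequence
// length implied by the bit pattern of `lead`, or 0 if the pattern is not a
// lead byte. It does not check for overlong or out-of-range values.
size_t utf8_lead_length(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1; // 0xxx xxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110x xxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110 xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 1111 0xxx
  return 0;                            // continuation or invalid byte
}

// Usage sketch: an ASCII byte is already a complete one-byte sequence.
void utf8_lead_length_demo(void) {
  assert(utf8_lead_length('A') == 1);
  assert(utf8_lead_length(0xE4) == 3); // first byte of U+4E2D
  assert(utf8_lead_length(0x80) == 0); // continuation byte, not a lead
}
```

Because every ASCII byte has its high bit clear, a validator never has to look for continuation bytes after it, which is why pure ASCII input is always accepted.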

Example code:


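#include <stddef.h>

// Returns 1 if str[0..length) is structurally valid UTF-8, 0 otherwise.
// Pure ASCII input also returns 1. Only byte patterns are checked: legacy
// 5- and 6-byte forms are accepted and overlong encodings are not rejected.
int utf8_check(const char* str, size_t length) {
  size_t i = 0;
  int nBytes = 0; // Continuation bytes still expected
  unsigned char chr;

  while (i < length) {
    chr = (unsigned char)str[i];

    if (nBytes == 0) { // Expecting the first byte of a character
      if ((chr & 0x80) != 0) { // High bit set: not ASCII
        // Count the leading 1 bits to get the sequence length
        while ((chr & 0x80) != 0) {
          chr <<= 1;
          nBytes++;
        }
        if ((nBytes < 2) || (nBytes > 6)) {
          return 0; // A lead byte must be 110x xxxx through 1111 110x
        }
        nBytes--; // The lead byte itself has been consumed
      }
    } else { // Expecting a continuation byte
      if ((chr & 0xC0) != 0x80) {
        return 0; // Continuation bytes must have the form 10xx xxxx
      }
      nBytes--;
    }
    i++;
  }
  return (nBytes == 0); // 0 if the last sequence was truncated
}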
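
The function above only checks the bit patterns of the bytes. As a hedged sketch of a stricter variant, assuming the rules of RFC 3629 (at most 4 bytes per character, no overlong forms, no UTF-16 surrogates, nothing above U+10FFFF), something like the following could be used. The names utf8_check_strict and utf8_check_strict_demo are hypothetical and do not come from the original article.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stricter check (an assumption, not the article's code).
// Returns 1 if str[0..length) is valid UTF-8 under RFC 3629, 0 otherwise.
int utf8_check_strict(const char *str, size_t length) {
  const unsigned char *s = (const unsigned char *)str;
  size_t i = 0;

  while (i < length) {
    unsigned char lead = s[i];
    unsigned long cp;  // decoded code point
    unsigned long min; // smallest code point allowed for this length
    size_t extra;      // number of continuation bytes expected
    size_t k;

    if (lead < 0x80) { // 0xxx xxxx: ASCII
      i++;
      continue;
    } else if ((lead & 0xE0) == 0xC0) { // 110x xxxx
      extra = 1; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) { // 1110 xxxx
      extra = 2; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) { // 1111 0xxx
      extra = 3; cp = lead & 0x07; min = 0x10000;
    } else {
      return 0; // stray continuation byte or invalid lead byte
    }

    if (extra > length - i - 1) {
      return 0; // sequence is truncated
    }
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) {
        return 0; // continuation bytes must be 10xx xxxx
      }
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return 0; // overlong, out of range, or surrogate
    }
    i += extra + 1;
  }
  return 1;
}

// Usage sketch: the expectations are checked with assert.
void utf8_check_strict_demo(void) {
  assert(utf8_check_strict("abc", 3) == 1);          // pure ASCII
  assert(utf8_check_strict("\xE4\xB8\xAD", 3) == 1); // U+4E2D
  assert(utf8_check_strict("\xC0\x80", 2) == 0);     // overlong NUL
}
```

Decoding the code point costs a little more work than counting bits, but it is what makes it possible to reject sequences that have the right shape and still encode a forbidden value.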

Thank you for reading. I hope this helps, and thank you for supporting this site!

