Truncating strings that contain Chinese characters in C++

2020-04-01 21:40:17
OfStack


const char *str = "test test test";
int count = 0; //Number of Chinese characters found
while(*str)
{
//For a valid GBK string it is enough to check that the first byte is greater than 0x80:
//such a byte is always a lead byte and combines with the next byte to form one Chinese character,
//so the next byte does not need to be checked.
//Compare as unsigned char: where plain char is signed, *str > 0x80 is never true.
if((unsigned char)*str > 0x80 && *(str + 1))
{
//Chinese character: count it and skip both bytes
count++;
str += 2;
}
else
{
str++;
}
}
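
The loop above can be wrapped into a small function. This is only a sketch: the name count_gbk_double_byte_chars is made up for illustration, it assumes the input is a valid GBK byte string, and it uses the same "first byte greater than 0x80" test as the snippet above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): counts the double-byte
// characters in a byte string that is assumed to be valid GBK.
std::size_t count_gbk_double_byte_chars(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        // Read the byte as unsigned so values above 0x7F compare correctly.
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b > 0x80) {
            // A lead byte with nothing after it is malformed input: stop.
            if (i + 1 >= bytes.size()) break;
            ++count;
            i += 2;  // skip the lead byte and its trail byte
        } else {
            ++i;     // single-byte (ASCII) character
        }
    }
    return count;
}
```

Taking a std::string instead of a raw pointer lets the function check the length before it skips two bytes, so a truncated string cannot push the index past the end.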

The Java function below converts a garbled Chinese string back to GB2312.


/** 
* getBytes(encoding) returns the string as a byte array in the given encoding. 
* A byte value of 63 ('?') in b[0] indicates a transcoding error. 
* A. A Chinese string that is not garbled: 
*    1. encoded with GB2312, every byte is negative; 
*    2. encoded with ISO8859_1, every b[i] is 63. 
* B. A garbled Chinese string: 
*    1. encoded with ISO8859_1, every byte is also negative; 
*    2. encoded with GB2312, most b[i] are 63. 
* C. An English string: 
*    1. encoded with ISO8859_1 or GB2312, every byte is greater than 0. 
* Summary: given a string, call getBytes("iso8859_1"): 
*    1. if some b[i] is 63, no transcoding is needed (case A-2); 
*    2. if every b[i] is greater than 0, it is an English string and no transcoding is needed (case C-1); 
*    3. if some b[i] is less than 0, the string is already garbled and must be transcoded (case B-1). 
*/ 
private static String toGb2312(String str) { 
if (str == null) return null; 
String retStr = str; 
byte b[]; 
try { 
b = str.getBytes("ISO8859_1"); 
for (int i = 0; i < b.length; i++) { 
byte b1 = b[i]; 
if (b1 == 63) 
break; //1 
else if (b1 > 0) 
continue;//2 
else if (b1 < 0) { //Cannot be 0,0 is the end of a string
retStr = new String(b, "GB2312"); 
break; 
} 
} 
} catch (UnsupportedEncodingException e) { 
// e.printStackTrace(); 
} 
return retStr; 
}


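const unsigned char *str = (const unsigned char *)"test test test";
int length;
int i;
int count = 0; //Number of Chinese characters found

length = (int)strlen((const char *)str);
for (i = 0; i < length - 1; i++)
{
//Index with i; testing *str on every pass would re-check the first byte each time
if ( str[i] >= 0x81 && str[i] <= 0xFE
&& str[i + 1] >= 0x40 && str[i + 1] <= 0xFE)
{
//Chinese character: count it and skip its second byte
count++;
i++;
}
}

//Try replacing the string with "汉A": count is 1. A loop that tests *str on every pass without advancing gives 2 instead.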

Someone said: "A GBK character takes up two chars (two bytes), and the value of the first byte is less than 0. You can use this to judge whether it is a Chinese character."
1. Why is the value of the first byte less than 0?
2. Is it safe to conclude that if the first byte is less than 0, that byte and the next byte form a Chinese character?
3. Someone said that the GBK code of a Chinese character has a high byte and a low byte, and that the low byte comes first. Is that right? Is it safer to require the first byte to be in 160-254 and the second in 64-254 than to use the method in 2?
4. If the character set in the DB is SIMPLIFIED CHINESE_CHINA.ZHS16GBK, is this the GBK character set? GBK is compatible with GB2312.
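
Question 1 comes down to the signedness of char. The sketch below uses made-up helper names (as_signed, as_unsigned, is_gbk_lead_byte) to show the same byte read both ways. It assumes the usual 8-bit two's-complement char and takes 0x81-0xFE as the GBK lead-byte range.

```cpp
// Hypothetical helpers (not from the original post) that read one byte
// in two different ways.

// The byte as a signed value: 0xBA becomes -70 on two's-complement machines.
int as_signed(char c) {
    return static_cast<signed char>(c);
}

// The same byte as an unsigned value: 0xBA stays 186.
int as_unsigned(char c) {
    return static_cast<unsigned char>(c);
}

// Lead-byte test that does not depend on whether plain char is signed.
bool is_gbk_lead_byte(char c) {
    const unsigned char b = static_cast<unsigned char>(c);
    return b >= 0x81 && b <= 0xFE;
}
```

Every GBK lead byte has its top bit set, so it reads as a negative number when plain char is signed. Where plain char is unsigned, the "less than 0" test never fires, which is why the explicit range test is the more portable choice.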

It seems that in some character sets, some characters are three bytes long.

On "if the first byte is less than 0, that byte and the next byte form a Chinese character":

//GBK Chinese code range (first byte, then second byte, in hex)
// 81-A0, 40-7E 80-FE
// AA-AF, 40-7E 80-A0
// B0-D6, 40-7E 80-FE
// D7, 40-7E 80-F9
// D8-F7, 40-7E 80-FE
// F8-FE, 40-7E 80-A0
For example, "81-A0, 40-7E 80-FE" means the first byte is in 129-160 and the second byte is in 64-126 or 128-254.
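
The range table above can be checked directly in code. The following is a sketch only: GbkRow and in_gbk_hanzi_table are invented names, and the rows are copied from the table above, not from an official GBK specification.

```cpp
// Hypothetical table-driven check (names are illustrative only).
// Each row: lead-byte range, plus the upper bound of the second
// trail-byte interval. The first trail interval is always 0x40-0x7E
// and the second always starts at 0x80.
struct GbkRow {
    unsigned char lead_lo;
    unsigned char lead_hi;
    unsigned char trail_hi;
};

bool in_gbk_hanzi_table(unsigned char lead, unsigned char trail) {
    static const GbkRow rows[] = {
        {0x81, 0xA0, 0xFE},
        {0xAA, 0xAF, 0xA0},
        {0xB0, 0xD6, 0xFE},
        {0xD7, 0xD7, 0xF9},
        {0xD8, 0xF7, 0xFE},
        {0xF8, 0xFE, 0xA0},
    };
    for (const GbkRow& row : rows) {
        if (lead < row.lead_lo || lead > row.lead_hi) continue;
        const bool low_part = trail >= 0x40 && trail <= 0x7E;
        const bool high_part = trail >= 0x80 && trail <= row.trail_hi;
        return low_part || high_part;
    }
    return false;  // lead byte is outside every row
}
```

This is stricter than testing only the sign of the first byte. It rejects 0x7F as a trail byte and rejects lead bytes 0xA1-0xA9, which the table does not list as Chinese characters.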

4.
At work I needed to truncate a string for display on screen. Because the string contains Chinese characters, cutting it in the wrong place produces garbled text, so I wrote the following function.

It passes testing on uClinux and VC6.0.



 /* Truncate a string 
 name : the string to truncate; on return it holds the remainder 
 store: receives the truncated prefix 
 len: the maximum number of bytes to keep 
 */ 
 void split_name( char * name , char * store , int len ) 
 { 
     int i= 0 ; 
     char strTemp[L(NAMEL)]={0}; 
     if ( (int)strlen(name) <= len ) //Assumed condition: the source line breaks off after strlen(name)
     { 
         strcpy( store, name );  *name=0; 
         return ; 
     } 
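     //Start at the first byte and step over whole characters
     while( i < len ) 
     { 
         //Two consecutive bytes with the high bit set are treated as one Chinese character
         if ( name[i]>>7&1 && name[i+1]>>7&1 )       //if ( name[i] < 0 && name[i+1] < 0 ) 
             i = i + 2 ; 
         else 
             i = i + 1 ; 
     } 
     //If the last character crossed len, back up to the end of the previous character
     i = i > len ? i-3 :i-1; 
     strncpy( store , name , i+1 ); //Copy the first i+1 bytes
     *(store+i+1)=0; 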
     strcpy( strTemp , name + i + 1 ); 
     strcpy( name , strTemp ); 
 }
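
For comparison, here is a minimal C++ sketch of the same idea using std::string. The name truncate_gbk is hypothetical. It assumes GBK input and treats any byte in 0x81-0xFE as a lead byte, so it also handles trail bytes in 0x40-0x7E, which the high-bit test on name[i+1] above would miss. Unlike split_name, it only returns the prefix and leaves the input unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): returns the longest
// prefix of text that is at most max_bytes long and does not cut a
// double-byte GBK character in half.
std::string truncate_gbk(const std::string& text, std::size_t max_bytes) {
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        // A byte in 0x81-0xFE is assumed to start a two-byte character.
        const std::size_t step = (b >= 0x81 && b <= 0xFE) ? 2 : 1;
        // Stop before a character that would cross the limit or that is
        // cut off by the end of the input.
        if (i + step > max_bytes || i + step > text.size()) break;
        i += step;
    }
    return text.substr(0, i);
}
```

For the four bytes 'A', 0xBA, 0xBA, 'B', a limit of 2 keeps only "A", because the two-byte character would not fit, while a limit of 3 keeps the first three bytes. The remainder, if needed, is text.substr(result.size()).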