Matching Chinese characters with PHP regular expressions: problem analysis and summary

  • 2020-05-16 06:27:32
  • OfStack

 
$str = '中华人民共和国123456789abcdefg'; // "People's Republic of China" followed by ASCII
echo preg_match("/^[\u4e00-\u9fa5_a-zA-Z0-9]{3,15}$/", $str);

Run the code above and see what it reports.

Warning: preg_match(): Compilation failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 3 in F:wwwrootphptest.php on line 2
It turns out that the PCRE build used here does not support the following Perl escape sequences in PHP regular expressions: \L, \l, \N, \P, \p, \U, \u, or \X.

In UTF-8 mode, the form "\x{...}" is allowed; the contents of the curly braces are a string of hexadecimal digits.

The original hexadecimal escape sequence \xhh matches a two-byte UTF-8 character if its value is greater than 127.
So you can write it this way:
 
preg_match("/^[\x80-\xff_a-zA-Z0-9]{3,15}$/", $str);
preg_match('/[\x{2460}-\x{2468}]/u', $str);
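As a rough illustration of the difference between the two forms, the sketch below runs both against sample input. The variable names and sample strings are assumptions made for this example, and the "\u{...}" string escapes assume PHP 7.0 or later. Without the /u modifier each class member is a single byte, so the {3,15} quantifier counts bytes rather than characters.

```php
<?php
// Sample UTF-8 input built with PHP 7+ "\u{...}" escapes (illustrative values).
$short = "\u{4E2D}\u{6587}abc_123";   // 2 Chinese characters + 7 ASCII = 13 bytes
$long  = str_repeat("\u{4E2D}", 6);   // 6 Chinese characters = 18 bytes

// Byte-oriented form (no /u): \x80-\xff matches one byte of a multi-byte
// UTF-8 character, so the 3..15 limit applies to bytes.
$bytePattern = '/^[\x80-\xff_a-zA-Z0-9]{3,15}$/';
$shortOk = preg_match($bytePattern, $short);
$longOk  = preg_match($bytePattern, $long);

// UTF-8 form (/u): \x{...} names a code point, so U+2460 is one character.
$circled = preg_match('/[\x{2460}-\x{2468}]/u', "\u{2460}");

var_dump($shortOk, $longOk, $circled);
```

The six-character string is rejected by the byte-oriented pattern even though it has fewer than 15 characters, which is worth keeping in mind when the limit is meant to be a character count.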


Matching Chinese characters by internal code
Following the test method provided there, the code is as follows:

 
$str = "php programming ";
if (preg_match("/^[\x{2460}-\x{2468}]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}


I found that this still judged Chinese text incorrectly: the range \x{2460}-\x{2468} covers the circled digits ① to ⑨, not Chinese characters. And since \x denotes hexadecimal data, why is it different from the 4e00-9fa5 range commonly given for JavaScript (\u4e00-\u9fa5)? So I changed it to the following code:

 
$str = "php programming ";
if (preg_match("/^[\x4e00-\x9fa5]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}


I expected this to work, but to my surprise another warning appeared:
Warning: preg_match() [function.preg-match]: Compilation failed: invalid UTF-8 string at offset 6 in test.php on line 3

Apparently the notation was wrong again. In a double-quoted PHP string, \x4e and \x9f are expanded by PHP itself before PCRE sees the pattern, which leaves a stray byte 0x9F that is not valid UTF-8. Comparing with the notation in the first article, I wrapped "4e00" and "9fa5" in "{" and "}". I ran it again, and this time the result was accurate:

 
$str = "php programming ";
if (preg_match("/^[\x{4e00}-\x{9fa5}]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}
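One way to see why the unbraced form triggered "invalid UTF-8 string at offset 6" is to look at the bytes PHP actually hands to PCRE. The sketch below is only a diagnostic aid; the variable names are made up for this example. In a double-quoted string PHP expands \x followed by one or two hex digits, so \x4e becomes "N" and \x9f becomes the raw byte 0x9F, while \x{ is not a PHP string escape and passes through unchanged.

```php
<?php
// Unbraced form in double quotes: PHP rewrites \x4e and \x9f itself.
$unbraced = "/^[\x4e00-\x9fa5]+$/u";

// Dump the first 7 bytes after the opening delimiter: ^ [ N 0 0 - and 0x9F.
// The 0x9F byte sits at pattern offset 6, the offset named in the warning.
$dump = bin2hex(substr($unbraced, 1, 7));
echo $dump, "\n";

// Braced form: the backslashes survive, so PCRE receives \x{4e00}-\x{9fa5}.
$braced = "/^[\x{4e00}-\x{9fa5}]+$/u";
echo $braced, "\n";
```

Writing patterns in single quotes avoids this class of surprise, since PHP then leaves every \x sequence for PCRE to interpret.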


So the final correct expression for matching Chinese characters in UTF-8 encoding with PHP regular expressions is: /^[\x{4e00}-\x{9fa5}]+$/u
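The final expression can be wrapped in a small helper, sketched below. The function name isAllChinese is hypothetical, the input is assumed to be UTF-8, and the "\u{...}" escapes in the sample calls assume PHP 7.0 or later.

```php
<?php
/**
 * Hypothetical helper: returns true when $s is non-empty and every
 * character lies in U+4E00..U+9FA5. Assumes $s is UTF-8 encoded.
 */
function isAllChinese(string $s): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error
    // (for example, when $s is not valid UTF-8).
    return preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $s) === 1;
}

var_dump(isAllChinese("\u{4E2D}\u{6587}"));     // the two characters 中文
var_dump(isAllChinese("php\u{7F16}\u{7A0B}"));  // "php" followed by 编程
```

Note that $ also matches just before a trailing newline; if that matters for the input being validated, adding the D modifier or using \z instead of $ is one option.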

Finally:

 
//if (preg_match("/^[".chr(0xa1)."-".chr(0xff)."]+$/", $str)) { // Use only with GB2312
if (preg_match("/^[\x7f-\xff]+$/", $str)) { // Compatible with GB2312 and UTF-8
    echo "Input is correct";
} else {
    echo "Input is incorrect";
}
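Because the pattern above works on raw bytes, it accepts any string made up only of bytes from 0x7F to 0xFF, not just Chinese text. The sketch below, with sample strings chosen purely for illustration, shows this on UTF-8 input (the "\u{...}" escapes assume PHP 7.0 or later).

```php
<?php
// Byte-level pattern; without /u each class member is a single byte.
$pattern = '/^[\x7f-\xff]+$/';

$samples = [
    'chinese'  => "\u{4E2D}\u{6587}", // UTF-8 bytes e4 b8 ad e6 96 87
    'accented' => "\u{00E9}",         // UTF-8 bytes c3 a9, not a Chinese character
    'ascii'    => 'abc',              // plain ASCII bytes below 0x7F
];

$results = [];
foreach ($samples as $label => $text) {
    $results[$label] = preg_match($pattern, $text);
    printf("%s: %d\n", $label, $results[$label]);
}
```

The accented Latin letter passes as well, so this form is best read as a "no ASCII characters" check rather than a strict test for Chinese.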


Double-byte character encoding ranges

1. GBK (GB2312/GB18030)
\x00-\xff GBK double-byte encoding range
\x20-\x7f ASCII
\xa1-\xff Chinese in GB2312
\x80-\xff Chinese in GBK

2. UTF-8 (Unicode)

\u4e00-\u9fa5 (Chinese)
\x3130-\x318F (Korean)
\xAC00-\xD7A3 (Korean)
\u0800-\u4e00 (Japanese)
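In PHP these Unicode ranges have to be written in the \x{...} form together with the /u modifier. The sketch below does that for the Chinese and Korean ranges from the list; the constant name, array keys, and function name are assumptions made for this example, and the input is assumed to be UTF-8.

```php
<?php
// Ranges taken from the list above, rewritten in PCRE's \x{...} notation.
// SCRIPT_PATTERNS and detectScripts are illustrative names only.
const SCRIPT_PATTERNS = [
    'chinese' => '/[\x{4e00}-\x{9fa5}]/u',
    'korean'  => '/[\x{3130}-\x{318f}\x{ac00}-\x{d7a3}]/u',
];

/** Returns the labels whose range occurs at least once in $utf8. */
function detectScripts(string $utf8): array
{
    $found = [];
    foreach (SCRIPT_PATTERNS as $label => $pattern) {
        if (preg_match($pattern, $utf8) === 1) {
            $found[] = $label;
        }
    }
    return $found;
}

// Two Chinese characters, a space, and two Hangul syllables.
print_r(detectScripts("\u{4E2D}\u{6587} \u{D55C}\u{AE00}"));
```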
