Matching Chinese characters with PHP regular expressions: problem analysis and summary
- 2020-05-16 06:27:32
- OfStack
$str = '中华人民共和国123456789abcdefg';
echo preg_match("/^[\u4e00-\u9fa5_a-zA-Z0-9]{3,15}$/", $str);
Run the code above and see what it reports.
Warning: preg_match(): Compilation failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 3 in F:wwwrootphptest.php on line 2
It turns out that the following Perl escape sequences are not supported in PHP regular expressions: \L, \l, \N, \P, \p, \U, \u, and \X.
In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces are a string of hexadecimal digits giving the character's code number.
The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.
So you can do it this way:
preg_match("/^[\x80-\xff_a-zA-Z0-9]{3,15}$/", $str);
Or, to match Chinese characters by their internal code:
preg_match('/[\x{2460}-\x{2468}]/u', $str);
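As a rough sketch of how the braced escape behaves in UTF-8 mode, the snippet below tests a string against a range of code points. The helper name inCodePointRange is hypothetical, and the "\u{...}" string escapes assume PHP 7 or later. Note that U+2460 to U+2468 are the circled digits ① to ⑨ rather than Chinese characters, which is worth keeping in mind for the test that follows.

```php
<?php
// Hypothetical helper: true if every character of $s lies between the
// code points $lo and $hi (hex strings), using \x{...} with the /u modifier.
function inCodePointRange(string $s, string $lo, string $hi): bool
{
    // Single quotes keep the backslashes intact so PCRE sees \x{...}.
    $pattern = '/^[\x{' . $lo . '}-\x{' . $hi . '}]+$/u';
    return preg_match($pattern, $s) === 1;
}

var_dump(inCodePointRange("\u{2460}", '2460', '2468')); // bool(true): circled digit one
var_dump(inCodePointRange("\u{4e2d}", '2460', '2468')); // bool(false): the character 中
var_dump(inCodePointRange("\u{4e2d}", '4e00', '9fa5')); // bool(true)
```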
Following the test method provided with it, the code is as follows:
$str = "php programming";
if (preg_match("/^[\x{2460}-\x{2468}]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}
I found that this still did not judge correctly whether a string was Chinese. But since \x denotes hexadecimal data, why does the range differ from the \x4e00-\x9fa5 range given for JavaScript? So I changed it to the following code:
$str = "php programming";
if (preg_match("/^[\x4e00-\x9fa5]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}
I expected this to work, but to my surprise a warning appeared again:
Warning: preg_match() [function.preg-match]: Compilation failed: invalid UTF-8 string at offset 6 in test.php on line 3
The expression was evidently written the wrong way again, so I compared it with the expression in the first article and wrapped "4e00" and "9fa5" in "{" and "}". I ran it once more and found that it was now accurate:
$str = "php programming";
if (preg_match("/^[\x{4e00}-\x{9fa5}]+$/u", $str)) {
    print("The string is all Chinese");
} else {
    print("The string is not all Chinese");
}
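One way to see why the unbraced form failed is to inspect the bytes that PHP hands to PCRE. The sketch below assumes the pattern is written in a double-quoted string, as in the snippets above. There, PHP's own \x escape consumes at most two hex digits, so "\x9fa5" becomes the lone byte 0x9F followed by the text "a5". A lone 0x9F is not valid UTF-8, which is presumably what the "invalid UTF-8 string at offset 6" warning refers to.

```php
<?php
// In a double-quoted PHP string, \x takes at most two hex digits, so
// "\x4e00" is the byte 0x4E ("N") followed by the literal text "00".
$unbraced = "\x4e00-\x9fa5";

// "\x{" is not a PHP string escape, so the braced form reaches PCRE
// unchanged and PCRE interprets it as a code point under /u.
$braced = "\x{4e00}-\x{9fa5}";

echo bin2hex($unbraced), "\n"; // 4e30302d9f6135
echo $braced, "\n";            // \x{4e00}-\x{9fa5}
```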
So the final correct expression for matching Chinese characters in UTF-8 encoding with PHP regular expressions is /^[\x{4e00}-\x{9fa5}]+$/u.
Finally, a byte-based check:
// if (preg_match("/^[" . chr(0xa1) . "-" . chr(0xff) . "]+$/", $str)) { // Works only under GB2312
if (preg_match('/^[\x7f-\xff]+$/', $str)) { // Compatible with GB2312 and UTF-8
    echo "Valid input";
} else {
    echo "Invalid input";
}
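The byte-range check above accepts any string made up entirely of bytes from 0x7F to 0xFF, so it is looser than the code point range. The sketch below illustrates the difference with Japanese kana encoded as UTF-8. The helper name hasOnlyHighBytes is hypothetical, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: the byte-based check, without the /u modifier.
function hasOnlyHighBytes(string $s): bool
{
    return preg_match('/^[\x7f-\xff]+$/', $s) === 1;
}

$chinese = "\u{4e2d}\u{6587}"; // 中文
$kana    = "\u{304b}\u{306a}"; // かな

var_dump(hasOnlyHighBytes($chinese)); // bool(true)
var_dump(hasOnlyHighBytes($kana));    // bool(true): non-Chinese text also passes
var_dump(hasOnlyHighBytes("abc"));    // bool(false)

// The code point range under /u rejects the kana.
var_dump(preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $kana)); // int(0)
```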
Double-byte character encoding ranges
1. GBK (GB2312/GB18030)
\x00-\xff  GBK double-byte encoding range
\x20-\x7f  ASCII
\xa1-\xff  Chinese in GB2312
\x80-\xff  Chinese in GBK
2. UTF-8 (Unicode)
\u4e00-\u9fa5  (Chinese)
\x3130-\x318F  (Korean)
\xAC00-\xD7A3  (Korean)
\u0800-\u4e00  (Japanese)
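To use the Unicode ranges above in PHP, they would need to be written in the braced \x{...} form with the /u modifier. The following is a minimal sketch under that assumption. The constant SCRIPT_RANGES and the function matchesScript are hypothetical names, and the ranges are copied from the list as given. The range listed for Japanese is very broad and overlaps the first Korean range, so it should be read as a rough filter only.

```php
<?php
// The ranges listed above, rewritten as \x{...} escapes for use with /u.
const SCRIPT_RANGES = [
    'chinese'     => '\x{4e00}-\x{9fa5}',
    'hangul_jamo' => '\x{3130}-\x{318f}',
    'hangul'      => '\x{ac00}-\x{d7a3}',
    'japanese'    => '\x{0800}-\x{4e00}',
];

// Hypothetical helper: true if every character of $s falls in the
// range registered under $script.
function matchesScript(string $s, string $script): bool
{
    $pattern = '/^[' . SCRIPT_RANGES[$script] . ']+$/u';
    return preg_match($pattern, $s) === 1;
}

var_dump(matchesScript("\u{d55c}\u{ae00}", 'hangul'));   // bool(true):  한글
var_dump(matchesScript("\u{d55c}\u{ae00}", 'chinese'));  // bool(false)
var_dump(matchesScript("\u{304b}\u{306a}", 'japanese')); // bool(true):  かな
var_dump(matchesScript("\u{3131}", 'japanese'));         // bool(true):  a Korean jamo, showing the overlap
```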