Detailed Explanation of PCRE Regular Expression Parsing in PHP

  • 2021-12-09 08:19:23
  • OfStack

1. Preface

The previous post analyzed character sets. This post is not about character sets themselves: the point here is that many PHP functions process strings as UTF-8 encoded Unicode by default. Let's get straight to it.

2. Analysis of PHP function mb_split


<?php
$preg_strings = '测,试,1,下';
$preg_str = mb_split(',', $preg_strings);
print_r($preg_str);

Output:

Array
(
    [0] => 测
    [1] => 试
    [2] => 1
    [3] => 下
)

mb_split treats its pattern as a multibyte regular expression and parses the input in the current regex encoding, which is UTF-8 on a default modern PHP setup. The string $preg_strings is therefore split at each occurrence of the separator's Unicode code point (here the comma), never in the middle of a multibyte character.
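As a minimal sketch of the same idea with a multibyte separator, the snippet below sets the regex encoding explicitly instead of relying on the default. The sample string and the full-width comma are assumptions chosen for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Assumption: mbstring is enabled and this file is saved as UTF-8.
mb_regex_encoding('UTF-8');   // state the encoding explicitly

$text  = '测,试,1,下';      // hypothetical sample; ',' is the full-width comma U+FF0C
$parts = mb_split(',', $text);

print_r($parts);              // four elements: 测, 试, 1, 下
```

Because the separator is matched as a whole character, its three UTF-8 bytes are never split apart.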

3. Analysis of PHP function preg_split

Split the string "测试1下" into individual characters:

<?php
$strings = '测试1下';
$mb_arr = preg_split('//u', $strings, -1, PREG_SPLIT_NO_EMPTY);
print_r($mb_arr);

The printed result is as follows:

Array
(
    [0] => 测
    [1] => 试
    [2] => 1
    [3] => 下
)
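To see why the u modifier matters in this split, the following sketch runs the same empty pattern with and without it. The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```php
<?php
$text = '测试1下';

// Without /u the empty pattern matches between every byte,
// so each 3-byte UTF-8 character is cut into three pieces.
$bytes = preg_split('//', $text, -1, PREG_SPLIT_NO_EMPTY);

// With /u the empty pattern matches between code points.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 10 (3 + 3 + 1 + 3 bytes)
echo count($chars), "\n"; // 4
```

The byte-level pieces are not valid UTF-8 on their own, which is why the character-level split needs /u.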

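4. The /u modifier in PCRE

In PHP, a regular expression is wrapped in delimiters. Common choices are #, % and /; any character that is not alphanumeric, a backslash, or whitespace can be used.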
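The choice of delimiter does not change what the pattern matches. It only affects which characters must be escaped inside it. A small sketch with a made-up subject string:

```php
<?php
$text = 'a/b';

// The same pattern written with three different delimiters.
// With '/' as the delimiter, a literal slash has to be escaped.
$r1 = preg_match('/a\/b/', $text);
$r2 = preg_match('#a/b#', $text);
$r3 = preg_match('%a/b%', $text);

var_dump($r1, $r2, $r3); // int(1) for each
```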

A pattern is sometimes followed by modifiers placed after the closing delimiter. What do they mean?

For example:

%[\x{4e00}-\x{9fa5}]+%u

Here the trailing modifier u tells PCRE to treat both the pattern and the subject string as UTF-8, so matching works on whole characters rather than on bytes.

Example 1:


<?php
$strings = '测试1下';
$is_true = preg_match_all('%[\x{4e00}-\x{9fa5}]+%u', $strings, $match);
print_r($match);

The printed result is as follows:

Array
(
    [0] => Array
        (
            [0] => 测试
            [1] => 下
        )

)

The digit 1 lies outside the range, so it separates the two matches; $is_true holds the number of matches, 2.

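What does [\x{4e00}-\x{9fa5}] mean here?

In PHP regular expressions, \x introduces a character given in hexadecimal. With the u modifier, \x{4e00} refers to the Unicode code point U+4E00; without u, a value above FF in this form is rejected when the pattern is compiled.

The Unicode block of common Chinese characters (CJK Unified Ideographs) spans the code points 4E00-9FFF (all in hexadecimal here).

The range can therefore also be written inside the character class [] as [\x{4E00}-\x{9FFF}].

For everyday Chinese text the two patterns behave identically. They differ only for the few characters in 9FA6-9FFF, which were added to the block later and are matched only by the wider range.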
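The relationship between the two ranges can be checked with a short sketch. The anchored patterns and the test code point U+9FA6 are assumptions picked for illustration; the "\u{...}" string escape requires PHP 7 or later.

```php
<?php
$narrow = '%^[\x{4e00}-\x{9fa5}]+$%u';
$wide   = '%^[\x{4e00}-\x{9fff}]+$%u';

$common = '测试';       // everyday characters, inside both ranges
$late   = "\u{9FA6}";   // first code point past U+9FA5

var_dump(preg_match($narrow, $common)); // int(1)
var_dump(preg_match($wide,   $common)); // int(1)
var_dump(preg_match($narrow, $late));   // int(0)
var_dump(preg_match($wide,   $late));   // int(1)
```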

