Notes on processing collected data: encoding conversion and regular-expression matching with preg_match_all

2020-12-19 20:57:06
OfStack

1. Use curl to fetch the remote page

For details, see my previous note: https://www.ofstack.com/article/46432.htm

2. Encoding conversion
First, look at the page source of the collected site to find the character encoding it uses, then convert the text with the mb_convert_encoding function.

Usage:


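// The source string is $str

// When the source encoding is known to be GBK, convert it to UTF-8
// (the function returns the converted string, so assign the result)
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// When the source encoding is unknown, "auto" lets mbstring detect it before converting to UTF-8
// (the detection list depends on the mbstring.language setting, so name the encoding whenever you know it)
$str = mb_convert_encoding($str, "UTF-8", "auto");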
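
The "look at the source, then convert" step can also be automated. The sketch below is only an illustration: the helper name to_utf8, the GBK fallback and the meta-tag pattern are assumptions, not part of the original note, and it assumes the declared charset is a name mbstring recognizes.

```php
<?php
// Hypothetical helper: read the charset declared in a <meta> tag and
// convert the page to UTF-8. Falls back to $fallback when nothing is declared.
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    $charset = $fallback;
    // Matches both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }
    $charset = strtoupper($charset);
    if ($charset === 'UTF-8') {
        return $html; // already in the target encoding
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// Simulate a page served as GBK, then convert it back to UTF-8.
$page = mb_convert_encoding('<meta charset="gbk"><p>中文</p>', 'GBK', 'UTF-8');
echo to_utf8($page), "\n";
```

Reading the declared charset avoids relying on "auto" detection, which can guess wrong on short or mixed input.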

3. Newlines, spaces and tabs appear unpredictably in the collected source and get in the way of matching, so remove them before going further


// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // Remove Windows line endings
$contents = str_replace("\n", '', $contents); // Remove remaining newlines
$contents = str_replace("\t", '', $contents); // Remove tabs
$contents = str_replace(" ", '', $contents); // Remove spaces

// Method 2: replace with a regular expression
// Inside a character class "|" is a literal character, so list the characters without it
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
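
To see what this step does to real markup, here is a minimal sketch. The function name strip_whitespace and the sample HTML are made up for illustration, and it uses the \s shorthand rather than an explicit character list.

```php
<?php
// Hypothetical helper: remove every whitespace character (\s covers
// spaces, tabs, carriage returns and newlines).
function strip_whitespace(string $contents): string
{
    // preg_replace returns null on error; keep the input in that case.
    return preg_replace('/\s+/', '', $contents) ?? $contents;
}

$html = "<p class=\"content\">\r\n\tHello world\n</p>";
echo strip_whitespace($html), "\n"; // <pclass="content">Helloworld</p>
```

Note that the space inside the tag disappears as well, so `<p class="content">` becomes `<pclass="content">`. Any pattern applied afterwards has to be written against that stripped form, and spaces between words in the text are lost too.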

4. Use a regular expression to locate the fragments you want; preg_match_all performs the matching


 Function signature:
int preg_match_all ( string pattern, string subject, array matches [, int flags] )
pattern is the regular expression
subject is the text to search
matches is the array that receives the results
flags controls how matches is laid out:
    PREG_PATTERN_ORDER;  // matches is a 2-dimensional array: $arr1[0] holds the full matches (delimiting markup included), $arr1[1] holds the text captured by the first group (delimiters removed)
    PREG_SET_ORDER;  // matches is a 2-dimensional array: $arr2[0][0] is the first full match, $arr2[0][1] is the first group of that match; $arr2[1] holds the second match in the same way, and so on
    PREG_OFFSET_CAPTURE;  // matches is a 3-dimensional array: $arr3[0][0][0] is the first full match and $arr3[0][0][1] is its byte offset in subject; $arr3[1][0][0] is the text captured by the first group in the first match and $arr3[1][0][1] is the offset of that captured text
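
The three layouts are easier to compare side by side. The following sketch runs one pattern over a made-up subject string (both are assumptions chosen for illustration) and shows where each value ends up.

```php
<?php
$subject = '<b>one</b><b>two</b>';
$pattern = '/<b>(.*?)<\/b>/';

// Grouped by capture group: index 0 = full matches, index 1 = group 1.
preg_match_all($pattern, $subject, $arr1, PREG_PATTERN_ORDER);
// $arr1[0] is ['<b>one</b>', '<b>two</b>'], $arr1[1] is ['one', 'two']

// Grouped by match: each element is one match with all of its groups.
preg_match_all($pattern, $subject, $arr2, PREG_SET_ORDER);
// $arr2[0] is ['<b>one</b>', 'one'], $arr2[1] is ['<b>two</b>', 'two']

// Same layout as PREG_PATTERN_ORDER, but every string becomes [text, byte offset].
preg_match_all($pattern, $subject, $arr3, PREG_OFFSET_CAPTURE);
// $arr3[0][1] is ['<b>two</b>', 10], $arr3[1][1] is ['two', 13]

echo $arr1[1][1], ' ', $arr2[1][1], ' ', $arr3[1][1][0], "\n"; // two two two
```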

// Practical application
// The pattern says <pclass= because step 3 removed every space, including the one in <p class=
preg_match_all('/<pclass="content">(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);
// $out holds every match
// $out[0][0] is the first full match, including <pclass="content"> and </p>
// $out[0][1] is only the text matched by (.*?) in the first match

// Likewise, the text captured in the n-th match is
$out[$n - 1][1];

// If the pattern contains several capture groups, the m-th group of the n-th match is
$out[$n - 1][$m];
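
As a small worked case of the n-th match / m-th group indexing, the sketch below uses a pattern with two capture groups. The sample markup and the second class name "note" are invented for the example.

```php
<?php
// Already-stripped markup, as produced by step 3.
$contents = '<pclass="content">first</p><pclass="note">second</p>';

// Group 1 captures the class name, group 2 captures the paragraph body.
preg_match_all('/<pclass="(\w+)">(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);

foreach ($out as $i => $match) {
    printf("match %d: class=%s text=%s\n", $i + 1, $match[1], $match[2]);
}

// The m-th group of the n-th match (both counted from 1).
$n = 2;
$m = 2;
echo $out[$n - 1][$m], "\n"; // second
```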

5. To remove the HTML tags from the text you extracted, use PHP's built-in strip_tags function


// Example
$result = strip_tags($out[0][1]);
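
When there are several matches, the same call can be applied to each one. This is a minimal sketch: the helper name plain_text and the sample markup are assumptions for illustration, and it expects the PREG_SET_ORDER layout used above.

```php
<?php
// Hypothetical helper: strip tags from group 1 of every PREG_SET_ORDER match.
function plain_text(array $out): array
{
    $result = [];
    foreach ($out as $match) {
        $result[] = strip_tags($match[1]);
    }
    return $result;
}

$contents = '<pclass="content">first<b>item</b></p><pclass="content"><em>second</em></p>';
preg_match_all('/<pclass="content">(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);

print_r(plain_text($out)); // elements: 'firstitem', 'second'
```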