Using the regular expression processing function get_matches for curl-based data collection

  • 2020-06-01 08:26:45
  • OfStack

Following on from the previous two posts:

Using the single-page collection function get_html for curl-based data collection

Using the parallel page collection function get_htmls for curl-based data collection

Those two posts gave us the HTML we need; now we have to process it to extract the data we actually want to collect.

For parsing HTML there is no strict parser class comparable to what exists for XML, because HTML documents are full of unmatched tags and are not well-formed, so helper classes are used instead. simplehtmldom is one such class: it lets you work on an HTML document much the way jQuery does, which makes it convenient to pull out the data you want, but it is slow. It is not the focus here. I mainly use regular expressions to match the data I want to collect, which gets the needed information very quickly.
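As a quick illustration of the regex approach (a minimal sketch; the sample HTML and the pattern are made up for this example and are not part of the functions below), links can be pulled out of a page directly with preg_match_all:

// Minimal sketch of regex-based extraction; the HTML and pattern are examples only.
$html = '<div><a href="/a.html">First</a> <a href="/b.html">Second</a></div>';

// Capture the href and the anchor text of every <a> tag.
// HTML is not strict, so the pattern is kept deliberately loose.
if (preg_match_all('!<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>!i', $html, $matches)) {
    print_r($matches[1]); // hrefs
    print_r($matches[2]); // link texts
}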

get_html can check the data it returns, but get_htmls cannot, so the following two functions are written to make matching and error reporting convenient:


// Run a regular expression against $html and return the matches,
// or false (after echoing $err_msg and the PCRE error) when the match fails.
// $multi selects preg_match_all() instead of preg_match().
function get_matches($pattern, $html, $err_msg, $multi = false, $flags = 0, $offset = 0){
    if(!$multi){
        if(!preg_match($pattern, $html, $matches, $flags, $offset)){
            echo $err_msg."!  The error message : ".get_preg_err_msg()."\n";
            return false;
        }
    }else{
        if(!preg_match_all($pattern, $html, $matches, $flags, $offset)){
            echo $err_msg."!  The error message : ".get_preg_err_msg()."\n";
            return false;
        }
    }
    return $matches;
}
// Translate preg_last_error() into a readable PCRE error name.
function get_preg_err_msg(){
    $error_code = preg_last_error();
    switch($error_code){
        case PREG_NO_ERROR:
            $err_msg = 'PREG_NO_ERROR';
            break;
        case PREG_INTERNAL_ERROR:
            $err_msg = 'PREG_INTERNAL_ERROR';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $err_msg = 'PREG_BACKTRACK_LIMIT_ERROR';
            break;
        case PREG_RECURSION_LIMIT_ERROR:
            $err_msg = 'PREG_RECURSION_LIMIT_ERROR';
            break;
        case PREG_BAD_UTF8_ERROR:
            $err_msg = 'PREG_BAD_UTF8_ERROR';
            break;
        case PREG_BAD_UTF8_OFFSET_ERROR:
            $err_msg = 'PREG_BAD_UTF8_OFFSET_ERROR';
            break;
        default:
            return 'An unknown error!';
    }
    return $err_msg.': '.$error_code;
}

You can call:

$url = 'http://www.baidu.com';
$html = get_html($url);
$matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
if($matches){
    var_dump($matches);
}

Or call:

$urls = array('http://www.baidu.com', 'http://www.hao123.com');
$htmls = get_htmls($urls);
foreach($htmls as $html){
    $matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
    if($matches){
        var_dump($matches);
    }
}

Either way we get the information we need. Whether we collect a single page or many pages in parallel, PHP ultimately processes only one page at a time. Because get_matches returns false on failure, its return value can be checked before the data is used; and because regular expressions sometimes run into problems such as backtracking limits, get_preg_err_msg is added to report the PCRE error.
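For example (a hypothetical test case, not part of the original code), a pattern and subject combination that can exhaust PCRE's backtracking limit makes preg_match fail without a normal "no match", and get_preg_err_msg makes the cause visible:

// Hypothetical test case: this match can exhaust pcre.backtrack_limit.
$matches = get_matches('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar', 'Match failed');
if ($matches === false) {
    // The echoed message names the PCRE error (typically the backtracking
    // limit) instead of failing silently.
}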

When collecting data, we usually collect a list page first and then collect the content pages through the links found on that list page, and sometimes go even deeper. This produces a lot of nested loops, and the code quickly becomes hard to keep under control. Can we separate the code that collects the list page from the code that collects the content pages (or deeper levels), and even simplify the loops?
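To make the nesting problem concrete, a naive version might look like the rough sketch below; the list URL and both patterns are placeholders for illustration only, and get_html comes from the earlier post in this series:

// Rough sketch of the nested-loop structure described above (placeholder URLs/patterns).
$list_url  = 'http://example.com/list.html';
$list_html = get_html($list_url);
$links = get_matches('!<a href="(http://example\.com/item/\d+\.html)"!', $list_html, 'No item links found', true);
if ($links) {
    foreach ($links[1] as $content_url) {      // loop over every content page on the list
        $content_html = get_html($content_url);
        $title = get_matches('!<h1>(.*?)</h1>!s', $content_html, 'No title found');
        if ($title) {
            echo $title[1] . "\n";             // deeper levels would add yet more nesting
        }
    }
}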

