curl data collection: the parallel single-page collection function get_htmls

  • 2020-06-01 08:26:54
  • OfStack

The get_html() function from the first article implements simple data collection. Because pages are fetched one at a time, the total time is the sum of every page's download time. If one page takes one second, ten pages take ten seconds. Fortunately, curl also provides parallel processing.
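For contrast, this is roughly what serial collection with get_html() looks like (a sketch only; get_html($url, $options) is assumed to have the shape described in the first article):

$htmls = array();
for ($i = 1; $i <= 5; $i++) {
    // Serial collection: each call blocks until its page has downloaded,
    // so five pages cost roughly five times the single-page download time.
    $url = 'http://www.baidu.com/s?wd=shili&pn=' . (($i - 1) * 10) . '&ie=utf-8';
    $htmls[] = get_html($url, array(CURLOPT_TIMEOUT => 5)); // assumed signature from the first article
}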

To write a reasonably general parallel-collection function, we first need to consider what kinds of pages will be collected and what kinds of requests they require.


Functional requirement analysis:

What should it return?

Naturally, the collected HTML of every page, gathered into an array.

What parameters should it take?

When writing get_html(), we learned that an options array can be used to pass extra curl settings, so that ability has to be preserved in a function that fetches many pages at once.

What form should the parameters take?

Whether we are requesting a web page's HTML or calling a web API, GET and POST requests usually target the same page or interface, just with different parameters. So the parameters would be:

get_htmls($url,$options);

$url is a string

$options is a two-dimensional array: one options array per page.

That seems to solve the problem. But I couldn't find anywhere in curl's manual to pass GET parameters separately (they have to be part of the URL itself), so the only choice is to pass $url as an array of URLs and add a $method parameter.


So the prototype of the function becomes get_htmls($urls, $options = array(), $method = 'get'). The code is as follows:


function get_htmls($urls, $options = array(), $method = 'get'){
    $mh = curl_multi_init();
    $curls = array();
    $htmls = array();
    if($method == 'get'){ // GET request: the most common case; $urls is an array of URLs
        foreach($urls as $key => $url){
            $ch = curl_init($url);
            $options[CURLOPT_RETURNTRANSFER] = true;
            $options[CURLOPT_TIMEOUT] = 5;
            curl_setopt_array($ch, $options);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    }elseif($method == 'post'){ // POST request: $urls is a single URL, $options holds one option array per request
        foreach($options as $key => $option){
            $ch = curl_init($urls);
            $option[CURLOPT_RETURNTRANSFER] = true;
            $option[CURLOPT_TIMEOUT] = 5;
            $option[CURLOPT_POST] = true;
            curl_setopt_array($ch, $option);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    }else{
        exit("Parameter error!\n");
    }
    do{
        $mrc = curl_multi_exec($mh, $active);
        curl_multi_select($mh); // waits for activity to reduce CPU load; removing it makes the loop busy-wait
    }while($active);
    foreach($curls as $key => $ch){
        $html = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $htmls[$key] = $html;
    }
    curl_multi_close($mh);
    return $htmls;
}
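The do-while loop above works, but a slightly more defensive variant (a sketch, not part of the original function) also checks curl_multi_exec()'s return code and the result of curl_multi_select():

$active = null;
do {
    // Run the handles until curl no longer asks to be called again immediately.
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    // Wait up to one second for socket activity; back off briefly if select() is unavailable.
    if (curl_multi_select($mh, 1.0) == -1) {
        usleep(100);
    }
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
}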

A typical GET request varies only its URL parameters, and since our function is meant for data collection, pages are usually collected by category, so the URLs look something like this:

http://www.baidu.com/s?wd=shili&pn=0&ie=utf-8

http://www.baidu.com/s?wd=shili&pn=10&ie=utf-8

http://www.baidu.com/s?wd=shili&pn=20&ie=utf-8

http://www.baidu.com/s?wd=shili&pn=30&ie=utf-8

http://www.baidu.com/s?wd=shili&pn=40&ie=utf-8

These five URLs are very regular; only the value of pn changes.


$urls = array();
for($i = 1; $i <= 5; $i++){
    $urls[] = 'http://www.baidu.com/s?wd=shili&pn=' . (($i - 1) * 10) . '&ie=utf-8';
}
$options = array();
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
$htmls = get_htmls($urls, $options);
foreach($htmls as $html){
    echo $html; // $html holds the page HTML; process the collected data here
}
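The echo above simply dumps each page. In practice you would parse the HTML instead; as a minimal illustration (a rough sketch, not robust HTML parsing), you could pull out each page's <title>:

foreach ($htmls as $key => $html) {
    // Crude example: grab the <title> tag with a regular expression.
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $matches)) {
        echo 'Page ' . $key . ' title: ' . trim($matches[1]) . "\n";
    }
}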

Simulating an ordinary POST request:

Create a post.php file with the following contents:


if(isset($_POST['username']) && isset($_POST['password'])){
    echo 'Username: ' . $_POST['username'] . ' Password: ' . $_POST['password'];
}else{
    echo 'Request error!';
}

Then call as follows:

$url = 'http://localhost/yourpath/post.php'; // replace with your own path
$options = array();
for($i = 1; $i <= 5; $i++){
    $option = array();
    $option[CURLOPT_POSTFIELDS] = 'username=user' . $i . '&password=pass' . $i;
    $options[] = $option;
}
$htmls = get_htmls($url, $options, 'post');
foreach($htmls as $html){
    echo $html; // $html holds the response body; process the collected data here
}
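Because the keys of $htmls come from the keys of $options, the responses come back in the same order as the POST payloads. With the post.php above, the output should look roughly like:

Username: user1 Password: pass1
Username: user2 Password: pass2
...
Username: user5 Password: pass5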

With this, the get_htmls() function can handle most basic data-collection tasks.

That's all for today's share. If anything is poorly written or unclear, corrections and advice are welcome.

