Main Methods of Crawling Web Pages in PHP

  • 2021-10-24 19:18:01
  • OfStack

The basic workflow is to fetch the whole web page and then extract the data you need with regular expressions (the critical step).
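As a minimal sketch of the matching step, the snippet below pulls the title and the links out of a page with preg_match and preg_match_all. The sample HTML string and the helper names extract_title and extract_links are assumptions made for illustration; in a real crawl the string would come from one of the fetch methods listed below.

```php
<?php
// Hypothetical sample page; a real crawl would fetch this string from the network.
$html = '<html><head><title>Demo page</title></head><body>'
      . '<a href="/a.html">A</a> <a href="/b.html">B</a> <a href="/a.html">A again</a>'
      . '</body></html>';

// Return the text of the <title> element, or null when there is none.
// Flags: i = case-insensitive, s = dot also matches newlines.
function extract_title(string $html): ?string
{
    return preg_match('/<title>(.*?)<\/title>/is', $html, $m) === 1 ? trim($m[1]) : null;
}

// Return all href values, de-duplicated and re-indexed from 0.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    return array_values(array_unique($m[1]));
}

echo extract_title($html), "\n";
print_r(extract_links($html));
```

Regular expressions are enough for simple, predictable pages like this one; heavily nested or irregular markup may call for a real HTML parser instead.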

PHP offers several ways to fetch a page. Some of the methods below come from other people's experience shared online, and I have not used them myself yet; I am recording them here to try later.

1. The file() function

2. The file_get_contents() function

3. The fopen() -> fread() -> fclose() sequence

4. curl (I mainly use this)

5. The fsockopen() function (socket mode)

6. Plug-ins (e.g. http://sourceforge.net/projects/snoopy/)

1. The file() function


<?php
// Define the URL
$url = 'http://t.qq.com';
// file() reads the content into an array, one element per line
$lines_array = file($url);
// Join the array back into a single string
$lines_string = implode('', $lines_array);
// Output the content
echo $lines_string;
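The listing above assumes that file() succeeds. file() returns false on failure, and reading a remote URL with it depends on the allow_url_fopen setting. The sketch below adds both checks; the helper name read_with_file and the choice to return null on failure are assumptions for illustration.

```php
<?php
// Hypothetical helper: read a local path or a URL with file() and return the
// contents as one string, or null when the read fails.
function read_with_file(string $source): ?string
{
    $isUrl = preg_match('#^https?://#i', $source) === 1;
    // Remote URLs only work when allow_url_fopen is enabled in php.ini.
    if ($isUrl && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }
    // file() returns false on failure; @ hides the warning so the caller decides what to do.
    $lines = @file($source);
    if ($lines === false) {
        return null;
    }
    return implode('', $lines);
}

// Demonstration with a temporary local file so that no network access is needed.
$tmp = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($tmp, "first line\nsecond line\n");
$content = read_with_file($tmp);
$missing = read_with_file($tmp . '.missing');
unlink($tmp);

echo $content;
var_dump($missing);
```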

2. The file_get_contents() function, which is relatively simple.

To use file_get_contents and fopen on remote URLs, allow_url_fopen must be enabled on the hosting account. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen nor file_get_contents can open remote files.


<?php
$url = "http://news.sina.com.cn/c/nd/2016-10-23/doc-ifxwztru6951143.shtml";
$html = file_get_contents($url);
// If Chinese text comes out garbled, convert the encoding with the line below
//$html = iconv("gb2312", "utf-8", $html);
echo "<textarea style='width:800px;height:600px;'>" . $html . "</textarea>";
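The commented-out iconv call above hard-codes gb2312. One possible refinement is to read the charset the page declares and convert only when it is not already UTF-8. This is a sketch under stated assumptions: the helper names detect_charset and to_utf8 are hypothetical, and the detection only looks at a meta tag, not at the HTTP Content-Type header.

```php
<?php
// Hypothetical helpers, not part of the original listing.

// Find a charset declaration such as <meta charset="gb2312"> or
// <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
function detect_charset(string $html, string $default = 'UTF-8'): string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m) === 1) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Convert the page to UTF-8 only when it declares a different charset.
function to_utf8(string $html): string
{
    $charset = detect_charset($html);
    if ($charset === 'UTF-8') {
        return $html;
    }
    // iconv() returns false when the conversion fails; fall back to the raw bytes.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// "\xD6\xD0\xCE\xC4" is the GB2312 byte sequence for the two characters 中文.
$page = "<meta charset=\"gb2312\">\xD6\xD0\xCE\xC4";
echo detect_charset($page), "\n";
echo to_utf8($page), "\n";
```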

3. The fopen() -> fread() -> fclose() sequence (I have not used it yet; see the note at the top)


<?php
// Define the URL
$url = 'http://t.qq.com';
// Open the URL with fopen in binary read mode
$handle = fopen($url, "rb");
if ($handle === false) {
    exit("Could not open $url");
}
// Initialize the variable
$lines_string = "";
// Read the data in a loop
do {
    $data = fread($handle, 1024);
    if (strlen($data) == 0) {
        break;
    }
    $lines_string .= $data;
} while (true);
// Close the fopen handle to release resources
fclose($handle);
// Output the content
echo $lines_string;
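The read loop above can be factored into a small reusable function. The sketch below is one way to write it, using feof() as the loop condition; the name read_stream is an assumption, and the demonstration uses an in-memory stream instead of a URL so it runs without network access. For a real page you would pass it the handle returned by fopen($url, "rb").

```php
<?php
// Hypothetical helper: read everything from an already opened stream in fixed-size chunks.
function read_stream($handle, int $chunkSize = 1024): string
{
    $buffer = '';
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error, or nothing more to read
        }
        $buffer .= $chunk;
    }
    return $buffer;
}

// Demonstration with an in-memory stream holding 2500 bytes.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat('x', 2500));
rewind($handle);
$content = read_stream($handle);
fclose($handle);

echo strlen($content), "\n";
```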

4. curl (this is the method I normally use).

To use curl, the curl extension must be enabled on the hosting account. On Windows, edit php.ini and remove the semicolon in front of extension=php_curl.dll, then copy ssleay32.dll and libeay32.dll to C:\WINDOWS\system32. On Linux, install the curl extension.


<?php
header("Content-Type: text/html;charset=utf-8");
date_default_timezone_set('PRC');
$url = "https://***********ycare"; // URL to crawl
$res = curl_get_contents($url); // curl wrapper function, defined below
// The data on this page is delivered inside a JavaScript block, so grab the <script> contents directly
preg_match_all('/<script>(.*?)<\/script>/',$res,$arr_all);
// Match the desired data inside the JavaScript block
preg_match_all('/"id"\:"(.*?)",/',$arr_all[1][1],$arr1);
// Remove duplicate ids; note that array_unique() preserves the original keys
$list = array_unique($arr1[1]);
// The detail pages are handled the same way, so a loop is enough
for($i=0;$i<=6;$i=$i+2){
  $detail_url = 'ht*****em/'.$list[$i];
  $detail_res = curl_get_contents($detail_url);
  preg_match_all('/<script>(.*?)<\/script>/',$detail_res,$arr_detail);
  preg_match('/"desc"\:"(.*?)",/',$arr_detail[1][1],$arr_content);
  ***
    ***
    ***
  $ret=curl_post('http://**********cms.php',$result); // The receiving script is not deployed on a server; it is shown only to convey the idea
}
function curl_get_contents($url,$cookie='',$referer='',$timeout=300,$ishead=0) {
  $curl = curl_init();
  curl_setopt($curl, CURLOPT_RETURNTRANSFER,1);
  curl_setopt($curl, CURLOPT_FOLLOWLOCATION,1);
  curl_setopt($curl, CURLOPT_URL,$url);
  curl_setopt($curl, CURLOPT_TIMEOUT,$timeout);
  curl_setopt($curl, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36');
  if($cookie)
  {
    curl_setopt( $curl, CURLOPT_COOKIE,$cookie);
  }
  if($referer)
  {
    curl_setopt ($curl,CURLOPT_REFERER,$referer);
  }
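  // For https URLs, certificate verification is switched off; this is insecure and only suitable for testing
  $ssl = substr($url, 0, 8) == "https://";
  if ($ssl)
  {
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
  }
  $res = curl_exec($curl);
  // Release the handle before returning
  curl_close($curl);
  return $res;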
}
// POST data to the server with curl
function curl_post($url,$data){
  $ch = curl_init();
  curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
  //curl_setopt($ch,CURLOPT_FOLLOWLOCATION, 1);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
  curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36');
  curl_setopt($ch,CURLOPT_URL,$url);
  curl_setopt($ch,CURLOPT_POST,true);
  curl_setopt($ch,CURLOPT_POSTFIELDS,$data);
  $output = curl_exec($ch);
  curl_close($ch);
  return $output; 
}
?>

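The curl_get_contents wrapper above returns whatever curl_exec() gives back, so a failed request is hard to tell apart from an empty page. The sketch below shows one way to report failures explicitly. The function name fetch_with_curl, the 5-second connect timeout, and the shape of the returned array are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical wrapper: fetch a URL with curl and report success or failure explicitly.
function fetch_with_curl(string $url, int $timeout = 30): array
{
    if (!extension_loaded('curl')) {
        return ['ok' => false, 'status' => 0, 'body' => '', 'error' => 'curl extension is not loaded'];
    }
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $body   = curl_exec($ch);
    $error  = $body === false ? curl_error($ch) : '';
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    unset($ch); // drop the handle so its resources are released

    return [
        'ok'     => $body !== false && $status >= 200 && $status < 300,
        'status' => $status,
        'body'   => $body === false ? '' : $body,
        'error'  => $error,
    ];
}

// Port 1 on localhost is assumed to be closed, so this call demonstrates the failure path.
$result = fetch_with_curl('http://127.0.0.1:1/', 5);
echo $result['ok'] ? "fetched\n" : "failed: {$result['error']}\n";
```

Checking the HTTP status as well as the curl_exec() return value matters because curl treats a 404 or 500 response as a successful transfer by default.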
5. The fsockopen() function, socket mode (I have never used it; you can try it later)

Whether socket mode works also depends on the server configuration. You can check which communication protocols (registered stream socket transports) are enabled in the output of phpinfo().


<?php
$fp = fsockopen("t.qq.com", 80, $errno, $errstr, 30);
if (!$fp) {
  echo "$errstr ($errno)<br />\n";
} else {
  $out = "GET / HTTP/1.1\r\n";
  $out .= "Host: t.qq.com\r\n";
  $out .= "Connection: Close\r\n\r\n";
  fwrite($fp, $out);
  while (!feof($fp)) {
    echo fgets($fp, 128);
  }
  fclose($fp);
}
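Unlike the other methods, the socket listing above prints the raw HTTP response, status line and headers included. The sketch below shows one way to separate the headers from the body before matching; the helper name split_http_response and the sample response string are assumptions for illustration. It does not decode chunked transfer encoding, which an HTTP/1.1 server may use.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status line, headers and body.
function split_http_response(string $raw): array
{
    // Headers and body are separated by the first blank line (CRLF CRLF).
    $parts       = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine  = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['status' => $statusLine, 'headers' => $headers, 'body' => $parts[1] ?? ''];
}

// Sample response written by hand for the demonstration.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = split_http_response($raw);

echo $response['status'], "\n";
echo $response['headers']['content-type'], "\n";
echo $response['body'], "\n";
```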

6. The Snoopy plug-in. The latest version is Snoopy-1.2.4.zip (last updated 2013-05-30); recommended.

Snoopy is a very popular collection plug-in that is both powerful and convenient to use. You can also set its agent to imitate a browser.

The agent is set on line 45 of Snoopy.class.php; search the file for "var $agent". To get your browser's information, run echo $_SERVER['HTTP_USER_AGENT']; in PHP and copy the output into agent.


<?php
// Include the Snoopy class file
require('Snoopy.class.php');
// Create a Snoopy instance
$snoopy = new Snoopy;
$url = "http://t.qq.com";
// Start fetching the content
$snoopy->fetch($url);
// Save the fetched content to $lines_string
$lines_string = $snoopy->results;
// Output the content; you can also save it on your own server
echo $lines_string;
