Fetching and parsing web pages with PHP: recommended approaches

  • 2020-03-31 20:55:22
  • OfStack

Fetched data usually can't be output as-is. We often need to extract the relevant content and reformat it so that it is presented in a friendlier way.
Here is a brief outline of what this article covers:

I. Main ways to fetch pages with PHP:

1. The file() function
2. The file_get_contents() function
3. The fopen() -> fread() -> fclose() approach
4. curl
5. The fsockopen() function (socket mode)
6. Plugins, such as Snoopy (http://sourceforge.net/projects/snoopy/)

II. Main ways to parse HTML or XML with PHP:

1. Regular expressions
2. The PHP DOMDocument object
3. Plugins (e.g. PHP Simple HTML DOM Parser)

If you already know these well, you can skim the rest.

Fetching pages with PHP

1. The file() function
 
<?php 
$url='http://t.qq.com'; 
$lines_array=file($url); 
$lines_string=implode('',$lines_array); 
echo htmlspecialchars($lines_string); 
?> 


2. The file_get_contents() function
To open remote files with file_get_contents() or fopen(), the hosting environment must have allow_url_fopen enabled. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen() nor file_get_contents() can open a remote file.
 
<?php 
$url='http://t.qq.com'; 
$lines_string=file_get_contents($url); 
echo htmlspecialchars($lines_string); 
?> 
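
As a minimal sketch of the allow_url_fopen requirement described above, the setting can be checked at run time before a remote fetch is attempted. The helper name fetch_url and the 5-second timeout are assumptions made for illustration; they are not part of the original article.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) with file_get_contents().
// Returns the content as a string, or null on failure.
function fetch_url($url, $timeout = 5)
{
    // ini_get() returns the setting as a string; "1" or "On" both mean enabled.
    $enabled = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    $is_remote = preg_match('#^(https?|ftp)://#i', $url) === 1;
    if ($is_remote && !$enabled) {
        return null; // remote URLs cannot be opened while allow_url_fopen is Off
    }
    // A stream context lets us set a timeout (in seconds) for the http wrapper.
    $context = stream_context_create(array('http' => array('timeout' => $timeout)));
    $data = @file_get_contents($url, false, $context);
    return $data === false ? null : $data;
}
```

It could then be called as $page = fetch_url('http://t.qq.com'); a null result means either that the setting is off or that the request itself failed.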


3. The fopen() -> fread() -> fclose() approach

 
<?php 
$url='http://t.qq.com'; 
$handle=fopen($url,"rb"); 
if($handle===false){die("Could not open $url");} 
$lines_string=""; 
// Read in 1 KB chunks until the end of the stream 
while(!feof($handle)){ 
$data=fread($handle,1024); 
if($data===false){break;} 
$lines_string.=$data; 
} 
fclose($handle); 
echo htmlspecialchars($lines_string); 
?> 


4. curl
To use curl, the hosting environment must have the curl extension enabled. On Windows, edit php.ini, remove the semicolon in front of extension=php_curl.dll, and copy ssleay32.dll and libeay32.dll to C:\Windows\system32. On Linux, install the curl extension.
 
<?php 
$url='http://t.qq.com'; 
$ch=curl_init(); 
$timeout=5; 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); 
$lines_string=curl_exec($ch); 
curl_close($ch); 
echo htmlspecialchars($lines_string); 
?> 
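
The listing above assumes that curl is enabled and that the request succeeds. A minimal sketch of how both could be checked is shown here; the helper name curl_fetch and the overall transfer limit are assumptions for illustration only.

```php
<?php
// Hypothetical helper: fetch a URL with curl.
// Returns array(body, error): body is null on failure, error is null on success.
function curl_fetch($url, $timeout = 5)
{
    if (!extension_loaded('curl')) {
        return array(null, 'curl extension is not enabled');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds to wait for the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);    // assumed overall limit for the transfer
    $body = curl_exec($ch);                             // false when the transfer fails
    $error = ($body === false) ? curl_error($ch) : null;
    // The handle is released when $ch goes out of scope at the end of the function.
    return array($body === false ? null : $body, $error);
}
```

A call such as list($body, $error) = curl_fetch('http://t.qq.com'); leaves the reason for a failed transfer in $error instead of silently echoing an empty string.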


5. The fsockopen() function (socket mode)
Whether socket mode works depends on how the server is configured. You can use phpinfo() to check which communication protocols the server has enabled. For example, my local PHP setup does not have HTTP enabled for sockets, so I could only test with udp.
 
<?php 
// Connect to the local daytime service (port 13) over udp 
$fp = fsockopen("udp://127.0.0.1", 13, $errno, $errstr); 
if (!$fp) { 
echo "ERROR: $errno - $errstr<br />\n"; 
} else { 
fwrite($fp, "\n"); 
echo fread($fp, 26); 
fclose($fp); 
} 
?> 
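
As a rough sketch of the check mentioned above, the socket transports registered in a PHP build can also be listed from code with stream_get_transports(), which reports the same information that phpinfo() shows under "Registered Stream Socket Transports". Where tcp is available, an HTTP request can be written to the socket by hand. The helper names build_get_request and socket_fetch are hypothetical and only illustrate the idea.

```php
<?php
// List the socket transports this PHP build has registered (for example tcp, udp).
$transports = stream_get_transports();
echo implode(', ', $transports), "\n";

// Hypothetical helper: build a minimal HTTP/1.0 GET request to send over a socket.
function build_get_request($host, $path = '/')
{
    return "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n";
}

// Hypothetical helper: fetch a page over a plain tcp socket.
// Returns the raw response (status line, headers and body), or null on failure.
function socket_fetch($host, $path = '/', $port = 80, $timeout = 5)
{
    $fp = @fsockopen("tcp://$host", $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return null;
    }
    fwrite($fp, build_get_request($host, $path));
    $response = stream_get_contents($fp); // read until the server closes the connection
    fclose($fp);
    return $response === false ? null : $response;
}
```

Unlike file_get_contents() or curl, this returns the response headers together with the body, so the caller has to split them apart.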


6. Plugins
There are quite a few plugins available online. Snoopy is one I came across; it is worth a look if you are interested.

Parsing HTML (XML) with PHP

1. Regular expressions:

 
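<?php 
$url='http://t.qq.com'; 
$lines_string=file_get_contents($url); 
// eregi() was removed in PHP 7; preg_match() with the i flag is the case-insensitive replacement 
if(preg_match('#<title>(.*?)</title>#is',$lines_string,$title)){ 
// $title[0] is the whole match including the tags; $title[1] is only the text between them 
echo htmlspecialchars($title[0]); 
} 
?> 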


2. The PHP DOMDocument object
If the remote HTML or XML contains syntax errors, PHP will report errors when parsing it into a DOM.

 
<?php 
$url='http://www.136web.cn'; 
$html=new DOMDocument(); 
$html->loadHTMLFile($url); 
$title=$html->getElementsByTagName('title'); 
echo $title->item(0)->nodeValue; 
?> 
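
Real-world HTML is often malformed, so the error reporting mentioned above can get noisy. One possible way to deal with it, sketched here under the assumption that collecting the problems is preferable to printing them, is libxml_use_internal_errors(). The sample markup is made up for illustration.

```php
<?php
// Assumed sample markup containing a stray </div> inside an unclosed <p>.
$markup = '<html><head><title>Demo page</title></head><body><p>Hello</div></body></html>';

// Collect libxml parse errors internally instead of having PHP print them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the previous setting

$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo $title, "\n";
echo count($errors), " parse problem(s) collected\n";
```

The same pattern should apply to loadHTMLFile() with a remote URL; the inline string only keeps the sketch independent of the network.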


3. Plugins
This article uses PHP Simple HTML DOM Parser as a brief example. The syntax of simple_html_dom is similar to jQuery's, which makes manipulating the DOM in PHP as simple as manipulating it with jQuery.
 
<?php 
$url='http://t.qq.com'; 
include_once('../simplehtmldom/simple_html_dom.php'); 
$html=file_get_html($url); 
$title=$html->find('title'); 
echo $title[0]->plaintext; 
?> 


Of course, people in China are creative. Foreign developers tend to be ahead in technology, but Chinese developers are often better at applying it, doing things that others would not think of. Remote fetching and parsing in PHP, for example, was meant to make data integration easier. It has become very popular in China, which is why there are so many content-scraping sites: they create no valuable content of their own, but grab content from other people's websites and present it as their own. Type the keyword "PHP small" into Baidu and the first entry in the suggestion list is "PHP thief program"; type the same keyword into Google and I can only smile and say nothing.
