Example of capturing images with the PHP collection class Snoopy

  • 2021-07-01 06:51:38
  • OfStack

After using PHP's Snoopy class for two days, I found it very useful. To get all the links on the requested page, just call fetchlinks; to get all the text, call fetchtext (which is still processed with regular expressions internally); and it has more features, such as simulating form submission.


Usage:

Download the Snoopy class first from http://sourceforge.net/projects/snoopy/
Then instantiate an object and call the corresponding method to obtain the information of the crawled page:


include 'snoopy/Snoopy.class.php';
   
$snoopy = new Snoopy();
   
$sourceURL = "https://www.ofstack.com";
$snoopy->fetchlinks($sourceURL);
   
$a = $snoopy->results;
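
fetchtext and form submission work the same way. A minimal sketch reusing the $snoopy object from the snippet above (the form URL and field names are placeholders, not taken from the original article):

// Grab just the text content of a page
$snoopy->fetchtext("https://www.ofstack.com");
$text = $snoopy->results;

// Simulate submitting a form; the URL and field names are only examples
$formVars = array("username" => "demo", "password" => "secret");
$snoopy->submit("http://example.com/login.php", $formVars);
$html = $snoopy->results; // page returned after the POST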

Snoopy does not provide a way to get all the image addresses in a web page, and I needed to get the image addresses from every article list on one page, so I wrote one myself; the key part is the matching regular expression.


// A regular expression that matches an image tag whose src starts with http://
$reTag = "/<img[^>]+src=\"(http:\/\/[^\"]+)\.(jpg|png|gif|jpeg)\"[^>]*\/?>/i";
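
As a quick check of the pattern, it can be run against a small, made-up HTML fragment (the fragment and URL below are only for illustration):

// Quick test of the image pattern against a made-up fragment
$html = '<p><img class="pic" src="http://img.example.com/topic/123.jpg"/></p>';
$reTag = "/<img[^>]+src=\"(http:\/\/[^\"]+)\.(jpg|png|gif|jpeg)\"[^>]*\/?>/i";

if (preg_match_all($reTag, $html, $m)) {
    echo $m[1][0]; // http://img.example.com/topic/123
    echo $m[2][0]; // jpg
}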


Because of the specific requirement, only images whose addresses start with http:// need to be grabbed (images hosted on external sites may have hotlink protection, so they are grabbed and saved locally first).

1. Crawl the specified web page and filter out all the expected article addresses;

2. Loop over the article addresses from step 1, match each page against the image regular expression, and collect every image address on the page that fits the rule;

3. Save each image locally by its suffix and ID (only gif and jpg here); if the image file already exists, delete it before saving.


<meta http-equiv='content-type' content='text/html;charset=utf-8'>
<?php
    include 'snoopy/Snoopy.class.php';
   
    $snoopy = new Snoopy();
   
    $sourceURL = "http://xxxxx";
    $snoopy->fetchlinks($sourceURL);
   
    $a = $snoopy->results;
    $re = "/\d+\.html$/";
   
    // Filter the request to get the specified file address
    foreach ($a as $tmp) {
        if (preg_match($re, $tmp)) {
            getImgURL($tmp);
        }
    }
   
    function getImgURL($siteName) {
        $snoopy = new Snoopy();
        $snoopy->fetch($siteName);
       
        $fileContent = $snoopy->results;
       
        // A regular expression that matches an image tag whose src starts with http://
        $reTag = "/<img[^>]+src=\"(http:\/\/[^\"]+)\.(jpg|png|gif|jpeg)\"[^>]*\/?>/i";
       
        if (preg_match($reTag, $fileContent)) {
            $ret = preg_match_all($reTag, $fileContent, $matchResult);
           
            for ($i = 0, $len = count($matchResult[1]); $i < $len; ++$i) {
                saveImgURL($matchResult[1][$i], $matchResult[2][$i]);
            }
        }
    }
   
    function saveImgURL($name, $suffix) {
        $url = $name.".".$suffix;
       
        echo " Requested picture address: ".$url."<br/>";
       
        $imgSavePath = "E:/xxx/style/images/";
        $imgId = preg_replace("/^.+\/(\d+)$/", "\\1", $name);
        if ($suffix == "gif") {
            $imgSavePath .= "emotion";
        } else {
            $imgSavePath .= "topic";
        }
        $imgSavePath .= ("/".$imgId.".".$suffix);
       
        if (is_file($imgSavePath)) {
            unlink($imgSavePath);
            echo "<p style='color:#f00;'> Documents ".$imgSavePath." Already exists and will be deleted </p>";
        }
       
        $imgFile = file_get_contents($url);
        $flag = file_put_contents($imgSavePath, $imgFile);
       
        if ($flag) {
            echo "<p> Documents ".$imgSavePath." Save successfully </p>";
        }
    }
?>

When using PHP to crawl web pages for content, images, or links, I think the most important part is the regular expression (extracting the desired data from the crawled content according to the specified rules). The idea itself is fairly simple and only a few methods are involved; for fetching the content you can simply call the methods of a class someone else has already written.

What I had been wondering about before, though, is something PHP does not seem to handle directly: a file has N lines (N is very large) and the lines that match a rule need to be replaced, for example line 3 is "aaa" and needs to be changed to "bbbbb". Common approaches when you need to modify a file like this:

1. Read the entire file at once (or line by line), write the converted result to a temporary file, and then replace the original file with it;

2. Read line by line, use fseek to control the position of the file pointer, and then write with fwrite.

When the file is large, reading it all at once in Approach 1 is not desirable (and reading line by line, writing to a temporary file and then replacing the original is not very efficient either). Approach 2 is fine as long as the replacement string is no longer than what it replaces, but if it is longer it "crosses the boundary" and corrupts the data of the next line (you cannot swap in new content the way a "selection" can be replaced in JavaScript).
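
For reference, here is a minimal sketch of Approach 1 (the file name d:/file.txt and the "replace line 3 with bbbbb" rule are only examples):

<?php
// Approach 1: rewrite the file through a temporary copy
$filename = "d:/file.txt";
$tmpName  = $filename . ".tmp";

$in  = fopen($filename, "r");
$out = fopen($tmpName, "w");

$lineNo = 0;
while (($line = fgets($in)) !== false) {
    $lineNo++;
    if ($lineNo == 3) {
        $line = "bbbbb\n"; // replace the whole line; its length no longer matters
    }
    fwrite($out, $line);
}

fclose($in);
fclose($out);

// Replace the original file with the converted copy
unlink($filename);
rename($tmpName, $filename);
?>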

Here is the code for experimenting with Approach 2:


<?php
$mode = "r+";
$filename = "d:/file.txt";
$fp = fopen($filename, $mode);
if ($fp) {
    $i = 1;
    while (!feof($fp)) {
        $str = fgets($fp);
        echo $str;
        if ($i == 1) {
            $len = strlen($str);
            fseek($fp, -$len, SEEK_CUR); // move the pointer back to the start of the line just read
            fwrite($fp, "123");
        }
        $i++;
    }
    fclose($fp);
}
?>

Read one line first; at that point the file pointer is actually at the beginning of the next line. Use fseek to move it back to the beginning of the line just read, then use fwrite to do the replacement. Because fwrite simply overwrites bytes, without limiting the length it spills into the data of the next line, while what I want is to operate on this one line only, for example deleting it or replacing the whole line. The example above cannot meet that requirement; perhaps I just haven't found the right method yet.
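
One possible workaround is to read everything after the target line into memory, seek back, write the new line followed by that remainder, and then truncate the file to the new length; a minimal sketch, again using the hypothetical d:/file.txt:

<?php
$filename = "d:/file.txt";
$target   = 3;          // 1-based number of the line to replace (example value)
$newLine  = "bbbbb\n";  // replacement content, may be longer or shorter than the original

$fp = fopen($filename, "r+");
$lineNo = 0;
while (true) {
    $pos  = ftell($fp);          // offset of the start of the current line
    $line = fgets($fp);
    if ($line === false) {
        break;
    }
    $lineNo++;
    if ($lineNo == $target) {
        $tail = stream_get_contents($fp); // everything after the target line
        fseek($fp, $pos);                 // back to the start of the target line
        fwrite($fp, $newLine . $tail);    // write the new line plus the tail
        ftruncate($fp, ftell($fp));       // drop any leftover bytes
        break;
    }
}
fclose($fp);
?>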

