Analysis of the principles behind PHP collection (scraping) programs

  • 2020-03-31 20:30:49
  • OfStack

After a few days of hard thinking I finally figured it out, so I'm writing it down here.
The idea behind collection is very simple: first fetch a page, usually a list page, and grab the addresses of all the links on it; then open each link, look for the content we are interested in, and if we find it, save it to the database or process it in some other way. Below is a very simple example, worked through step by step.
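Before the step-by-step walk-through, here is a rough sketch of the overall flow. It is purely illustrative: the http:// scheme in front of the list-page URL and the generic link pattern are my assumptions, not the article's code.

// 1. fetch the list page
$list = file_get_contents("http://www.jb51.net/article/11/index.htm");
// 2. grab the address and title of every link on it
preg_match_all("/<a\shref=\"(.+?)\">(.+?)<\/a>/i", $list, $links);
// 3. open each link in turn and pull out the parts we are interested in
foreach ($links[1] as $url) {
    $article = file_get_contents($url);
    // ... match the interesting parts here and save them to the database ...
}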

The first step is to pick a collection target, usually a column (category) list page. Here the target is //www.jb51.net/article/11/index.htm. This is a list page, and our goal is to collect all the articles on it.

Now that we have a list page, the first step is to open it and bring its contents into our program. We usually use fopen or file_get_contents; here we'll take fopen as the example. How do we open it? It's simple:
$source = fopen("//www.jb51.net/article/11/index.htm", "r");
With that, we have brought the page into our program. Note that the resulting $source is a resource, not text we can work with, so we use fread to read the content into a variable; that gives us text we can actually process. Example:
$content = fread($source, 99999);
The number is how many bytes to read; just give it a large enough value. If you write $content to a text file with file_put_contents, you can see that its contents really are the source code of the web page. Once we have the page source, we analyze it to find the article links inside it, which is where regular expressions come in [regular expression tutorial recommended: //www.jb51.net/article/7/all/545.1.htm]. Looking at the source, we can see that every article link looks like this:
<div class="in_arttitle"><a href="//www.jb51.net/article/10/all/273.1.htm">Encapsulate the database connection code in a function that calls..</a>
So we can write regular expressions.
$count = preg_match_all("/<div class=\"in_arttitle\"><a\shref=\"(.+?)\">(.+?)<\/a>/", $content, $art_list);
The array $art_list[1][$s] holds the link to each article, and $art_list[2][$s] holds its title. That's half the battle.
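Putting the steps so far together, a minimal sketch might look like this. The http:// scheme on the URL and the output file name are my assumptions; the rest follows the article.

// open the list page and read its source into a string
$source = fopen("http://www.jb51.net/article/11/index.htm", "r");
$content = fread($source, 99999);   // note: on a network stream fread may return less than asked for
fclose($source);

// optional: dump the source to a file so we can look at it (file name is just an example)
file_put_contents("list_source.txt", $content);

// pull out every article link and title
$count = preg_match_all("/<div class=\"in_arttitle\"><a\shref=\"(.+?)\">(.+?)<\/a>/", $content, $art_list);

// $art_list[1][$s] is the link, $art_list[2][$s] is the title
for ($s = 0; $s < $count; $s++) {
    echo $art_list[2][$s] . " => " . $art_list[1][$s] . "\n";
}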
Then use a for loop to open each link in turn and extract the content the same way we extracted the titles. Everything up to this point is similar to the tutorials I found online, but when it came to this loop the tutorials were really poor; I could not find a single article that explained it clearly. At first I used js to help drive the loop. Here is an example.
for ($i = 0; $i < 20; $i++) {
    // the collection code for one article goes here (omitted)
    // collect one page, then move on to the next
}
But when you try to use fopen inside this loop to open link after link, it does not work: the requests start failing, and driving the loop with js is no good either. In the end the trick is to output a meta refresh instead:
echo "<meta http-equiv='refresh' content='0;url=aa.php?id=1'>";
Here aa.php is the file name of our own program, and the number after id lets us move from one page to the next, so we can collect multiple pages. That is the key to really looping.
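A minimal sketch of that refresh-driven loop, assuming the links gathered from the list page have already been saved one per line to a file: the file name links.txt and the rest of the scaffolding are my own placeholders, while aa.php and the id parameter come from the article.

// aa.php - collect one article per request, then refresh to the next id
$id = isset($_GET['id']) ? (int)$_GET['id'] : 0;

// the links gathered from the list page, stored one per line
$links = file("links.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

if ($id < count($links)) {
    // fetch and process a single article on this request
    $page = file_get_contents($links[$id]);
    // ... run the regular expressions for the article content and save it ...

    // then tell the browser to call this script again with the next id
    $next = $id + 1;
    echo "<meta http-equiv='refresh' content='0;url=aa.php?id=$next'>";
} else {
    echo "All pages collected.";
}

Because each request collects only one page, the work is spread over many short requests instead of one long-running loop, which sidesteps the request failures the plain for loop runs into.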
My brain is a bit fried and this is written a bit messily, but it should be readable. To an expert this may be nothing special, but for rookies like me it is very helpful.

