How a Python crawler works

  • 2020-05-26 09:32:08
  • OfStack

1. Working principle of the crawler

A web crawler, also called a web spider, has a very vivid name. If the Internet is compared to a spider's web, the spider is the program crawling around on it. A web spider finds web pages by their URLs. It starts from one page of a website (usually the home page), reads the content of that page, finds the links it contains, and follows those links to the next pages, repeating this loop until every page of the site has been crawled. If the entire Internet is treated as one website, a web spider can use the same principle to crawl all the web pages on the Internet. The basic operation of a web crawler, then, is fetching web pages. So how does it get exactly the page you want? Let's start with the URL.
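The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production crawler: the names LinkExtractor, crawl and FAKE_SITE are made up for this example, and a small in-memory dictionary stands in for real HTTP requests so that the code runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit every page reachable from start_url, breadth first.

    `fetch` is any callable that maps a URL to HTML text, or returns
    None when the page is unavailable.
    """
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.com/b.html": '<a href="/a.html#top">A</a>',
}

order = crawl("http://example.com/", FAKE_SITE.get)
print(order)
```

The set of seen URLs is what stops the loop: pages link back to each other, so without it the crawler would never finish. A real crawler would also usually restrict itself to one host and respect the site's crawling rules.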

The process of crawling a web page is similar to what happens when a reader browses the web with a browser such as IE. Suppose you type www.baidu.com into the browser's address bar. Opening the page means that the browser, acting as a "client", sends a request to the server, fetches the server's files to the local machine, and then interprets and displays them. HTML is a markup language that uses tags to mark up content so that it can be parsed and distinguished. The browser's job is to parse the HTML code it receives and turn that source code into the web page we see.
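A crawler plays the same "client" role as the browser, except that it keeps the raw HTML instead of rendering it. The following is a minimal sketch using only the standard library; the function name fetch_html and the User-Agent string are assumptions made for this example.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Send a request for `url` and return the response body as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("http://www.baidu.com") would return the page source as a string, provided the machine has network access. What the crawler does with that string (extracting links, text, or data) is a separate step.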

Simply put, a URL is the string you type into the browser, such as http://www.baidu.com. Before you can understand URLs, you need to understand the concept of a URI.

What is a URI?

Each resource available on the Web, such as an HTML document, image, video clip, or program, is located by a Uniform Resource Identifier (URI).

A URI usually consists of three parts:

the naming mechanism used to access the resource; the name of the host where the resource is stored; and the name of the resource itself, represented by a path.

Take the following URI as an example: http://www.why.com.cn/myhtml/html1223/

This is a resource that can be accessed through the HTTP protocol, is located on the host www.why.com.cn, and is reached via the path "/myhtml/html1223/".
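The three parts can be read off programmatically. As a small illustration, the standard library function urllib.parse.urlsplit breaks the example URI into its access mechanism, host, and path:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.why.com.cn/myhtml/html1223/")

print(parts.scheme)  # naming mechanism used to access the resource
print(parts.netloc)  # host where the resource is stored
print(parts.path)    # name of the resource itself, as a path
```

Here the scheme is "http", the host is "www.why.com.cn", and the path is "/myhtml/html1223/".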

2. Understanding and examples of URL

URL is a subset of URI. URL stands for Uniform Resource Locator. In plain terms, a URL is a string that describes an information resource on the Internet, and it is used mainly by WWW client and server programs. A URL can describe many kinds of information resources in a uniform format, including files, server addresses, and directories. The general format of a URL is (parts in square brackets [] are optional):

protocol://hostname[:port]/path/[;parameters][?query]#fragment

The format of a URL consists of three parts:

Part 1 is the protocol (or service mode). Part 2 is the IP address or domain name of the host where the resource is stored (sometimes including a port number). Part 3 is the specific address of the resource on the host, such as a directory and file name.

Part 1 and part 2 are separated by the "://" symbol, and part 2 and part 3 are separated by the "/" symbol. Parts 1 and 2 are indispensable, and part 3 can sometimes be omitted.
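To see how the general format maps onto a concrete string, the sketch below parses a URL that uses every field. The URL itself is hypothetical and was chosen only so that no field is empty; urlparse and urlunparse are standard library functions.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical URL that exercises every field of the general format.
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en#intro"
parsed = urlparse(url)

print(parsed.scheme)    # protocol
print(parsed.hostname)  # hostname
print(parsed.port)      # port, as an integer
print(parsed.path)      # path
print(parsed.params)    # parameters
print(parsed.query)     # query
print(parsed.fragment)  # fragment

# The fields can be joined back into the original string.
rebuilt = urlunparse(parsed)

# Part 3 may be omitted: the path of a bare host URL is empty.
bare = urlparse("http://www.example.com")
print(repr(bare.path))
```

For a crawler, this kind of parsing is typically what decides whether two links point at the same host and which part of the link identifies the page.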

3. A brief comparison of URL and URI

URI is the more general abstraction underlying URL, a standard for identifier strings. In other words, URI is the parent category and URL is a subset of it. URI stands for Uniform Resource Identifier; URL stands for Uniform Resource Locator. The difference between the two is that a URI identifies a resource, for example by the path requested from a server, whereas a URL also states how to access that resource (for example, via http://).

Let's look at two small examples of URLs.

1. Example of an HTTP URL:

Such URLs identify hypertext resources served over the Hypertext Transfer Protocol (HTTP).

Example: http://www.peopledaily.com.cn/channel/welcome.htm

The domain name of the host is www.peopledaily.com.cn.

The hypertext file (file type .html) is welcome.htm in the directory /channel.

This is a host belonging to People's Daily of China.

Example: http://www.rol.cn.net/talk/talk1.htm

The domain name of the host is www.rol.cn.net.

The hypertext file (file type .html) is talk1.htm in the directory /talk.

This is the address of the reed chat room, from which you can enter room 1 of the reed chat room.
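The host, directory, and file name given for the two HTTP examples can be extracted mechanically. The helper name describe below is made up for this sketch; it combines urlsplit with posixpath.split from the standard library.

```python
import posixpath
from urllib.parse import urlsplit


def describe(url):
    """Return (host, directory, file name) for an HTTP URL."""
    parts = urlsplit(url)
    # URL paths always use "/", so posixpath is the right splitter.
    directory, filename = posixpath.split(parts.path)
    return parts.hostname, directory, filename


people = describe("http://www.peopledaily.com.cn/channel/welcome.htm")
talk = describe("http://www.rol.cn.net/talk/talk1.htm")
print(people)
print(talk)
```

The first call yields the host www.peopledaily.com.cn, the directory /channel and the file welcome.htm; the second yields www.rol.cn.net, /talk and talk1.htm.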

2. File URLs

When a URL refers to a file, the protocol (service mode) is written as file, followed by the host IP address or domain name, the access path of the file (that is, the directory), and the file name.

Sometimes the directory and file name can be omitted, but the "/" symbol cannot.

Example: file://ftp.yoyodyne.com/pub/files/foobar.txt

The above URL represents a file in the pub/files/ directory on the host ftp.yoyodyne.com, and the file name is foobar.txt.

Example: file://ftp.yoyodyne.com/pub

This represents the directory /pub on the host ftp.yoyodyne.com.

Example: file://ftp.yoyodyne.com/

This represents the root directory of the host ftp.yoyodyne.com.
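The same parsing applies to the three file URLs above. As a quick check, this sketch splits each one into protocol, host, and path; note how the path shrinks to "/" when the directory and file name are omitted.

```python
from urllib.parse import urlsplit

examples = [
    "file://ftp.yoyodyne.com/pub/files/foobar.txt",
    "file://ftp.yoyodyne.com/pub",
    "file://ftp.yoyodyne.com/",
]

# One (protocol, host, path) tuple per example URL.
summary = [(p.scheme, p.netloc, p.path) for p in map(urlsplit, examples)]
for row in summary:
    print(row)
```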

The main object a crawler works with is the URL. It fetches the required file content from the address the URL gives, and then processes that content further.

Therefore, an accurate understanding of URL is critical to understanding web crawlers.

