Writing a Python crawler from scratch: crawler definition and URL composition

  • 2020-04-02 14:20:05
  • OfStack

One, the definition of a web crawler

"Web crawler", also known as "web spider", is a vivid name.
If you picture the Internet as a spider's web, the spider is a program that crawls around on that web.
A web spider finds web pages through their link addresses.
It starts from one page of a site (usually the home page), reads the content of that page, and finds the other links in it.
It then uses those links to find the next pages, and so on, until every page of the site has been crawled.
If you treat the entire Internet as a single website, a web spider can use the same principle to crawl every page on the Internet.
Seen this way, a web crawler is simply a program that fetches web pages.
Its basic operation is to fetch a web page.
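
The crawl loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the site is a hypothetical in-memory dictionary (FAKE_SITE) rather than a real server, and the names LinkCollector and crawl are made up for this example. It shows the idea of starting from one page, collecting its links, and following them until no unvisited page remains.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": each address maps to the HTML of that page.
FAKE_SITE = {
    "/": '<a href="/news">News</a> <a href="/about">About</a>',
    "/news": '<a href="/">Home</a> <a href="/news/1">Story 1</a>',
    "/about": '<a href="/">Home</a>',
    "/news/1": '<a href="/news">Back</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch):
    """Visit pages breadth-first, starting from `start`.

    `fetch` is any callable that returns the HTML for an address,
    or None when the page does not exist.
    """
    seen = {start}          # addresses already queued, to avoid loops
    queue = deque([start])  # addresses waiting to be read
    visited = []            # pages actually read, in order
    while queue:
        address = queue.popleft()
        html = fetch(address)
        if html is None:
            continue
        visited.append(address)
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/", FAKE_SITE.get))
# prints ['/', '/news', '/about', '/news/1']
```

A real crawler would replace FAKE_SITE.get with a function that downloads the page over the network; the loop itself stays the same.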
So how do you get the page you want?
Let's start with the URL.

Two, the process of browsing the web

Crawling a web page is essentially the same process as viewing it in a browser such as Internet Explorer.
Say you type the address http://www.baidu.com into your browser's address bar.
To open the page, the browser acts as the "client" and sends a request to the server, fetches the server's files to the local machine, and then interprets and displays them.
HTML is a markup language: it uses tags to mark up content so that it can be parsed and told apart.
The browser's job is to parse the HTML it receives and turn that source code into the page we actually see.
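
The request and response exchange described above can be tried without touching the Internet. The sketch below is an assumption-laden illustration, not how any particular browser works: it starts a tiny local server from Python's standard library on 127.0.0.1, then plays the role of the "client" with urllib.request. The names PAGE, DemoHandler, and fetch_demo_page are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The HTML document our demo server hands out for every request.
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with the same HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_demo_page():
    """Client side: send a request and pull the HTML to the local machine."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any proxy settings so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as response:
            status = response.status
            content_type = response.headers.get("Content-Type")
            html = response.read().decode("utf-8")
        return status, content_type, html
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_demo_page())
```

What comes back is just the raw HTML text; a browser would go on to parse and render it, while a crawler would go on to extract the data or links it needs.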

Three, concepts and examples of URIs and URLs

In simple terms, a URL is the string you type into the browser, such as http://www.baidu.com.
Before you can understand URLs, you first need to understand the concept of a URI.
What is a URI?
Every resource available on the Web, such as an HTML document, image, video clip, or program, is identified by a Uniform Resource Identifier (URI).
A URI usually consists of three parts:
(1) the naming mechanism used to access the resource;
(2) the host name of the machine that stores the resource;
(3) the name of the resource itself, represented by the path.
Take the following URI:
http://www.why.com.cn/myhtml/html1223/
We can read it this way:
(1) this is a resource that can be accessed through the HTTP protocol,
(2) it is located on the host www.why.com.cn,
(3) and it is accessed through the path /myhtml/html1223/.
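
As a quick check of the three-part reading, Python's standard urllib.parse module can split the example URI from this section. This is only a small sketch; the variable names are chosen for illustration.

```python
from urllib.parse import urlsplit

uri = "http://www.why.com.cn/myhtml/html1223/"
parts = urlsplit(uri)

naming_mechanism = parts.scheme  # (1) how the resource is accessed
host_name = parts.hostname       # (2) the host that stores the resource
resource_path = parts.path       # (3) the name of the resource itself

print(naming_mechanism)  # http
print(host_name)         # www.why.com.cn
print(resource_path)     # /myhtml/html1223/
```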

Four, understanding URLs, with examples

URLs are a subset of URIs. URL is an abbreviation of "Uniform Resource Locator".
Generally speaking, a URL is a string that describes an information resource on the Internet, and it is mainly used by WWW client and server programs.
URLs describe all kinds of information resources, including files, server addresses, and directories, in one uniform format.
The general format of a URL is (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query][#fragment]
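
To see how the general format maps onto a concrete string, the sketch below parses a made-up URL with urllib.parse.urlparse from the standard library. The URL (on www.example.com, with port 8080) is purely hypothetical and was chosen only because it contains every component of the format.

```python
from urllib.parse import urlparse

# Hypothetical URL that uses every part of the general format.
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en#intro"
result = urlparse(url)

components = {
    "protocol": result.scheme,
    "hostname": result.hostname,
    "port": result.port,
    "path": result.path,
    "parameters": result.params,
    "query": result.query,
    "fragment": result.fragment,
}

for name, value in components.items():
    print(f"{name:>10}: {value}")
```

Note that urlparse returns the port as an integer, and returns None for it when the URL does not specify one.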

The URL format consists of three parts:
The first part is the protocol (also called the service mode).
The second part is the host name or IP address of the machine where the resource is stored (sometimes with a port number).
The third part is the specific address of the resource on that host, such as the directory and file name.
The first part is separated from the second by the :// symbol.
The second and third parts are separated by the / symbol.
The first and second parts are required; the third part can sometimes be omitted.
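
The three-part rule can also be run in the other direction, assembling a URL from its parts. The helper below, build_url, is a hypothetical name for this sketch; it relies on urllib.parse.urlunsplit and enforces the rule that the protocol and host are required while the path is optional.

```python
from urllib.parse import urlunsplit


def build_url(protocol, host, path=""):
    """Join protocol, host, and an optional path into one URL string."""
    if not protocol or not host:
        raise ValueError("protocol and host are both required")
    # urlunsplit inserts "://" after the protocol and "/" before the path.
    return urlunsplit((protocol, host, path, "", ""))


print(build_url("http", "www.baidu.com"))
# prints http://www.baidu.com
print(build_url("http", "www.peopledaily.com.cn", "/channel/welcome.htm"))
# prints http://www.peopledaily.com.cn/channel/welcome.htm
```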

Five, a simple comparison of URLs and URIs

A URI is a more general, lower-level abstraction than a URL: a standard for identifying resources with a text string.
In other words, the URI is the parent class and the URL is a subclass of it. URLs are a subset of URIs.
URI stands for Uniform Resource Identifier;
URL stands for Uniform Resource Locator.
The difference is that a URI only identifies a resource,
whereas a URL also states how to access that resource (for example, http://).

Let's look at two small examples of URLs.

1. URL examples for the HTTP protocol:
These are resources served as hypertext using the HyperText Transfer Protocol (HTTP).
Example: http://www.peopledaily.com.cn/channel/welcome.htm
The computer's domain name is www.peopledaily.com.cn.
The hypertext file (file type .html) is welcome.htm under the directory /channel.
This is a computer belonging to China's People's Daily.
Example: http://www.rol.cn.net/talk/talk1.htm
The computer's domain name is www.rol.cn.net.
The hypertext file (file type .html) is talk1.htm under the directory /talk.
This is the address of the Reed chat room, from which you can enter room 1 of the chat room.
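
The same breakdown of the two HTTP examples (domain name, directory, file name) can be reproduced with the standard library. This is a small sketch; describe is a hypothetical helper name, and posixpath.split is used because URL paths always use forward slashes.

```python
import posixpath
from urllib.parse import urlsplit


def describe(url):
    """Return (domain name, directory, file name) for an HTTP URL."""
    parts = urlsplit(url)
    directory, file_name = posixpath.split(parts.path)
    return parts.hostname, directory, file_name


print(describe("http://www.peopledaily.com.cn/channel/welcome.htm"))
# prints ('www.peopledaily.com.cn', '/channel', 'welcome.htm')
print(describe("http://www.rol.cn.net/talk/talk1.htm"))
# prints ('www.rol.cn.net', '/talk', 'talk1.htm')
```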

2. URL of a file:
When a URL represents a file, the protocol (service mode) is written as file, followed by the host IP address, the file's access path (that is, its directory), the file name, and other information.
Sometimes the directory and file name can be omitted, but the / symbol cannot.
Example: (link: #)
The URL above represents a file in the pub/files/ directory on the host (link: #) with the file name foobar.txt.
Example: (link: #)
Represents the directory /pub on the host (link: #).
Example: (link: #)
Represents the root directory of the host (link: #).

The main thing a crawler works on is the URL: it fetches the required file content according to the URL address and then processes it further.
An accurate understanding of URLs is therefore crucial to understanding web crawlers.

OK, that's enough of the basics. Next, let's do some actual crawling.

