Basic knowledge used by the Scrapy framework

  • 2020-12-21 18:06:33
  • OfStack

Scrapy is an asynchronous processing framework built on Twisted, and it is highly extensible. Its advantages will not be repeated here.

Here is some conceptual background to help you understand Scrapy.

1. Data flow

To master the framework, you must first understand how data flows through it. In summary:

1. The engine first opens the target website and obtains the initial URLs to crawl.

2. The engine wraps each URL in a Request and hands it to the Scheduler for scheduling.

3. The engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next Request to the engine, which forwards it to the Downloader through the Downloader Middlewares.

5. The Downloader fetches the page, generates a Response, and sends it back to the engine through the Downloader Middlewares.

6. The engine receives the Response and sends it to the Spider for processing through the Spider Middlewares.

7. The Spider processes the Response and returns extracted items and new Requests to the engine.

8. The engine sends the items processed by the Spider to the Item Pipeline and the new Requests to the Scheduler, and the cycle repeats.
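
To make the flow concrete, here is a minimal spider sketch. The site, selectors, and field names are illustrative assumptions, not part of the original article; the comments map each part to the steps above.

import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical example: the name, URL, and CSS selectors are for illustration only.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # steps 1-2: scheduled as Requests

    def parse(self, response):
        # step 7: the Spider processes the Response delivered by the engine
        for quote in response.css("div.quote"):
            # step 8: items yielded here go to the Item Pipeline
            yield {"text": quote.css("span.text::text").get()}

        # step 8: new Requests yielded here go back to the Scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)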

2. Role of each component

Downloader Middleware

When the Scheduler takes a Request out of the queue and sends it to the Downloader for download, that Request passes through the Downloader Middleware.

It acts at two points:

Before a Request taken from the Scheduler is sent to the Downloader, and after the download produces a Response, before that Response is sent to the Spider.

It has three core methods:

process_request(request, spider)

Called before the Request is sent to the Downloader.

Parameters:

request: the Request object being processed. spider: the Spider object that this Request belongs to.

Return values:

1. Returning None: the process_request() methods of the other middlewares continue to be called, until the Request is finally executed by the Downloader and turned into a Response.

2. Returning a Response object: the process_request() and process_exception() methods of lower-priority middlewares are not called.

3. Returning a Request object: the process_request() methods of lower-priority middlewares stop executing, and the new Request is sent back to be scheduled.
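
As a minimal sketch (the class name, header value, and settings path are assumptions for illustration), a downloader middleware that modifies every Request before it reaches the Downloader could look like this:

class CustomHeadersMiddleware:
    # Hypothetical downloader middleware: sets a header before the Request
    # reaches the Downloader.

    def process_request(self, request, spider):
        request.headers.setdefault(b"User-Agent", b"my-crawler/1.0")
        # Returning None lets the remaining process_request() methods run and
        # the Request continue on to the Downloader.
        return None

It would be enabled through DOWNLOADER_MIDDLEWARES in settings.py, for example {"myproject.middlewares.CustomHeadersMiddleware": 543}; the module path and priority number are placeholders.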

process_response(request, response, spider)

Where it acts:

After the Downloader executes the Request, it produces the corresponding Response. Before the Scrapy engine sends that Response to the Spider for parsing, it calls this method to process the Response.

Return values:

1. Returning a Request object: the process_response() methods of lower-priority middlewares are not called, and the new Request is sent back to be scheduled.

2. Returning a Response object: the process_response() methods of lower-priority middlewares continue to be called.
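
A minimal sketch of both return paths (the status code, meta key, and class name are assumptions for illustration):

class RetryOn403Middleware:
    # Hypothetical downloader middleware: re-schedules a Request once when the
    # Downloader returns a 403; otherwise passes the Response straight through.

    def process_response(self, request, response, spider):
        if response.status == 403 and not request.meta.get("retried_403"):
            new_request = request.replace(dont_filter=True)
            new_request.meta["retried_403"] = True
            # Returning a Request stops lower-priority process_response() calls
            # and sends the new Request back to the Scheduler.
            return new_request
        # Returning the Response lets the remaining middlewares keep processing it.
        return response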

process_exception(request, exception, spider)

This method is called when the Downloader or a process_request() method raises an exception; it is mainly used for exception handling.
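
A minimal sketch (the exception types and class name are assumptions; returning a Request re-schedules it, returning None lets the exception continue down the chain):

from twisted.internet.error import ConnectionRefusedError, TimeoutError


class RetryOnNetworkErrorMiddleware:
    # Hypothetical downloader middleware: turns selected network exceptions
    # into a retry of the same Request.

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, ConnectionRefusedError)):
            spider.logger.warning("Retrying %s after %r", request.url, exception)
            return request.replace(dont_filter=True)
        return None  # let other middlewares / default handling see the exception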

Spider Middleware

Where it acts:

After the Downloader generates a Response and before that Response is sent to the Spider, it is processed by the Spider Middleware.

Core methods:

process_spider_input(response, spider)

Return values:

1. Returning None

Scrapy continues processing the Response, calling the process_spider_input() methods of all the other Spider Middlewares, until the Spider finally processes it.

2. Raising an exception

Scrapy calls the Request's errback() directly, and the errback output is handled in turn by process_spider_output().
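
A minimal sketch of both cases (the emptiness check and class name are assumptions for illustration):

class DropEmptyResponsesMiddleware:
    # Hypothetical spider middleware: rejects empty responses before the
    # Spider callback ever sees them.

    def process_spider_input(self, response, spider):
        if not response.body:
            # Raising skips the Spider callback; Scrapy falls back to the
            # Request's errback instead.
            raise ValueError(f"Empty response from {response.url}")
        return None  # pass the Response on to the next middleware / the Spider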

process_spider_output(response, result, spider)

Called with the result (items and Requests) that the Spider returns after processing the Response.
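
A minimal sketch (the "text" field is an assumption, chosen to match the earlier spider example):

class FilterEmptyItemsMiddleware:
    # Hypothetical spider middleware: filters the Spider's output, dropping
    # dict items whose "text" field is empty and passing everything else through.

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, dict) and not element.get("text"):
                continue  # drop the item
            yield element  # Requests and valid items continue down the chain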

process_spider_exception(response, exception, spider)

Return values:

1. Returning None: Scrapy continues processing the exception, calling the process_spider_exception() methods of the remaining middlewares.

2. Returning an iterable object: the process_spider_output() methods of the remaining middlewares are called instead.
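
A minimal sketch (logging and returning an empty iterable are illustrative choices):

class LogCallbackErrorsMiddleware:
    # Hypothetical spider middleware: logs a callback failure and swallows it.

    def process_spider_exception(self, response, exception, spider):
        spider.logger.error("Callback failed for %s: %r", response.url, exception)
        # Returning an iterable (here empty) hands control to the
        # process_spider_output() chain; returning None would let other
        # middlewares keep processing the exception.
        return []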

process_start_requests(start_requests, spider)

Called with the start Requests of the Spider as its parameter; it must return an iterable containing only Request objects.
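
A minimal sketch (the meta key is an assumption for illustration):

class TagStartRequestsMiddleware:
    # Hypothetical spider middleware: marks every start Request so later
    # components can distinguish seeds from follow-up Requests.

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            request.meta["is_start_request"] = True
            yield request  # must yield only Request objects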

Conclusion

Understanding the data flow and the roles of the Downloader Middleware and Spider Middleware is the foundation for customizing Scrapy's crawling behaviour.

