Basic knowledge used by the Scrapy framework

  • 2020-12-21 18:06:33
  • OfStack

Scrapy is an asynchronous processing framework built on Twisted, and it is highly extensible. Its advantages will not be repeated here.

Here is some conceptual background to help you understand Scrapy.

1. Data flow

To master the framework, you must first understand how data flows through it. In summary:

1. The engine first opens the target website and obtains the initial URLs to crawl.

2. The engine wraps each URL in a Request and hands it to the Scheduler for scheduling.

3. The engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next Request to the engine, which forwards it to the Downloader through the Downloader Middlewares.

5. The Downloader fetches the page, generates a Response, and sends it back to the engine through the Downloader Middlewares.

6. The engine receives the Response and sends it to the Spider for processing through the Spider Middlewares.

7. The Spider processes the Response and returns extracted items and new Requests to the engine.

8. The engine sends the items processed by the Spider to the Item Pipeline and the new Requests to the Scheduler, and the cycle repeats.
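
To make the flow concrete, here is a minimal spider sketch. The site, selectors, and field names are illustrative assumptions, not part of the original article; the comments map each part to the steps above.

import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical example: the name, URL, and CSS selectors are for illustration only.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # steps 1-2: scheduled as Requests

    def parse(self, response):
        # step 7: the Spider processes the Response delivered by the engine
        for quote in response.css("div.quote"):
            # step 8: items yielded here go to the Item Pipeline
            yield {"text": quote.css("span.text::text").get()}

        # step 8: new Requests yielded here go back to the Scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)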

2. Role of each component

Downloader Middleware

When the Scheduler takes a Request out of the queue and sends it to the Downloader for download, that Request passes through the Downloader Middleware.

It acts at two points:

Before a Request taken from the Scheduler is sent to the Downloader, and after the download produces a Response, before that Response is sent to the Spider.

It has three core methods:

process_request(request, spider)

Called before the Request is sent to the Downloader.

Parameters:

request: the Request object being processed. spider: the Spider object that this Request belongs to.

Return values:

1. Returning None: the process_request() methods of the other middlewares continue to be called, until the Request is finally executed by the Downloader and turned into a Response.

2. Returning a Response object: the process_request() and process_exception() methods of lower-priority middlewares are not called.

3. Returning a Request object: the process_request() methods of lower-priority middlewares stop executing, and the new Request is sent back to be scheduled.
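
As a minimal sketch (the class name, header value, and settings path are assumptions for illustration), a downloader middleware that modifies every Request before it reaches the Downloader could look like this:

class CustomHeadersMiddleware:
    # Hypothetical downloader middleware: sets a header before the Request
    # reaches the Downloader.

    def process_request(self, request, spider):
        request.headers.setdefault(b"User-Agent", b"my-crawler/1.0")
        # Returning None lets the remaining process_request() methods run and
        # the Request continue on to the Downloader.
        return None

It would be enabled through DOWNLOADER_MIDDLEWARES in settings.py, for example {"myproject.middlewares.CustomHeadersMiddleware": 543}; the module path and priority number are placeholders.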

process_response(request, response, spider)

Where it acts:

After the Downloader executes the Request, it produces the corresponding Response. Before the Scrapy engine sends that Response to the Spider for parsing, it calls this method to process the Response.

Return values:

1. Returning a Request object: the process_response() methods of lower-priority middlewares are not called, and the new Request is sent back to be scheduled.

2. Returning a Response object: the process_response() methods of lower-priority middlewares continue to be called.
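
A minimal sketch of both return paths (the status code, meta key, and class name are assumptions for illustration):

class RetryOn403Middleware:
    # Hypothetical downloader middleware: re-schedules a Request once when the
    # Downloader returns a 403; otherwise passes the Response straight through.

    def process_response(self, request, response, spider):
        if response.status == 403 and not request.meta.get("retried_403"):
            new_request = request.replace(dont_filter=True)
            new_request.meta["retried_403"] = True
            # Returning a Request stops lower-priority process_response() calls
            # and sends the new Request back to the Scheduler.
            return new_request
        # Returning the Response lets the remaining middlewares keep processing it.
        return response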

process_exception(request, exception, spider)

This method is called when the Downloader or a process_request() method raises an exception; it is mainly used for exception handling.
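
A minimal sketch (the exception types and class name are assumptions; returning a Request re-schedules it, returning None lets the exception continue down the chain):

from twisted.internet.error import ConnectionRefusedError, TimeoutError


class RetryOnNetworkErrorMiddleware:
    # Hypothetical downloader middleware: turns selected network exceptions
    # into a retry of the same Request.

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, ConnectionRefusedError)):
            spider.logger.warning("Retrying %s after %r", request.url, exception)
            return request.replace(dont_filter=True)
        return None  # let other middlewares / default handling see the exception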

Spider Middleware

Where it acts:

After the Downloader generates a Response and before that Response is sent to the Spider, it is processed by the Spider Middleware.

Core methods:

process_spider_input(response, spider)

Return values:

1. Returning None

Scrapy continues processing the Response, calling the process_spider_input() methods of all the other Spider Middlewares, until the Spider finally processes it.

2. Raising an exception

Scrapy calls the Request's errback() directly, and the errback output is handled in turn by process_spider_output().
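
A minimal sketch of both cases (the emptiness check and class name are assumptions for illustration):

class DropEmptyResponsesMiddleware:
    # Hypothetical spider middleware: rejects empty responses before the
    # Spider callback ever sees them.

    def process_spider_input(self, response, spider):
        if not response.body:
            # Raising skips the Spider callback; Scrapy falls back to the
            # Request's errback instead.
            raise ValueError(f"Empty response from {response.url}")
        return None  # pass the Response on to the next middleware / the Spider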

process_spider_output(response, result, spider)

Called with the result (items and Requests) that the Spider returns after processing the Response.
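
A minimal sketch (the "text" field is an assumption, chosen to match the earlier spider example):

class FilterEmptyItemsMiddleware:
    # Hypothetical spider middleware: filters the Spider's output, dropping
    # dict items whose "text" field is empty and passing everything else through.

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, dict) and not element.get("text"):
                continue  # drop the item
            yield element  # Requests and valid items continue down the chain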

process_spider_exception(response, exception, spider)

Return values:

1. Returning None: Scrapy continues processing the exception, calling the process_spider_exception() methods of the remaining middlewares.

2. Returning an iterable object: the process_spider_output() methods of the remaining middlewares are called instead.
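
A minimal sketch (logging and returning an empty iterable are illustrative choices):

class LogCallbackErrorsMiddleware:
    # Hypothetical spider middleware: logs a callback failure and swallows it.

    def process_spider_exception(self, response, exception, spider):
        spider.logger.error("Callback failed for %s: %r", response.url, exception)
        # Returning an iterable (here empty) hands control to the
        # process_spider_output() chain; returning None would let other
        # middlewares keep processing the exception.
        return []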

process_start_requests(start_requests, spider)

Called with the start Requests of the Spider as its parameter; it must return an iterable containing only Request objects.
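
A minimal sketch (the meta key is an assumption for illustration):

class TagStartRequestsMiddleware:
    # Hypothetical spider middleware: marks every start Request so later
    # components can distinguish seeds from follow-up Requests.

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            request.meta["is_start_request"] = True
            yield request  # must yield only Request objects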

Conclusion

Understanding the data flow and the roles of the Downloader Middleware and Spider Middleware is the foundation for customizing Scrapy's crawling behaviour.

