Usage of Spider in Scrapy of Python crawler framework

  • 2021-11-13 08:25:21
  • OfStack

Usage of Spider in Scrapy

The Spider class defines how to crawl a Web site (or a group of sites). This includes the crawling actions (e.g. whether to follow links) and how to extract structured data (items) from the page content. In other words, Spider is where you define the crawling actions and analyze a page (or pages).

For a spider, the crawling loop goes roughly as follows:

1. Initialize Requests with the initial URLs and set a callback function. When a request has been downloaded, a response is generated and passed as an argument to that callback. The initial requests of a spider are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each, with parse as the callback function.

2. Parse the returned content inside the callback function and return Item objects, Request objects, or an iterable container of both. Any returned Request objects are then processed by Scrapy: the corresponding content is downloaded and the configured callback is called (it may be the same function).

3. Within the callback function, you can use selectors (Selectors) (or BeautifulSoup, lxml, or any parser you prefer) to parse the Web page content and generate items from the parsed data.

4. Finally, the items returned by the spider are stored in a database (processed by an Item Pipeline) or written to a file using Feed exports.

Although this loop applies (to some extent) to any type of spider, Scrapy still provides several default spiders for different needs. These spiders will be discussed later.
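To make the loop concrete, here is a minimal sketch of a spider that follows these four steps; the spider name, the quotes.toscrape.com URLs and the CSS selectors are illustrative assumptions, not something prescribed by the steps above:


import scrapy


class LoopExampleSpider(scrapy.Spider):
    """ Minimal sketch of the crawl loop described above """
    name = 'loop_example'
    # Step 1: Scrapy reads start_urls and issues the initial Requests,
    # with parse() as the default callback.
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Step 3: use selectors to extract data and build items.
        for quote in response.css('.quote'):
            yield {'text': quote.css('.text::text').extract_first()}

        # Step 2: also return further Requests; Scrapy downloads them and
        # calls the given callback (here the same parse method).
        next_page = response.css('.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)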

Spider

scrapy.Spider is the simplest spider. Every other spider must inherit from this class (including the spiders bundled with Scrapy as well as your own). It simply requests the given start_urls/start_requests and calls the spider's parse method on the resulting responses.

name

A string that defines the name of the spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. However, nothing prevents you from creating multiple instances of the same spider. name is the most important spider attribute, and it is required.

If the spider crawls a single Web site (single domain), common practice is to name the spider after that site (domain), with or without a suffix. For example, a spider that crawls mywebsite.com would typically be named mywebsite.

allowed_domains

Optional. A list of the domain names that the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domains are not in this list will not be followed.

start_urls

A list of URLs. When no particular URLs are specified, the spider starts crawling from this list; the first page fetched will therefore be the first URL in the list. Subsequent URLs are extracted from the fetched data.
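Putting these three attributes together, a spider for mywebsite.com might declare them as follows (a sketch; the class name and URLs are made up for illustration):


import scrapy


class MywebsiteSpider(scrapy.Spider):
    # Unique name used by Scrapy (e.g. "scrapy crawl mywebsite") to locate the spider.
    name = 'mywebsite'
    # With OffsiteMiddleware enabled, links outside these domains are not followed.
    allowed_domains = ['mywebsite.com']
    # Crawling starts from these URLs when no other requests are specified.
    start_urls = ['http://mywebsite.com/']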

start_requests()

This method must return an iterable object containing the first Requests the spider uses to crawl.

This method is called when the spider starts crawling and no particular URLs are specified. When URLs are specified, make_requests_from_url() is called to create the Request objects. This method is called only once by Scrapy, so it is safe to implement it as a generator.

The default implementation of this method generates Requests from the URLs in start_urls.
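Conceptually, the default behaves roughly like the sketch below (a simplified stand-in, not Scrapy's exact source):


def start_requests(self):
    # A simplified stand-in for the default: one Request per URL in
    # start_urls, handled by self.parse unless another callback is set.
    for url in self.start_urls:
        yield scrapy.Request(url)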

If you want to modify the Requests used to start crawling a site, you can override this method. For example, if you need to log in to a Web site with a POST request at startup, you could write:


def start_requests(self):
    return [scrapy.FormRequest("http://www.example.com/login",
                               formdata={'user': 'john', 'pass': 'secret'},
                               callback=self.logged_in)]

def logged_in(self, response):
    ## here you would extract links to follow and return Requests for
    ## each of them, with another callback
    pass

parse

This is the default method Scrapy uses to process a downloaded response when the response's Request does not specify a callback function.

parse processes the response and returns extracted data and/or follow-up URLs (as Requests). Other Request callback functions are subject to the same requirements.

This method, like any other Request callback, must return an iterable object containing Requests and/or Items.

Parameter: response (the response to parse)
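As an example, a parse callback can yield extracted items together with follow-up Requests that use a different callback; the selectors and the author-page structure below are assumptions for illustration only:


def parse(self, response):
    # Yield structured data extracted from the listing page.
    for quote in response.css('.quote'):
        yield {'text': quote.css('.text::text').extract_first()}

    # Also yield follow-up Requests, handled by a different callback.
    for href in response.css('.quote span a::attr(href)').extract():
        yield response.follow(href, callback=self.parse_author)

def parse_author(self, response):
    # Each author page is parsed here and yields its own item.
    yield {'name': response.css('h3.author-title::text').extract_first()}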

Startup mode

start_urls

start_urls is a list

start_requests

Override start_urls by implementing start_requests() and sending the requests yourself with scrapy.Request():


def start_requests(self):
    """ Rewrite the start_urls rules """
    yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', callback=self.parse)

scrapy.Request

scrapy.Request creates a request object; you normally pass it a callback function that will handle the downloaded response (if none is given, parse is used).
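Besides url and callback, scrapy.Request accepts further keyword arguments such as meta; a small sketch (the page URL and the meta key are illustrative):


def start_requests(self):
    yield scrapy.Request(
        url='http://quotes.toscrape.com/page/2/',
        callback=self.parse,   # called with the downloaded response
        meta={'page': 2},      # extra data, later available as response.meta['page']
    )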

Data preservation

You can use the -o option to save data in a common format (chosen by the file suffix).
The following formats are supported:

json, jsonlines, jl, csv, xml, marshal, pickle

Usage:


scrapy crawl quotes2 -o a.json
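Changing the suffix selects a different export format, for example:


scrapy crawl quotes2 -o a.csv
scrapy crawl quotes2 -o a.jl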

Example: a sample Spider


# -*- coding: utf-8 -*-

import scrapy


class Quotes2Spider(scrapy.Spider):
    name = 'quotes2'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            auth = quote.css('.author::text').extract_first()
            tags = quote.css('.tags a::text').extract()
            yield dict(text=text, auth=auth, tags=tags)
