Python crawler: Scrapy framework incremental crawler sample code

  • 2021-09-12 01:35:01
  • OfStack

Incremental crawlers in the Scrapy framework

1. Incremental crawler

When to use an incremental crawler:
When we browse certain websites, we notice that some of them regularly publish new data on top of what is already there. For example, some movie websites update their latest popular movies in real time. So when we run into this situation while crawling, do we have to re-run the whole program on a schedule just to pick up the new data? This is exactly what an incremental crawler is for.

2. Incremental crawlers

Concept:
Use the crawler program to monitor a website for data updates, so that only the data the website has updated is crawled.

How to do incremental crawling:
Before sending a request, check whether this URL has already been crawled
After parsing the content, check whether this content has already been crawled
When writing to the storage medium, check whether the content is already in the medium

The core of incremental crawling is deduplication.
Deduplication methods:
Store every URL generated during crawling in a Redis set. On the next crawl, check each candidate URL against that set: if the URL is already there, do not send the request; otherwise, send it (see the sketch after this list).
Give each piece of crawled content a unique identifier (a data fingerprint) and store that identifier in a Redis set. On the next crawl, before persisting the data, check whether its fingerprint is already in the set: if it is, do not store it; otherwise, store the content.
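
As a quick illustration of the first method, here is a minimal sketch that checks each candidate URL against a Redis set before sending the request. The spider name, the link XPath and the 'crawled_urls' key are illustrative assumptions, not part of the original example.


# Minimal sketch of URL-level deduplication: skip requests whose URL
# is already in the Redis set (spider name, XPath and key are illustrative)
import scrapy
from redis import Redis

class UrlIncrementSpider(scrapy.Spider):
  name = 'url_increment'
  start_urls = ['https://www.qiushibaike.com/text/']

  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.conn = Redis(host='127.0.0.1', port=6379)

  def parse(self, response):
    for href in response.xpath('//a[@class="contentHerf"]/@href').extract():
      url = response.urljoin(href)
      # sadd returns 1 only when the URL was not in the set yet,
      # i.e. it has not been requested on a previous run
      if self.conn.sadd('crawled_urls', url) == 1:
        yield scrapy.Request(url=url, callback=self.parse_detail)
      else:
        print('URL already crawled, skipping:', url)

  def parse_detail(self, response):
    # parse the newly discovered page here
    pass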

3. Examples

Crawler file


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from increment2_Pro.items import Increment2ProItem
import hashlib
class QiubaiSpider(CrawlSpider):
  name = 'qiubai'
  # allowed_domains = ['www.xxx.com']
  start_urls = ['https://www.qiushibaike.com/text/']

  rules = (
    Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
  )

  def parse_item(self, response):

    div_list = response.xpath('//div[@class="article block untagged mb15 typs_hot"]')
    conn = Redis(host='127.0.0.1',port=6379)
    for div in div_list:
      item = Increment2ProItem()
      item['content'] = div.xpath('.//div[@class="content"]/span//text()').extract()
      item['content'] = ''.join(item['content'])
      item['author'] = div.xpath('./div/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
      
      # Hash the crawled content and author to get a unique identifier (data fingerprint)
      source = item['content'] + item['author']
      hashvalue = hashlib.sha256(source.encode()).hexdigest()

      # sadd returns 1 only if this fingerprint was not already in the Redis set
      ex = conn.sadd('qiubai_hash', hashvalue)
      if ex == 1:
        yield item
      else:
        print('No updated data to crawl')


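
The spider above imports Increment2ProItem from the project's items module. A minimal items.py matching the fields used in parse_item (this file is not shown in the original post, but the two fields follow directly from the spider code) would be:


import scrapy

class Increment2ProItem(scrapy.Item):
  # fields populated by the spider's parse_item
  author = scrapy.Field()
  content = scrapy.Field()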

Pipeline file (optional; the spider works without it)


import json
from redis import Redis
class Increment2ProPipeline(object):
  conn = None
  def open_spider(self, spider):
    self.conn = Redis(host='127.0.0.1', port=6379)
  def process_item(self, item, spider):
    dic = {
      'author': item['author'],
      'content': item['content']
    }
    # redis-py cannot store a dict directly, so serialize it before pushing
    self.conn.lpush('qiubaiData', json.dumps(dic))
    print('Crawled one item, storing it......')
    return item
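
If the pipeline is used, it also has to be enabled in the project's settings.py. A typical entry, assuming the project package is named increment2_Pro as the spider's import suggests, looks like this:


# settings.py
ITEM_PIPELINES = {
  'increment2_Pro.pipelines.Increment2ProPipeline': 300,
}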
