A brief introduction to the Python crawler framework talonspider


1. Why write this?

For some simple pages, a full-featured framework is overkill, but writing everything by hand is tedious.

So talonspider was written for this requirement:

• 1. item extraction for a single page - details here
• 2. spider module - details here

2. Introduction & usage

2.1. item

This module can be used on its own. For a site with relatively simple requests (e.g. GET requests only), this module alone is enough to quickly write the crawler you want. For example (Python 3 is used below; Python 2 versions are in the examples directory):

2.1.1. Single page, single target

For example, to fetch the book information at http://book.qidian.com/info/1004608738, such as the cover, you can write:


from talonspider import Item, TextField, AttrField
from pprint import pprint

class TestSpider(Item):
  #  CSS selectors for each field to extract
  title = TextField(css_select='.book-info>h1>em')
  author = TextField(css_select='a.writer')
  cover = AttrField(css_select='a#bookImg>img', attr='src')

  def tal_title(self, title):
    return title

  def tal_cover(self, cover):
    #  The src attribute is protocol-relative, so prepend the scheme
    return 'http:' + cover

if __name__ == '__main__':
  item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
  pprint(item_data)

See qidian_details_by_item.py.
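
As the two tal_ methods above suggest, talonspider appears to route each extracted value through a method named tal_<field> before storing it, giving every field its own post-processing hook. A minimal hypothetical sketch of such a hook (the whitespace cleanup is my addition, not part of the original example):

from talonspider import Item, TextField

class AuthorItem(Item):
  author = TextField(css_select='a.writer')

  def tal_author(self, author):
    #  Hypothetical cleanup hook: normalize whitespace in the
    #  raw extracted value before it is stored on the item
    return author.strip() if isinstance(author, str) else author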

2.1.2. Single page, multiple targets

For example, the first page of the Douban Top 250 movies lists 25 films, i.e. 25 targets on a single page. You can write:


from talonspider import Item, TextField, AttrField
from pprint import pprint

#  Define a crawler that inherits from Item
class DoubanSpider(Item):
  #  target_item marks the repeated block: one div.item per movie
  target_item = TextField(css_select='div.item')
  title = TextField(css_select='span.title')
  cover = AttrField(css_select='div.pic>a>img', attr='src')
  abstract = TextField(css_select='span.inq')

  def tal_title(self, title):
    #  title may come back as a single string or a list of elements
    if isinstance(title, str):
      return title
    else:
      return ''.join([i.text.strip().replace('\xa0', '') for i in title])

if __name__ == '__main__':
  items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
  result = []
  for item in items_data:
    result.append({
      'title': item.title,
      'cover': item.cover,
      'abstract': item.abstract,
    })
  pprint(result)

See douban_page_by_item.py.
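
Since the extracted fields are plain Python values, the results can be persisted with the standard library alone. A minimal sketch, assuming the DoubanSpider Item class from above (the file name douban_page1.json is hypothetical):

import json

items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
#  Collect the fields of each item into plain dicts
movies = [{'title': i.title, 'cover': i.cover, 'abstract': i.abstract}
          for i in items_data]
#  ensure_ascii=False keeps the Chinese titles readable in the file
with open('douban_page1.json', 'w', encoding='utf-8') as f:
  json.dump(movies, f, ensure_ascii=False, indent=2)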

2.2. spider

The spider module comes in handy when you need to crawl multi-level pages, for example all 250 movies in the Douban Top 250:


#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


#  Define the extraction rules, inheriting from Item
class DoubanItem(Item):
  #  target_item marks the repeated block: one div.item per movie
  target_item = TextField(css_select='div.item')
  title = TextField(css_select='span.title')
  cover = AttrField(css_select='div.pic>a>img', attr='src')
  abstract = TextField(css_select='span.inq')

  def tal_title(self, title):
    if isinstance(title, str):
      return title
    else:
      return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
  #  Starting URLs (required)
  start_urls = ['https://movie.douban.com/top250']
  # requests configuration 
  request_config = {
    'RETRIES': 3,
    'DELAY': 0,
    'TIMEOUT': 20
  }
  #  Parse function (required)
  def parse(self, html):
    #  Convert the html into an etree
    etree = self.e_html(html)
    #  Extract the pagination links to build the page URLs
    pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
    #  The first page has no pagination link, so add it manually
    pages.insert(0, '?start=0&filter=')
    headers = {
      "User-Agent": get_random_user_agent()
    }
    for page in pages:
      url = self.start_urls[0] + page
      yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

  def parse_item(self, html):
    items_data = DoubanItem.get_items(html=html)
    for item in items_data:
      #  Append one title per line; utf-8 keeps Chinese titles intact
      with open('douban250.txt', 'a+', encoding='utf-8') as f:
        f.write(item.title + '\n')


if __name__ == '__main__':
  DoubanSpider.start()
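
Note the control flow here: parse() runs on each start URL and yields a Request for every pagination link, and the framework dispatches each response to the callback parse_item(), where DoubanItem.get_items() extracts the 25 movies on that page.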

Console:


/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage : 0:00:04.462108

Process finished with exit code 0

A douban250.txt file is generated in the current directory; see douban_page_by_spider.py.
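
As a quick sanity check (a hypothetical snippet, not part of the original article): parse_item appends one title per line, so a complete run over all ten pages should leave 250 lines in the file:

with open('douban250.txt', encoding='utf-8') as f:
  print(sum(1 for _ in f))  # expect 250 titles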

3. Notes

This is a learning project, so there are many places that can be improved. Comments are welcome; the project address is talonspider.

