A Brief Introduction to talonspider, a Python Crawler Framework
- 2020-06-03 07:06:48
- OfStack
1. Why write this?
Some simple pages don't need a full-blown framework to crawl, yet writing everything by hand each time is tedious.
So talonspider was written for exactly this need:
• 1. item extraction for a single page - details here
• 2. spider module - details here
2. Introduction & Usage
2.1.item
This module can be used on its own. For a site with relatively simple requests (e.g. GET requests only), this module alone is enough to quickly write the crawler you want, for example (python3 is used below; python2 versions are in the examples directory):
2.1.1. Single page, single target
For example, to grab the book information at http://book.qidian.com/info/1004608738, such as the cover, you can write:
from talonspider import Item, TextField, AttrField
from pprint import pprint

class TestSpider(Item):
    title = TextField(css_select='.book-info>h1>em')
    author = TextField(css_select='a.writer')
    cover = AttrField(css_select='a#bookImg>img', attr='src')

    # tal_<field> methods post-process the extracted value of the matching field
    def tal_title(self, title):
        return title

    def tal_cover(self, cover):
        # The src attribute is protocol-relative, so prepend the scheme
        return 'http:' + cover

if __name__ == '__main__':
    item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
    pprint(item_data)
See qidian_details_by_item.py.
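The tal_title/tal_cover methods above follow a naming convention: for each declared field, get_item looks for a method named tal_<field> and, if one exists, runs the raw extracted value through it. talonspider's actual internals aren't shown in this article; the dispatch idea can be sketched in plain Python (RawItem and clean_fields below are hypothetical names for illustration, not talonspider API):

```python
class RawItem:
    """Minimal sketch: run each raw field value through tal_<field> if defined."""

    def clean_fields(self, raw):
        cleaned = {}
        for name, value in raw.items():
            hook = getattr(self, 'tal_' + name, None)
            cleaned[name] = hook(value) if callable(hook) else value
        return cleaned


class BookItem(RawItem):
    def tal_cover(self, cover):
        # Protocol-relative src attributes need the scheme prepended
        return 'http:' + cover


raw = {'title': 'Some Book', 'cover': '//example.com/cover.jpg'}
print(BookItem().clean_fields(raw))
```

Fields without a matching tal_ method (title here) pass through unchanged.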
2.1.2. Single page, multiple targets
For example, to grab the 25 movies shown on the first page of Douban's Top 250. That page has 25 targets, so you can write:
from talonspider import Item, TextField, AttrField
from pprint import pprint

# Define an Item subclass
class DoubanSpider(Item):
    # target_item marks the repeated block; the other fields are extracted within each block
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])

if __name__ == '__main__':
    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
    result = []
    for item in items_data:
        result.append({
            'title': item.title,
            'cover': item.cover,
            'abstract': item.abstract,
        })
    pprint(result)
See douban_page_by_item.py.
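The role of target_item is worth spelling out: it selects every repeated block on the page (div.item), and the remaining fields are then extracted relative to each block. The same two-stage extraction can be sketched with lxml alone, using XPath instead of the CSS selectors above and a small inline HTML snippet standing in for the real page:

```python
from lxml import html

# Inline snippet mimicking the page's repeated div.item structure
PAGE = """
<ol>
  <li><div class="item"><span class="title">Movie A</span><span class="inq">Quote A</span></div></li>
  <li><div class="item"><span class="title">Movie B</span><span class="inq">Quote B</span></div></li>
</ol>
"""

tree = html.fromstring(PAGE)
results = []
# Stage 1: select every repeated block (the target_item role)
for block in tree.xpath('//div[@class="item"]'):
    # Stage 2: extract each field relative to the current block
    results.append({
        'title': block.xpath('.//span[@class="title"]/text()')[0],
        'abstract': block.xpath('.//span[@class="inq"]/text()')[0],
    })
print(results)
```

Scoping the second xpath call to the block (the leading `.`) is what keeps each title paired with its own abstract.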
2.2. spider
spider comes in handy when you need to crawl pages level by level, for example all 250 movies in Douban's Top 250:
#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent

# Define an Item subclass
class DoubanItem(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])

class DoubanSpider(Spider):
    # Start urls (required)
    start_urls = ['https://movie.douban.com/top250']
    # Request configuration
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,
        'TIMEOUT': 20
    }

    # Parse function (required)
    def parse(self, html):
        # Turn the html into an etree
        etree = self.e_html(html)
        # Extract the pagination links to build the next urls
        pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
        pages.insert(0, '?start=0&filter=')
        headers = {
            "User-Agent": get_random_user_agent()
        }
        for page in pages:
            url = self.start_urls[0] + page
            yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

    def parse_item(self, html):
        items_data = DoubanItem.get_items(html=html)
        # Save each title to a file
        with open('douban250.txt', 'a+') as f:
            for item in items_data:
                f.write(item.title + '\n')

if __name__ == '__main__':
    DoubanSpider.start()
Console:
/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage : 0:00:04.462108
Process finished with exit code 0
The current directory now contains douban250.txt; see douban_page_by_spider.py.
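Note that parse above is a generator: it yields Request objects, each carrying a callback, and the framework fetches each url and hands the response to that callback. talonspider's actual scheduler isn't shown in this article; a simplified, synchronous sketch of such a loop follows (Request, fetch, and run_spider here are hypothetical stand-ins, and the example.com urls are fake):

```python
from collections import deque

class Request:
    # Hypothetical stand-in: a url plus the callback that will receive its response
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def fetch(url):
    # Stand-in for a real HTTP GET; returns fake "html" for the sketch
    return '<html>%s</html>' % url

def run_spider(start_urls, parse):
    # Breadth-first loop: fetch each url, hand the html to its callback,
    # and enqueue any further Requests the callback yields
    queue = deque(Request(url, parse) for url in start_urls)
    fetched = []
    while queue:
        req = queue.popleft()
        fetched.append(req.url)
        result = req.callback(fetch(req.url))
        if result is not None:
            queue.extend(result)
    return fetched

def parse(html):
    # First level: yield one follow-up Request per "page"
    for page in ('?start=0', '?start=25'):
        yield Request('https://example.com/top250' + page, parse_item)

def parse_item(html):
    pass  # second level: extract items here

print(run_spider(['https://example.com/top250'], parse))
```

The real framework layers retries, delays, and timeouts (the request_config above) on top of this basic fetch-and-dispatch cycle.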
3. Notes
This was written for learning purposes and there is plenty of room for improvement. Comments are welcome - project address: talonspider.