Analysis of Modules Commonly Used in Python Crawlers

2020-04-02 13:59:25
OfStack

This article analyzes the modules commonly used in Python crawlers and illustrates them with examples. It is shared here for your reference. The analysis is as follows:

Creepy module

Developed by an expert Taiwanese programmer, it offers simple functionality and can automatically crawl all the content of a website. You can also restrict which URLs are crawled.

Address: https://pypi.python.org/pypi/creepy

Functional interface:

set_content_type_filter:
Sets the content types to fetch (the Content-Type in the response header), for example text/html.

add_url_filter:
Filters URLs; takes a regular expression.

set_follow_mode:
Sets the recursion mode. F_ANY: every link on the page is fetched. F_SAME_DOMAIN and F_SAME_HOST are similar: only links under the same domain name are crawled. F_SAME_PATH: only links under the same path are fetched. For example, for bag.vancl.com/l1/d3/1.jpg the path is l1/d3/1.jpg, so everything under l1/d3/* is fetched. You can add your own recursion modes as needed.

set_concurrency_level:
Sets the maximum number of threads.

process_document:
Usually needs to be overridden. It processes the page content and extracts the data you want.
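
To make the follow modes and the URL filter more concrete, here is a minimal standalone sketch of the decision they describe. It does not use Creepy and is not Creepy's own code: the function names should_follow and registered_domain, the numeric values of the mode constants, the two-label domain rule, and the reading of the URL filter as an exclusion list are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical constants that mirror the mode names described above.
F_ANY, F_SAME_DOMAIN, F_SAME_HOST, F_SAME_PATH = range(4)


def registered_domain(host):
    """Naive approximation: keep only the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def should_follow(start_url, link_url, mode, exclude_patterns=()):
    """Decide whether link_url should be crawled when starting from start_url."""
    # URL filter (assumed to exclude): skip links matching any regular expression.
    if any(re.search(pattern, link_url) for pattern in exclude_patterns):
        return False
    start, link = urlparse(start_url), urlparse(link_url)
    if mode == F_ANY:
        return True
    if mode == F_SAME_DOMAIN:
        return (registered_domain(start.hostname or "")
                == registered_domain(link.hostname or ""))
    if mode == F_SAME_HOST:
        return start.hostname == link.hostname
    if mode == F_SAME_PATH:
        # Same host, and the link sits under the start URL's directory.
        directory = start.path.rsplit("/", 1)[0] + "/"
        return start.hostname == link.hostname and link.path.startswith(directory)
    raise ValueError("unknown follow mode: %r" % (mode,))
```

With the start URL http://bag.vancl.com/l1/d3/1.jpg, the F_SAME_PATH mode in this sketch accepts http://bag.vancl.com/l1/d3/2.jpg and rejects http://bag.vancl.com/l2/x.html, which matches the l1/d3/* behavior described above.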

Selenium module
Selenium drives a real, visible browser and automates crawling. Its API is very simple to use: it feels like operating the browser yourself.

Official website: http://www.seleniumhq.org/
Python package page:
http://pypi.python.org/pypi/selenium
WebDriver API (very useful; worth getting to know well):
http://www.seleniumhq.org/docs/03_webdriver.jsp

The following is an example of crawling a website:


from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Firefox()
browser.get('http://bag.vancl.com/28145-28167-a18568_18571-b1-n3-s1.html#ref=hp-hp-hot-8_1_1-v:n')
elem = browser.find_element(By.NAME, 'ch_bag-3-page-next')  # Find the "next page" link
time.sleep(1)
print(elem.get_attribute("href"))
elem.click()  # Go to the next page

time.sleep(1)  # Give the new page time to load
elem = browser.find_element(By.NAME, 'ch_bag-3-page-next')  # Find the "next page" link again
print(elem.get_attribute("href"))
elem.click()

browser.quit()  # Close the browser when finished
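
The example repeats the same find, print, click steps for each page. As a rough sketch, those steps could be wrapped in a loop. The helper name collect_next_links and its parameters max_pages and delay are hypothetical and not part of Selenium. The helper only assumes a driver object with a Selenium-style find_elements(by, value) method, and it passes the locator strategy as the plain string "name", which is the value behind Selenium's By.NAME.

```python
import time


def collect_next_links(driver, link_name, max_pages=2, delay=1.0):
    """Click the link named link_name up to max_pages times.

    Returns the href seen before each click. `driver` is any object that
    offers a Selenium-style find_elements(by, value) method.
    """
    hrefs = []
    for _ in range(max_pages):
        # "name" is the locator strategy string behind selenium's By.NAME.
        matches = driver.find_elements("name", link_name)
        if not matches:
            break  # no "next page" link left, so stop
        link = matches[0]
        hrefs.append(link.get_attribute("href"))
        link.click()
        time.sleep(delay)  # crude pause so the next page can load
    return hrefs
```

With the browser object from the example above, collect_next_links(browser, 'ch_bag-3-page-next') would take the place of the two hand-written rounds. Using find_elements rather than find_element lets the loop end quietly on the last page instead of raising an exception.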

I hope this article has helped you with your Python programming.