A practical Python crawler library: RoboBrowser
- 2021-09-24 23:13:49
- OfStack
1. Preface
2. Installation and usage
3. Hands-on example
3-1 Opening the target website
3-2 Automated form submission
3-3 Data crawling
4. Conclusion
1. Preface
Hello, everyone, I'm Ango!
Today, I'd like to recommend a niche, lightweight crawler library: RoboBrowser
RoboBrowser, "your friendly neighborhood web scraper"! Written in pure Python and requiring no standalone browser, it can handle not only crawling but also Web automation
Project address:
https://github.com/jmcarp/robobrowser
2. Installation and usage
Before diving into the example, let's first install the dependency library and a parser
PS: The officially recommended parser is "lxml"
# Install dependencies
pip3 install robobrowser
# lxml Parser (officially recommended)
pip3 install lxml
Two commonly used features of RoboBrowser are simulated form submission and web page data crawling
When crawling page data with RoboBrowser, three common methods are:
find: returns the first element on the current page that matches the criteria
find_all: returns a list of elements on the current page that share the given attributes
select: queries the page with a CSS selector and returns a list of matching elements
Note that RoboBrowser depends on BS4, so its usage is very similar to BS4's
For more functions, please refer to:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
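Since RoboBrowser delegates parsing to BeautifulSoup, the three methods behave like their BS4 counterparts. Below is a minimal offline sketch using BeautifulSoup directly, with a made-up HTML snippet standing in for a fetched page (it uses the stdlib html.parser so no network or lxml is needed):

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a fetched page
html = """
<div class="result"><a href="/a">First</a></div>
<div class="result"><a href="/b">Second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find: the first element matching the criteria
first = soup.find("a")
print(first.text)  # First

# find_all: a list of all elements with the given attributes
links = soup.find_all("a")
print(len(links))  # 2

# select: CSS selector query, returns a list of elements
results = soup.select(".result")
print([r.find("a")["href"] for r in results])  # ['/a', '/b']
```

The same `find`, `find_all`, and `select` calls can be made on a RoboBrowser instance after opening a page.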
3. Hands-on example
Let's take "searching Baidu and crawling the search result list" as an example
3-1 Opening the target website
First, we instantiate a RoboBrowser object
from time import sleep
from robobrowser import RoboBrowser
home_url = 'https://baidu.com'
# parser: the HTML parser used by BeautifulSoup
# Official recommendation: lxml
rb = RoboBrowser(history=True, parser='lxml')
# Open the target Web site
rb.open(home_url)
Then, the target website is opened with the open() method of the RoboBrowser instance
3-2 Automated form submission
First, use the RoboBrowser instance to get the form on the page
Then, simulate keyboard input by assigning a value to the input box in the form
Finally, submit the form with the submit_form() method to simulate a search operation
# Get the form object on the page
bd_form = rb.get_form()
print(bd_form)

# 'wd' is the name attribute of Baidu's search input box
bd_form['wd'].value = "AirPython"

# Submit the form, simulating a search
rb.submit_form(bd_form)
3-3 Data crawling
After analyzing the structure of the search result page, match all search list elements with RoboBrowser's select() method
Then traverse the list elements and use the find() method to extract each item's title and href link address
# Match all search result elements
result_elements = rb.select(".result")

# Titles of the search results
search_result = []

# The a tag of the first result (follow_link() expects a tag, not a URL string)
first_href = None

for index, element in enumerate(result_elements):
    title = element.find("a").text
    href = element.find("a")['href']
    search_result.append(title)
    if index == 0:
        first_href = element.find("a")
        print('Link address of the first item:', href)

print(search_result)
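Since the live page requires a network request, the extraction loop above can be sketched offline with BeautifulSoup against a made-up result snippet (the `.result` class and the nested a tags here are assumptions that mirror the code above, not Baidu's actual markup):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking a search result list
html = """
<div class="result"><a href="https://example.com/1">Result one</a></div>
<div class="result"><a href="https://example.com/2">Result two</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

search_result = []
first_link = None

for index, element in enumerate(soup.select(".result")):
    a_tag = element.find("a")
    search_result.append(a_tag.text)
    if index == 0:
        # Keep the tag itself, since follow_link() expects a tag
        first_link = a_tag

print(search_result)       # ['Result one', 'Result two']
print(first_link["href"])  # https://example.com/1
```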
Finally, use RoboBrowser's follow_link() method to simulate "clicking a link to view the detail page"
# Follow the first link
rb.follow_link(first_href)

# Print the URL of the current page
print(rb.url)
Note that the argument to the follow_link() method is an a tag that carries an href value, not a bare URL string
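Under the hood, following a link means reading the tag's href and resolving it against the current page's URL before issuing the request. That resolution step corresponds to the stdlib urllib.parse.urljoin; the snippet below is a sketch of the idea, not RoboBrowser's actual code, and the URLs are made up:

```python
from urllib.parse import urljoin

# A hypothetical current page (a Baidu search result URL)
current_url = "https://www.baidu.com/s?wd=AirPython"

# An absolute href is used as-is
print(urljoin(current_url, "https://example.com/page"))
# https://example.com/page

# A relative href is resolved against the current page
print(urljoin(current_url, "/link?url=abc"))
# https://www.baidu.com/link?url=abc
```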
4. Conclusion
Using Baidu search as the example, this article walked through an automation and crawling workflow with RoboBrowser
Compared with Selenium, Helium, and similar tools, RoboBrowser is lightweight and does not depend on a standalone browser or driver
For simple crawling or Web automation tasks, RoboBrowser is entirely sufficient; for more complex automation scenarios, Selenium, Pyppeteer, Helium, etc. are recommended