A practical Python crawler library: RoboBrowser

  • 2021-09-24 23:13:49
  • OfStack

Catalogue

1. Preface
2. Installation and usage
3. Hands-on Example
3-1 Open Target Website
3-2 Automated Form Submission
3-3 Data Crawling
4. Conclusion

1. Preface

Hello everyone, I'm Ango!

Today I'd like to recommend a niche, lightweight crawler library: RoboBrowser

RoboBrowser, "your friendly neighborhood web scraper", is written in pure Python and runs without a standalone browser. It can be used not only for crawling but also for web automation.

Project address:

​https://github.com/jmcarp/robobrowser

2. Installation and usage

Before the hands-on example, let's install the library and its parser.

PS: The officially recommended parser is "lxml"


#  Install dependencies 
pip3 install robobrowser

# lxml Parser (officially recommended) 
pip3 install lxml
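
If lxml is not installed, BeautifulSoup's built-in "html.parser" also works. Here is a minimal sketch of picking the parser at runtime; the fallback logic is my own illustration, not part of the original example:

from robobrowser import RoboBrowser

#  Prefer lxml; fall back to the parser bundled with Python 
try:
    import lxml  # noqa: F401
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'

rb = RoboBrowser(parser=parser)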

Two common features of RoboBrowser are:

  • Simulated form submission
  • Web page data crawling

When crawling web page data with RoboBrowser, three methods are commonly used:

  • find: returns the first element on the current page that matches the criteria
  • find_all: returns a list of all elements on the current page that share the given attributes
  • select: queries the page with a CSS selector and returns a list of elements

Note that RoboBrowser is built on BeautifulSoup (bs4), so these methods work much like their BeautifulSoup counterparts.
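
As a quick illustration, here is a minimal sketch of the three methods; the URL and the CSS selector below are hypothetical and only for demonstration:

from robobrowser import RoboBrowser

rb = RoboBrowser(parser='lxml')
rb.open('https://example.com')

first_link = rb.find('a')              #  First matching element 
all_links = rb.find_all('a')           #  List of all matching elements 
results = rb.select('div.result > a')  #  CSS selector, returns a list 

print(first_link, len(all_links), len(results))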

For more functions, please refer to:

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

3. Hands-on Example

Let's take "searching on Baidu and crawling the search results list" as an example.

3-1 Open Target Website

First, we instantiate a RoboBrowser object.


from robobrowser import RoboBrowser

home_url = 'https://baidu.com'

# parser:  The HTML parser used by BeautifulSoup 
#  Official recommendation: lxml 
rb = RoboBrowser(history=True, parser='lxml')

#  Open the target Web site 
rb.open(home_url)

Then, call the open() method on the RoboBrowser instance to open the target website.

3-2 Automated Form Submission

First, use the RoboBrowser instance to get the form on the web page.

Then, simulate typing by assigning a value to the search input box in the form.

Finally, submit the form with the submit_form() method to simulate a search.


#  Get a form object 
bd_form = rb.get_form()

print(bd_form)

bd_form['wd'].value = "AirPython"

#  Submit the form to simulate a search 
rb.submit_form(bd_form)

3-3 Data Crawling

Analyze the structure of the search results page, and match all of the search result elements with RoboBrowser's select() method.

Then traverse the search result elements and use the find() method to get each item's title and href link address.


#  Match all search result elements 
result_elements = rb.select(".result")

#  Titles of the search results 
search_result = []

#  Link ( a  tag) of the first item 
first_href = ''

for index, element in enumerate(result_elements):
    title = element.find("a").text
    href = element.find("a")['href']
    search_result.append(title)

    if index == 0:
        first_href = element.find("a")
        print('The first item address is:', href)

print(search_result)

Finally, use RoboBrowser's follow_link() method to simulate clicking the link and viewing the detail page.


#  Jump to the first link 
rb.follow_link(first_href)

#  Print the current page URL 
print(rb.url)

Note that the argument to the follow_link() method is an a tag that has an href value.
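
Since the browser was created with history=True, it also keeps a navigation history, so you can return to the search results page after viewing the detail page; a minimal sketch using RoboBrowser's back() method:

#  Go back one step in history, to the search results page 
rb.back()
print(rb.url)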

4. Conclusion

Using the Baidu search example, this article walked through a complete automation and crawling workflow with RoboBrowser.

Compared with Selenium, Helium, and similar tools, RoboBrowser is lightweight and does not depend on a standalone browser or driver.

For simple crawling or web automation tasks, RoboBrowser is more than enough; for more complex automation scenarios, Selenium, Pyppeteer, or Helium are recommended instead.

That's the introduction to using the Python crawler library RoboBrowser; for more about Python crawler libraries, check the other related articles on this site!

