Searching in detail with Python's Beautiful Soup module

  • 2020-05-27 06:06:24
  • OfStack

Preface

This article uses the search capabilities of the Beautiful Soup module to find content by tag name, tag attributes, document text, and regular expressions.

The search methods

Beautiful Soup provides the following built-in search methods:

  • find()
  • find_all()
  • find_parent()
  • find_parents()
  • find_next_sibling()
  • find_next_siblings()
  • find_previous_sibling()
  • find_previous_siblings()
  • find_previous()
  • find_all_previous()
  • find_next()
  • find_all_next()

Search using the find() method

The first step is to create an HTML file for testing, saved as search.html (the name the code below opens).


<html>
<body>
<div class="ecopyramid">
 <ul id="producers">
 <li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
 </li>
 <li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
 </li>
 </ul>
 <ul id="primaryconsumers">
 <li class="primaryconsumerlist">
  <div class="name">deer</div>
  <div class="number">1000</div>
 </li>
 <li class="primaryconsumerlist">
  <div class="name">rabbit</div>
  <div class="number">2000</div>
 </li>
 </ul>
 <ul id="secondaryconsumers">
 <li class="secondaryconsumerlist">
  <div class="name">fox</div>
  <div class="number">100</div>
 </li>
 <li class="secondaryconsumerlist">
  <div class="name">bear</div>
  <div class="number">100</div>
 </li>
 </ul>
 <ul id="tertiaryconsumers">
 <li class="tertiaryconsumerlist">
  <div class="name">lion</div>
  <div class="number">80</div>
 </li>
 <li class="tertiaryconsumerlist">
  <div class="name">tiger</div>
  <div class="number">50</div>
 </li>
 </ul>
</div>
</body>
</html>

We can use the find() method to get the <ul> tag; by default it returns the first one that appears. From that result we take the first <li> tag and, inside it, the first <div> tag, then print its content to verify that the first occurrence of each tag was obtained.


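from bs4 import BeautifulSoup

# Parse the test file with the lxml parser
with open('search.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# find() returns the first <ul>; .li and .div then take the first matching descendant
first_ul_entries = soup.find('ul')
print(first_ul_entries.li.div.string)  # plants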

The signature of the find() method is:


find(name,attrs,recursive,text,**kwargs) 

As the signature shows, find() accepts five parameters: name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters act as filters that make the match more precise. (Beautiful Soup 4.4.0 renamed the text parameter to string; text is the older name.)
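As a minimal sketch of how these parameters narrow a search, the snippet below uses an inline fragment modeled on the test file and the built-in html.parser. Both are assumptions made here so that the example runs without search.html or lxml.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = """
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>
</ul>
<ul id="primaryconsumers">
<li class="primaryconsumerlist"><div class="name">deer</div><div class="number">1000</div></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name: match on the tag name.
by_name = soup.find("ul")
# attrs: match on attribute values given as a dictionary.
by_attrs = soup.find("div", attrs={"class": "number"})
# recursive=False: look only at direct children instead of all descendants.
direct_li = soup.find("li", recursive=False)
nested_li = soup.find("li")

print(by_name["id"])         # producers
print(by_attrs.string)       # 100000
print(direct_li)             # None, because no <li> is a direct child of the document
print(nested_li.div.string)  # plants
```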

Search by tag

Besides the <ul> tag searched for in the code above, we can also search for the <li> tag; the result is again the first match that occurs.


tag_li = soup.find('li')
# Equivalent: tag_li = soup.find(name="li")
print(type(tag_li))       # <class 'bs4.element.Tag'>
print(tag_li.div.string)  # plants

Search by text

To search by text content alone, pass only the text parameter:


search_for_text = soup.find(text='plants')
print(type(search_for_text))
# <class 'bs4.element.NavigableString'>

The result is a NavigableString object rather than a Tag.

Search by regular expressions

The following is a fragment of HTML text:


<div>The below HTML has the information that has email ids.</div>
 abc@example.com 
<div>xyz@example.com</div> 
 <span>foo@example.com</span>

Note that the address abc@example.com is not enclosed in any tag, so it cannot be found by searching on a tag. In this case we can match with a regular expression.


import re
from bs4 import BeautifulSoup

email_id_example = """
 <div>The below HTML has the information that has email ids.</div>
 abc@example.com
 <div>xyz@example.com</div>
 <span>foo@example.com</span>
 """
email_soup = BeautifulSoup(email_id_example, 'lxml')
print(email_soup)
# Raw string so the backslashes reach the regex engine unchanged
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
# find() returns the first text node that matches the pattern
first_email_id = email_soup.find(text=emailid_regexp)
print(first_email_id)

When matching with a regular expression, if there are multiple matches, the first is returned.

Search by tag attribute value

Tags can also be searched by attribute value:


search_for_attribute = soup.find(id='primaryconsumers')
print(search_for_attribute.li.div.string)  # deer

Searching by attribute value in this way works for most attributes, such as id, style, and title.

Two cases need different handling:

  • Custom attributes
  • The class attribute

In these cases the attribute name cannot be used directly as a keyword argument; instead we pass it to find() through the attrs parameter.

Search by custom attributes

HTML5 allows custom attributes on tags, such as the data-custom attribute in the example below.

As the following code shows, searching the same way we did for id raises an error, because the name of a Python keyword argument cannot contain the - character.


customattr = """
 <p data-custom="custom">custom attribute example</p>
   """
customsoup = BeautifulSoup(customattr,'lxml')
customsoup.find(data-custom="custom")
# SyntaxError: keyword can't be an expression

Instead, pass a dictionary to the attrs parameter to perform the search:


using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)

Search by CSS class

class is a reserved word in Python, so the CSS class attribute cannot be passed directly as a keyword argument. It can be searched like a custom attribute, by passing a dictionary to the attrs parameter.

Alternatively, use the class_ keyword argument; the trailing underscore distinguishes it from the reserved word class and avoids the error.
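Both approaches can be sketched as follows. The inline fragment and the html.parser backend are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = '<ul id="producers"><li class="producerlist"><div class="name">plants</div></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Pass the class through the attrs dictionary...
via_attrs = soup.find(attrs={"class": "producerlist"})
# ...or through the class_ keyword argument.
via_keyword = soup.find(class_="producerlist")

print(via_attrs.div.string)      # plants
print(via_attrs is via_keyword)  # True, both calls return the same tag
```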



Search using custom functions

We can also pass a function to find(); the search then uses the condition that the function defines.

The function should return either True or False.
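A minimal sketch of such a filter function follows. The inline fragment and the html.parser backend are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = """<ul id="producers"><li><div class="name">plants</div></li></ul>
<ul id="primaryconsumers"><li><div class="name">deer</div></li></ul>"""
soup = BeautifulSoup(html, "html.parser")

def is_producers(tag):
    # True only for tags that have an id attribute equal to "producers".
    return tag.has_attr("id") and tag.get("id") == "producers"

# find() calls is_producers on each tag and returns the first one it accepts.
tag_producers = soup.find(is_producers)
print(tag_producers.li.div.string)  # plants
```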



The code defines an is_producers function, which checks whether a tag has an id attribute whose value equals producers, returning True if so and False otherwise.

Use a combination of search methods

Beautiful Soup provides several kinds of search filters, which can also be combined to make a search more precise.
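As a hedged illustration, the snippet below combines a tag name with a class filter. The two-tag fragment and the class name note are hypothetical and chosen only to show the effect; html.parser is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two different tags share the same class.
html = """<p class="note">paragraph with class note</p>
<div class="note">div with class note</div>"""
soup = BeautifulSoup(html, "html.parser")

# class_ alone matches the first tag with that class, here the <p>.
first_note = soup.find(class_="note")
# Combining the tag name with class_ narrows the match to the <div>.
note_div = soup.find("div", class_="note")

print(first_note.name)  # p
print(note_div.string)  # div with class note
```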



Search using the find_all() method

The find() method returns only the first match, whereas find_all() returns every matching entry.

The filters used with find() can also be used with find_all(). In fact, they can be used with any of the search methods, such as find_parents() and the other methods listed earlier.


# Search for all tags whose class attribute equals tertiaryconsumerlist
all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
print(type(all_tertiaryconsumers))
for tertiaryconsumers in all_tertiaryconsumers:
    print(tertiaryconsumers.div.string)

The signature of the find_all() method is:


find_all(name,attrs,recursive,text,limit,**kwargs)

Its parameters are similar to those of find(), with one addition: limit. The limit parameter caps the number of results returned; find() behaves like find_all() with a limit of 1.
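The effect of limit can be sketched as follows, using a hypothetical four-item list and html.parser so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with four list items.
html = "<ul><li>plants</li><li>algae</li><li>deer</li><li>rabbit</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")           # no limit: every match
first_two = soup.find_all("li", limit=2)  # stop after two matches
only_one = soup.find_all("li", limit=1)   # a one-element list

print(len(all_items))                     # 4
print([li.string for li in first_two])    # ['plants', 'algae']
print(only_one[0] is soup.find("li"))     # True
```

Note that find_all() with limit=1 still returns a list, while find() returns the tag itself.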

We can also pass a list of strings as an argument to search for several tag names, attribute values, custom attribute values, or CSS classes at once.
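A short sketch of list filters follows. The inline fragment is modeled on the test file and html.parser is used; both are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = """<ul>
<li class="producerlist"><div class="name">plants</div></li>
<li class="primaryconsumerlist"><div class="name">deer</div></li>
<li class="secondaryconsumerlist"><div class="name">fox</div></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# A list of tag names matches any tag whose name is in the list.
div_and_li = soup.find_all(["div", "li"])
# A list of class names matches any tag carrying one of those classes.
two_classes = soup.find_all(class_=["producerlist", "primaryconsumerlist"])

print(len(div_and_li))                          # 6 (three <li> and three <div>)
print([tag.div.string for tag in two_classes])  # ['plants', 'deer']
```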



Search for related tags

In general, we use find() and find_all() to locate specific tags; we can also search for other tags related to them.

Search for parent tag

Use find_parent() or find_parents() to search for the parents of a tag.

find_parent() returns the first match and find_parents() returns all matches, in the same way as find() and find_all().
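The two methods can be sketched as follows. The inline fragment is modeled on the test file and html.parser is used; both are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = ('<div class="ecopyramid"><ul id="primaryconsumers">'
        '<li class="primaryconsumerlist"><div class="name">deer</div></li>'
        '</ul></div>')
soup = BeautifulSoup(html, "html.parser")

name_div = soup.find("div", class_="name")

# find_parent() with no filter returns the immediate parent.
immediate_parent = name_div.find_parent()
# find_parent("ul") returns the closest ancestor that is a <ul>.
closest_ul = name_div.find_parent("ul")
# find_parents() returns every matching ancestor, nearest first.
ancestor_names = [tag.name for tag in name_div.find_parents(["li", "ul"])]

print(immediate_parent.name)  # li
print(closest_ul["id"])       # primaryconsumers
print(ancestor_names)         # ['li', 'ul']
```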



Search for sibling tags

Beautiful Soup can also search for sibling tags, that is, tags at the same level of the tree.

find_next_siblings() returns all the following siblings that match, while find_next_sibling() returns only the next one.
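A minimal sketch of both methods follows, using an inline fragment that keeps only the four <ul> tags of the test file and the html.parser backend (both assumptions for this sketch).

```python
from bs4 import BeautifulSoup

# Only the four sibling <ul> tags of search.html (an assumption for this sketch).
html = ('<ul id="producers"></ul><ul id="primaryconsumers"></ul>'
        '<ul id="secondaryconsumers"></ul><ul id="tertiaryconsumers"></ul>')
soup = BeautifulSoup(html, "html.parser")

producers = soup.find(id="producers")

# The next <ul> at the same level.
next_ul = producers.find_next_sibling("ul")
# Every following <ul> at the same level, in document order.
following_uls = producers.find_next_siblings("ul")

print(next_ul["id"])                      # primaryconsumers
print([ul["id"] for ul in following_uls])
# ['primaryconsumers', 'secondaryconsumers', 'tertiaryconsumers']
```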



Likewise, find_previous_siblings() and find_previous_sibling() search the preceding sibling tags.
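The backward direction can be sketched the same way. The inline fragment and the html.parser backend are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Only the four sibling <ul> tags of search.html (an assumption for this sketch).
html = ('<ul id="producers"></ul><ul id="primaryconsumers"></ul>'
        '<ul id="secondaryconsumers"></ul><ul id="tertiaryconsumers"></ul>')
soup = BeautifulSoup(html, "html.parser")

tertiary = soup.find(id="tertiaryconsumers")

# The closest preceding <ul> at the same level.
previous_ul = tertiary.find_previous_sibling("ul")
# Every preceding <ul> at the same level, nearest first.
preceding_uls = tertiary.find_previous_siblings("ul")

print(previous_ul["id"])                  # secondaryconsumers
print([ul["id"] for ul in preceding_uls])
# ['secondaryconsumers', 'primaryconsumers', 'producers']
```

The results come back nearest first, so the order is the reverse of document order.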

Search for the next tag

find_next() returns the first matching tag that comes after the current one in the document, and find_all_next() returns all such tags.


# Search for all <li> tags that come after the first <div>
first_div = soup.div
all_li_tags = first_div.find_all_next("li")
print(all_li_tags)

Search for the previous tag

As with the next tag, use find_previous() and find_all_previous() to search the tags that come before the current one in the document.
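A minimal sketch of the backward search follows. The inline fragment is modeled on the test file and html.parser is used; both are assumptions made so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on search.html (an assumption for this sketch).
html = ('<ul id="producers"><li class="producerlist">'
        '<div class="name">plants</div><div class="number">100000</div>'
        '</li></ul>')
soup = BeautifulSoup(html, "html.parser")

number_div = soup.find("div", class_="number")

# The closest <div> that starts before number_div in the document.
previous_div = number_div.find_previous("div")
# Every <li> or <ul> that starts before number_div, nearest first.
# Enclosing tags count, because their opening tags come earlier.
earlier_tags = number_div.find_all_previous(["li", "ul"])

print(previous_div.string)               # plants
print([tag.name for tag in earlier_tags])  # ['li', 'ul']
```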
