Modifying content with Python's Beautiful Soup module: method examples

  • 2020-05-27 06:09:18
  • OfStack

Preface

In addition to searching and navigating, the Beautiful Soup module can also modify the content of HTML/XML documents: it can add or remove tags, change tag names, change tag attribute values, modify text content, and so on. This article walks through the methods Beautiful Soup provides for modifying content, with an example for each.

Modify tags

The sample HTML document used is as follows:


html_markup="""
 <div class="ecopyramid">
 <ul id="producers">
  <li class="producerlist">
  <div class="name">plants</div>
  <div class="number">100000</div>
  </li>
  <li class="producerlist">
  <div class="name">algae</div>
  <div class="number">100000</div>
  </li>
 </ul>
 </div>
 """

Modify the tag name


from bs4 import BeautifulSoup

soup = BeautifulSoup(html_markup, 'lxml')
producer_entries = soup.ul
print(producer_entries.name)  # the current tag name: ul
# Rename the <ul> tag to <div>
producer_entries.name = "div"
print(producer_entries.prettify())

Modify the tag attribute value


# Modify tag attributes
# Update the value of an existing attribute
producer_entries['id'] = "producers_new_value"
print(producer_entries.prettify())
# Add a new attribute to the tag
producer_entries['class'] = "newclass"
print(producer_entries.prettify())
# Delete an attribute from the tag
del producer_entries['class']
print(producer_entries.prettify())

Add a new tag

We can use the new_tag() method to create a new tag and then use the append(), insert(), insert_after(), or insert_before() method to add the tag to the HTML tree.

For example, to add an li tag to the ul tag of the HTML document above, first create the new li tag, then insert it into the HTML tree. The corresponding div tags are then inserted into the li tag.


# Add a new tag
# new_tag() creates a tag object
new_li_tag = soup.new_tag("li")
# Attributes can be passed to new_tag() as keyword arguments
new_atag = soup.new_tag("a", href="www.example.com")
# Attributes can also be set through the .attrs dictionary
new_li_tag.attrs = {'class': 'producerlist'}
# Re-parse the sample document to start again from the original tree
soup = BeautifulSoup(html_markup, 'lxml')
producer_entries = soup.ul
# Use append() to add the new tag at the end of the ul
producer_entries.append(new_li_tag)
print(producer_entries.prettify())
# Create two div tags and insert them into the li tag
new_div_name_tag = soup.new_tag("div")
new_div_name_tag['class'] = "name"
new_div_number_tag = soup.new_tag("div")
new_div_number_tag["class"] = "number"
# Use insert() to specify the position at which each tag is inserted
new_li_tag.insert(0, new_div_name_tag)
new_li_tag.insert(1, new_div_number_tag)
print(new_li_tag.prettify())

Modify string content

String content can be modified with the .string attribute or with the new_string(), append(), and insert() methods.


# Modify string content
# Use the .string attribute to set the string content
new_div_name_tag.string = 'new_div_name'
# Use append() to add more string content
new_div_name_tag.append("producer")
# Use the soup object's new_string() method to create a string
new_string_toappend = soup.new_string("producer")
new_div_name_tag.append(new_string_toappend)
# Use insert() to insert a string at a given position
new_string_toinsert = soup.new_string("10000")
new_div_number_tag.insert(0, new_string_toinsert)
print(producer_entries.prettify())
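
The difference between assigning to .string and calling append() or insert() is easier to see on a standalone fragment. The following is a minimal sketch, not part of the original example: the one-line markup is made up for illustration, and it assumes Python's built-in "html.parser" rather than lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, modeled on one div of the sample document
soup = BeautifulSoup('<div class="name">plants</div>', "html.parser")
div = soup.div

div.string = "algae"                 # replaces the existing content "plants"
div.append(" and ")                  # adds a string after the existing content
div.append(soup.new_string("moss"))  # same, using an explicitly created string
div.insert(0, "green ")              # inserts a string at position 0

text = div.get_text()
print(text)  # green algae and moss
```

Assigning to .string discards whatever the tag contained before, while append() and insert() keep the existing children and add to them.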

Delete tag nodes

The Beautiful Soup module provides the decompose() and extract() methods for deleting nodes.

The decompose() method deletes the current node together with all of its child nodes.

The extract() method removes a node or string from the HTML tree and returns it.


# Remove nodes
third_producer = soup.find_all("li")[2]
# Use decompose() to delete the first div node inside the third li
div_name = third_producer.div
div_name.decompose()
print(third_producer.prettify())
# Use extract() to remove the li node from the tree; the removed node is returned
third_producer_removed = third_producer.extract()
print(soup.prettify())
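
Because extract() returns the removed node, that node can be reused elsewhere, whereas a decomposed node is gone for good. The sketch below illustrates this under a few assumptions that are not in the original: the two-list markup and the ids "a" and "b" are invented, and "html.parser" is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one populated list and one empty list
markup = '<ul id="a"><li>plants</li><li>algae</li></ul><ul id="b"></ul>'
soup = BeautifulSoup(markup, "html.parser")
first_ul, second_ul = soup.find_all("ul")

# extract() detaches the first <li> and hands it back, so it can be moved
moved = first_ul.li.extract()
second_ul.append(moved)

# decompose() destroys the remaining <li> together with its content
first_ul.li.decompose()

result = str(soup)
print(result)  # <ul id="a"></ul><ul id="b"><li>plants</li></ul>
```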

Delete tag content

A tag may have NavigableString objects or Tag objects as its child nodes. The clear() method removes all of these child nodes, that is, all of the tag's content.
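
As a minimal sketch of clear(), the fragment below empties a single li tag. The markup is a made-up one-item version of the sample document, and "html.parser" is an assumption chosen so the snippet has no dependency on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an <li> holding one child tag and one string
markup = '<li class="producerlist"><div class="name">plants</div>100000</li>'
soup = BeautifulSoup(markup, "html.parser")

li_tag = soup.li
li_tag.clear()  # removes the child <div> and the trailing string

emptied = str(li_tag)
print(emptied)  # <li class="producerlist"></li>
```

The tag itself and its attributes stay in the tree; only its children are removed.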

Other ways to modify content

In addition to the methods mentioned above, there are other methods to modify the content.

insert_after() and insert_before() methods

These two methods insert a tag or string immediately after or before another tag or string. Each takes the element to insert as its argument, which is either a NavigableString object or a Tag object.
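
A short sketch of both methods follows. The one-item list and the inserted text are invented for illustration, and "html.parser" is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a single list item
markup = '<ul id="producers"><li class="producerlist">algae</li></ul>'
soup = BeautifulSoup(markup, "html.parser")
algae_li = soup.li

# Create a new tag and place it before the existing <li>
plants_li = soup.new_tag("li")
plants_li.string = "plants"
algae_li.insert_before(plants_li)

# Place a plain string after the existing <li>
algae_li.insert_after(soup.new_string("end of list"))

items = [li.get_text() for li in soup.find_all("li")]
print(items)  # ['plants', 'algae']
print(soup.ul)
```

Both methods are called on the reference element already in the tree, not on the element being inserted.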

replace_with() method

This method replaces the original tag or string with a new tag or string, which it receives as input.
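
The sketch below replaces first a tag and then a string. The fragment and the replacement values ("phytoplankton", "kelp") are made up for illustration, and "html.parser" is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one entry of the sample document
markup = '<li class="producerlist"><div class="name">algae</div></li>'
soup = BeautifulSoup(markup, "html.parser")

# Replace a whole tag with a newly created one
old_div = soup.div
new_div = soup.new_tag("div")
new_div["class"] = "name"
new_div.string = "phytoplankton"
replaced = old_div.replace_with(new_div)  # returns the element that was removed

# Replace only the string inside the new tag
new_div.string.replace_with("kelp")

result = str(soup.li)
print(result)  # <li class="producerlist"><div class="name">kelp</div></li>
```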

wrap() and unwrap() methods

The wrap() method wraps a tag or string in another tag.

The unwrap() method does the opposite: it removes a tag but keeps its contents in place.


# wrap(): wrap each li tag in a new div tag
li_tags = soup.find_all('li')
for li in li_tags:
    new_div_tag = soup.new_tag('div')
    li.wrap(new_div_tag)
print(soup.prettify())
# unwrap(): replace the first div inside each li with that div's contents
li_tags = soup.find_all("li")
for li in li_tags:
    li.div.unwrap()
print(soup.prettify())
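
To see wrap() and unwrap() acting as exact opposites, unwrap() has to be called on the wrapper tag itself, which is the parent of each wrapped li. The following is a minimal round-trip sketch on a made-up two-item list, assuming "html.parser" instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two plain list items
markup = "<ul><li>plants</li><li>algae</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# Wrap every <li> in its own <div>
for li in soup.find_all("li"):
    li.wrap(soup.new_tag("div"))
wrapped = str(soup.ul)
print(wrapped)  # <ul><div><li>plants</li></div><div><li>algae</li></div></ul>

# Undo the wrapping: the wrapper is now each <li>'s parent
for li in soup.find_all("li"):
    li.parent.unwrap()
unwrapped = str(soup.ul)
print(unwrapped)  # <ul><li>plants</li><li>algae</li></ul>
```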
