Python Example: Formatting Web Page Text

  • 2021-12-05 06:59:06
  • OfStack

1. A web page usually contains text. For each type of text, we can choose an appropriate HTML semantic element to mark it up.

2. The em element marks content that should be emphasized, and the small element is used for side comments and small print such as bylines.

Example


<body>
    <h1> The Analects, Book 1: Xue Er </h1>
    <p><small>
    <b> By: </b><abbr title="Given name Qiu, courtesy name Zhongni"> Confucius <sup><a href="#" rel="external nofollow" >1</a></sup></abbr> ( <time> September 28, 551 BC to April 11, 479 BC </time> ) 
    </small></p>
    <h2> Introduction to this book </h2>
    <p> "Xue Er" is the title of the first book of the Analects of Confucius. Each book of the Analects generally takes the first two or three characters of its first chapter as its title. "Xue Er" contains 16 chapters and covers many topics. The key points are 
     <strong> "I examine myself three times a day"; "Be frugal in spending and love the people, and employ them only at the proper seasons"; "In practicing the rites, harmony is to be prized"; and benevolence, filial piety, and trustworthiness, </strong> among other moral categories. </p>
    <h2> Original text </h2>
    <p> The Master said, " <mark> Is it not a pleasure to learn and to practice what you have learned? </mark> Is it not a joy to have friends come from afar? Is it not the mark of a gentleman to feel no resentment when others do not recognize you? "  </p>
  </body>
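
To read markup like this from Python, one option is the standard library's html.parser module. The following is a minimal sketch and not part of the original example: the class name SemanticTextCollector, its default tag list, and the shortened sample string are all illustrative assumptions. It collects the text found inside selected semantic elements.

```python
from html.parser import HTMLParser


class SemanticTextCollector(HTMLParser):
    """Collect text that appears inside selected semantic elements.

    Hypothetical helper for illustration; not part of the original article.
    """

    def __init__(self, tags=("em", "small", "strong", "mark")):
        super().__init__()
        self.tags = set(tags)
        self.open_tags = []   # stack of tracked tags that are currently open
        self.found = {}       # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching open tag.
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.open_tags:
            # Attribute the text to the innermost tracked element.
            self.found.setdefault(self.open_tags[-1], []).append(text)


# Shortened, made-up sample in the spirit of the page above.
sample = (
    "<p><small><b>By:</b> Confucius</small></p>"
    "<p>Key points: <strong>benevolence, filial piety</strong>.</p>"
    "<p><mark>Is it not a pleasure to learn?</mark> Friends from afar.</p>"
)

collector = SemanticTextCollector()
collector.feed(sample)
collector.close()
for tag, fragments in collector.found.items():
    print(tag, "->", fragments)
```

Text outside the tracked elements, such as "Key points:", is skipped, so the result contains only what the page author marked as small print, important, or highlighted.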

Further knowledge points:

Converting between int and string in Python

string -> int

1. Convert a decimal string to an int

int('12')  # 12

2. Convert a hexadecimal string to an int

int('12', 16)  # 18

int -> string

1. Convert an int to a decimal string

str(18)  # '18'

2. Convert an int to a hexadecimal string

hex(18)  # '0x12'
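
The four conversions above can be combined into a round trip. This is a small sketch with illustrative variable names; the use of format() to drop the "0x" prefix and the ValueError check are additions that the original list does not cover.

```python
# Illustrative values only; the variable names are not from the article.
decimal_text = "12"
hex_text = "12"

as_decimal = int(decimal_text)   # base 10 is the default
as_hex = int(hex_text, 16)       # interpret the same digits as base 16

back_to_decimal = str(as_hex)    # decimal string
back_to_hex = hex(as_hex)        # hexadecimal string with a "0x" prefix
bare_hex = format(as_hex, "x")   # hexadecimal string without the prefix

print(as_decimal, as_hex, back_to_decimal, back_to_hex, bare_hex)

# int() raises ValueError when the text is not a valid number in that base.
try:
    int("d2")
except ValueError as err:
    print("not a decimal number:", err)
```

Note that int() also accepts the prefixed form when the base is given, so int(hex(18), 16) returns 18 again.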

2. When the second page of results is selected on Lianjia, the only change is that "d2" is appended to the end of the URL, for example http://sh.lianjia.com/ershoufang/pudong/d2. To crawl more pages, you only need to update the page URL passed to requests inside a loop.

3. After adding a loop, all of the crawled results can be printed.


from lxml import etree
import requests

url = 'http://sh.lianjia.com/ershoufang/'
region = 'pudong'
price = 'p23'
finalURL = url + region + price


def spider_room(finallyURL):
    # Fetch and decode the requested page
    html = requests.get(finallyURL).content.decode('utf-8')
    dom_tree = etree.HTML(html)
    # Each <li> under the listing <ul> is one listing
    all_message = dom_tree.xpath("//ul[@class='js_fang_list']/li")
    for message in all_message:
        # string(.) joins all text nodes under the <li>
        print(message.xpath('string(.)').strip())


# Result pages are numbered from 1: .../d1, .../d2, ..., .../d20
for i in range(1, 21):
    finallyURL = finalURL + '/d' + str(i)
    spider_room(finallyURL)

4. This crawls 20 pages of content, but the output format of the content is unchanged.
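
Since the output format is unchanged, the text of each listing still carries the line breaks and indentation of the page's markup. The sketch below shows one way to tidy it and to build the page URLs up front. The helper names page_urls and normalize_text are hypothetical, the sample string is made up, and it is an assumption that the first page is also reachable with a "d1" suffix.

```python
import re


def page_urls(base_url, pages):
    """Return the URL of each result page, e.g. base_url + '/d2' for page 2.

    Assumes page 1 can be addressed as '/d1' (not confirmed by the source).
    """
    return [f"{base_url}/d{page}" for page in range(1, pages + 1)]


def normalize_text(raw):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", raw).strip()


# Made-up sample imitating the text of one scraped list item.
raw_item = "\n   Listing title \n\t 2 rooms   |  89 m2 \n   Pudong  \n"

print(normalize_text(raw_item))
for page_url in page_urls("http://sh.lianjia.com/ershoufang/pudong", 3):
    print(page_url)
```

Applying normalize_text to the result of xpath('string(.)') would print each listing on a single line, which is easier to read and to split into fields later.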


