Parsing HTML and XHTML with HTMLParser and BeautifulSoup

  • 2020-04-02 13:43:40
  • OfStack

I. Parsing web pages with HTMLParser
For details, see the official HTMLParser documentation: http://docs.python.org/library/htmlparser.html#HTMLParser.HTMLParser

1. Start with a simple parsing example
Example 1:
The contents of test1.html are as follows:


<html> 
<head> 
<title>XHTML is not much different from the HTML 4.01 standard</title> 
</head> 
<body> 
i love you 
</body> 
</html>

Here's an example of a program that lists title and body:


##@ Small WuYi: 
## HTMLParser example 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] # tags whose text is extracted 
        self.processing=None # tag currently being collected, or None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            # Use self rather than the global tp so that any instance works 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read())

The running result is:
title: XHTML is not much different from the HTML 4.01 standard
body:
i love you
The program defines a TitleParser class that inherits from HTMLParser. The feed method passes data to the parser, which calls the handler methods defined on the object as it recognizes each piece of markup. handle_starttag and handle_endtag are called for start and end tags. handle_data is called for text and stores it only when self.processing is not None, that is, only while the parser is inside a title or body tag.
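The same start tag / data / end tag flow can be sketched for Python 3, where the module is named html.parser. This is only a minimal sketch under that assumption: the class name TagTextCollector, the results dictionary and the inline SAMPLE string are hypothetical, and the text is stored instead of printed so that it can be inspected afterwards.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside selected tags (hypothetical helper)."""

    def __init__(self, wanted=("title", "body")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # tag currently being collected, if any
        self.buffer = []      # text chunks seen inside the current tag
        self.results = {}     # tag name -> collected text

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        # Text is kept only while we are inside one of the wanted tags.
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.results[tag] = "".join(self.buffer).strip()
            self.current = None


SAMPLE = """<html>
<head>
<title>XHTML is not much different from the HTML 4.01 standard</title>
</head>
<body>
i love you
</body>
</html>"""

parser = TagTextCollector()
parser.feed(SAMPLE)
parser.close()
for tag, text in parser.results.items():
    print(tag + ":" + text)
```

Keeping the text in a dictionary rather than printing inside handle_endtag also avoids any dependence on a global parser variable.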

2. Handling HTML entities
(commonly used character entities in HTML)
(1) Entity names
The example above cannot handle entities in HTML. To see this, change test1.html to the following:
Example 2:


<html> 
<head> 
<title>XHTML is not much different from the &quot;HTML 4.01&quot; standard</title> 
</head> 
<body> 
i love you&times; 
</body> 
</html>

Parsing this file with the program above gives:
title: XHTML is not much different from the HTML 4.01 standard
body:
i love you
The entities have disappeared completely. This is because HTMLParser calls the handle_entityref() method whenever it encounters a named entity, and since that method is not defined in the code, nothing is done with it. The modified code is as follows:


##@ Small WuYi: 
##HTMLParser Example: solving entity problems  
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        # Known entity names are replaced by their characters 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            # Unknown names are passed through unchanged 
            self.handle_data('&'+name+';') 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read())

The running result is:
title: XHTML is not much different from the "HTML 4.01" standard
body:
i love you×
Now all the named entities are shown.
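For readers on Python 3, the same idea can be sketched as follows. It assumes the Python 3 module names html.parser and html.entities; the class EntityAwareParser and the function parse_text are hypothetical names. In Python 3 the parser converts character references itself by default, so the sketch passes convert_charrefs=False to keep the handle_entityref callback active.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Collect text and resolve named entities by hand (illustrative only)."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref being called.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity name: append the character it stands for.
            self.chunks.append(chr(name2codepoint[name]))
        else:
            # Unknown entity name: keep the original text unchanged.
            self.chunks.append("&" + name + ";")


def parse_text(markup):
    parser = EntityAwareParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(parse_text("<p>i love you&times; &quot;ok&quot; &bogus;</p>"))
```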

(2) Entity numbers
Example 3:


<html> 
<head> 
<title>XHTML is not much different from the &quot;HTML 4.01&quot; standard</title> 
</head> 
<body> 
i love&#247; you&times; 
</body> 
</html>

Running the code from Example 2 on this file gives:

title: XHTML is not much different from the "HTML 4.01" standard
body:
i love you×
The character for &#247; (÷) is not displayed.
Add a handle_charref() method to handle it. The code is as follows:


##@ Small WuYi: 
##HTMLParser Example: solving entity problems  
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self) 
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 
    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read())

The running result is:
title: XHTML is not much different from the "HTML 4.01" standard
body:
i love÷ you×
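The handle_charref method above accepts only decimal numbers from 1 to 255. A numeric reference can also be written in hexadecimal (for example &#xF7;) and can point beyond 255. The following is a minimal sketch of a more general decoder, assuming Python 3, where chr covers the whole Unicode range; the function name decode_charref is hypothetical.

```python
def decode_charref(name):
    """Decode the text between '&#' and ';', e.g. '247' or 'xF7'.

    Returns the original reference text when the number is not valid.
    """
    try:
        if name[:1] in ("x", "X"):
            codepoint = int(name[1:], 16)   # hexadecimal form
        else:
            codepoint = int(name, 10)       # decimal form
        return chr(codepoint)
    except (ValueError, OverflowError):
        # Not a number, or outside the Unicode range: leave it untouched.
        return "&#" + name + ";"


print(decode_charref("247"))
print(decode_charref("xF7"))
print(decode_charref("zzz"))
```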

3. Extract links
Example 4:


<html> 
<head> 
<title>XHTML is not much different from the &quot;HTML 4.01&quot; standard</title> 
</head> 
<body> 
<a href="http://pypi.python.org/pypi" title="link1">i love&#247; you&times;</a> 
</body> 
</html>

In handle_starttag(self,tag,attrs), attrs holds the attributes of the tag as a list of (name, value) pairs. When tag is 'a', you only need to pick out the value whose name is 'href'. The code is as follows:


##@ Small WuYi: 
##HTMLParser Example: extract links  
# -*- coding: cp936 -*- 
from htmlentitydefs import entitydefs 
import HTMLParser 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
        if tag =='a': 
            for name,value in attrs: 
                if name=='href': 
                    print ' Connection address: '+value 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 
    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read())

The running result is:
title: XHTML is not much different from the "HTML 4.01" standard
 Connection address: http://pypi.python.org/pypi
body:
i love÷ you×
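Since attrs is a list of (name, value) tuples, converting it to a dictionary makes attribute lookups shorter. The following is a minimal Python 3 sketch under that assumption; the class name LinkCollector and the inline test markup are hypothetical, and links are stored in a list instead of being printed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record (href, title) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; a dict simplifies lookups.
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.links.append((attr_map["href"], attr_map.get("title")))


collector = LinkCollector()
collector.feed(
    '<body><a href="http://pypi.python.org/pypi" title="link1">i love you</a>'
    '<a name="top">no href here</a></body>'
)
collector.close()
print(collector.links)
```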

4. Extract images
If the page contains an image, download it and save it as a separate file.
Example 5:


<html> 
<head> 
<title>XHTML is not much different from the &quot;HTML 4.01&quot; standard</title> 
</head> 
<body> 
i love&#247; you&times; 
<a href="http://pypi.python.org/pypi" title="link1"> i miss you </a> 
<div id="m"><img src="http://www.baidu.com/img/baidu_sylogo1.gif" width="270" height="129" ></div> 
</body> 
</html>

The following code downloads baidu_sylogo1.gif and saves it:


##@ Small WuYi: 
##HTMLParser Example: extracting pictures  
# -*- coding: cp936 -*- 
from htmlentitydefs import entitydefs 
import HTMLParser,urllib 
def getimage(addr):# Extract the image and store it in the current directory  
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+' Generated! ' 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['title','body'] 
        self.processing=None 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            self.data='' 
            self.processing=tag 
        if tag =='a': 
            for name,value in attrs: 
                if name=='href': 
                    print ' Connection address: '+value 
        if tag=='img': 
            for name,value in attrs: 
                if name=='src': 
                    getimage(value) 
    def handle_data(self,data): 
        if self.processing: 
            self.data +=data 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print str(tag)+':'+str(self.gettitle()) 
            self.processing=None 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 
    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 
    def gettitle(self): 
        return self.data 
fd=open('test1.html') 
tp=TitleParser() 
tp.feed(fd.read())

The running result is:
title: XHTML is not much different from the "HTML 4.01" standard
 Connection address: http://pypi.python.org/pypi
baidu_sylogo1.gif Generated! 
body:
i love÷ you×
 i miss you
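getimage derives the file name with addr.split('/')[-1], which works for the URL in this example but keeps any query string and returns an empty name for a URL that ends in a slash. The following is a small sketch of a more defensive version, assuming Python 3 and its urllib.parse module; the function name filename_from_url and the default name download.bin are hypothetical. It performs no network access.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a fixed name when the path ends with '/'.
    return name or default


print(filename_from_url("http://www.baidu.com/img/baidu_sylogo1.gif"))
print(filename_from_url("http://example.com/a/b.png?size=1/2"))
print(filename_from_url("http://example.com/"))
```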

5. A practical example
Example 6: get the links on the renren home page. The code is as follows:


##@ Small WuYi: 
##HTMLParser Example: get the links on renren's home page  
#coding: utf-8 
from htmlentitydefs import entitydefs 
import HTMLParser,urllib 
def getimage(addr): 
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+' Generated! ' 
class TitleParser(HTMLParser.HTMLParser): 
    def __init__(self): 
        self.taglevels=[] 
        self.handledtags=['a'] 
        self.processing=None 
        self.linkstring='' 
        self.linkaddr='' 
        HTMLParser.HTMLParser.__init__(self)        
    def handle_starttag(self,tag,attrs): 
        if tag in self.handledtags: 
            for name,value in attrs: 
                if name=='href': 
                    self.linkaddr=value 
            self.processing=tag 
    def handle_data(self,data): 
        if self.processing: 
            self.linkstring +=data 
            #print data.decode('utf-8')+':'+self.linkaddr 
    def handle_endtag(self,tag): 
        if tag==self.processing: 
            print self.linkstring.decode('utf-8')+':'+self.linkaddr 
            self.processing=None 
            self.linkstring='' 
    def handle_entityref(self,name): 
        if name in entitydefs: 
            self.handle_data(entitydefs[name]) 
        else: 
            self.handle_data('&'+name+';') 
    def handle_charref(self,name): 
        try: 
            charnum=int(name) 
        except ValueError: 
            return 
        if charnum<1 or charnum>255: 
            return 
        self.handle_data(chr(charnum)) 
    def gettitle(self): 
        return self.linkaddr 
tp=TitleParser() 
tp.feed(urllib.urlopen('http://www.renren.com/').read())

Operation results:
Sharing: http://share.renren.com
Application :http://app.renren.com
Public homepage :http://page.renren.com
Everyone lives :http://life.renren.com
Everyone group: http://xiaozu.renren.com/
Same name :http://name.renren.com
All high school: http://school.renren.com/allpages.html
University of wikipedia: http://school.renren.com/daxue/
Everyone is hot: http://life.renren.com/hot
Renren xiaozhan :http://zhan.renren.com/
Shopping for everyone :http://j.renren.com/
All the school recruit: http://xiaozhao.renren.com/
: http://www.renren.com.
Registration: http://wwv.renren.com/xn.do? Ss = 10113 & rt = 27
Login: http://www.renren.com/
Help: http://support.renren.com/helpcenter
Give us advice: http://support.renren.com/link/suggest
More: #
Javascript: closeError ();
Open email for confirmation :#
Reenter :javascript:closeError();
Javascript: closeStop ();
Customer service: http://help.renren.com/#http://help.renren.com/support/contomvice? Pid = 2 & selection = {couId: 193, proId: 342, cityId: 1000375}
Javascript: closeLock ();
Immediately unlock: http://safe.renren.com/relive.do
Forget your password? : http://safe.renren.com/findPass.do.
Forget your password? : http://safe.renren.com/findPass.do.
Change one: javascript: refreshCode_login ();
MSN: #
360: https://openapi.360.cn/oauth2/authorize? client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default
Physical: https://oauth.api.189.cn/emp/oauth2/authorize? app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tyLoginCallBack
Why fill in my birthday? : # birthday
Can't get another one? Javascript: refreshCode ();
Want to know more about the people network? Click here :javascript:;
: javascript:;
: javascript:;
Immediately registered: http://reg.renren.com/xn6245.do? Ss = 10113 & rt = 27
About: http://www.renren.com/siteinfo/about
Open platform :http://dev.renren.com
Renren games :http://wan.renren.com
Public home page: http://page.renren.com/register/regGuide/
Cell phone everyone: http://mobile.renren.com/mobilelink.do? PSF = 40002
Group: http://www.nuomi.com
All happy net: http://www.jiexi.com
Marketing services :http://ads.renren.com
Recruitment: http://job.renren-inc.com/
The help of the service: http://support.renren.com/helpcenter
Privacy: http://www.renren.com/siteinfo/privacy
Beijing ICP certificate no. 090254: http://www.miibeian.gov.cn/
The Internet drug information service certificate: http://a.xnimg.cn/n/core/res/certificate.jpg
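The list above mixes real pages with entries such as "#" and "javascript:" pseudo-links. If only navigable pages are wanted, the collected pairs could be filtered afterwards. The following is a minimal sketch, assuming Python 3 and urllib.parse; the function name navigable_links and the sample pairs are hypothetical (the "/siteinfo/about" relative link is made up to show how a relative address is resolved).

```python
from urllib.parse import urljoin, urlsplit


def navigable_links(pairs, base_url):
    """Keep only http(s) links and make relative ones absolute.

    pairs is an iterable of (text, href) tuples.
    """
    result = []
    for text, href in pairs:
        href = (href or "").strip()
        # Skip empty links and pure in-page anchors such as '#'.
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        # Skip javascript:, mailto: and other non-web schemes.
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        result.append((text.strip(), absolute))
    return result


sample_pairs = [
    ("More", "#"),
    ("Reenter", "javascript:closeError();"),
    ("Help", "http://support.renren.com/helpcenter"),
    ("About", "/siteinfo/about"),
]
for text, url in navigable_links(sample_pairs, "http://www.renren.com/"):
    print(text + ":" + url)
```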

II. Parsing web pages with BeautifulSoup
1. Download and install BeautifulSoup
Download address: http://www.crummy.com/software/BeautifulSoup/download/3.x/
Chinese documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Entity%20Conversion
Installation: unzip the downloaded file; the resulting folder contains a setup.py file. From the command line, change to that folder and run python setup.py install. Once installed, BeautifulSoup can be imported directly in Python.
2. Start with a simple parsing example
Example 7:
The HTML is the same as in Example 5.

The code to get the title:


##@ Small WuYi: 
##BeautifulSoup Example: title 
#coding: utf8 
import BeautifulSoup 
a=open('test1.html','r') 
htmlline=a.read() 
soup=BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312')) 
#print soup.prettify()# The canonical html file  
titleTag=soup.html.head.title 
print titleTag.string


The running result is:
XHTML is not much different from the &quot;HTML 4.01&quot; standard
The code and its result raise two points:
First, when initializing with BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312')), pay attention to the character encoding. Following examples found online, I first used utf-8 and the output was garbled; after changing to gb2312 it displayed correctly. You can also check the encoding of the original document with soup.originalEncoding.
Second, the character entities in the result are not converted. The BeautifulSoup Chinese documentation has a section on entity conversion. After changing the code above to the following, the result is displayed correctly:


##@ Small WuYi: 
##BeautifulSoup Example: title 
#coding: utf8 
import BeautifulSoup 
a=open('test1.html','r') 
htmlline=a.read() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
#print soup.prettify()# The canonical html file  
titleTag=soup.html.head.title 
print titleTag.string

Here convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES is passed; ALL_ENTITIES covers both XML and HTML entities. You can also use XML_ENTITIES or HTML_ENTITIES instead. The running result is:
XHTML is not much different from the "HTML 4.01" standard
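To see what entity conversion does to a string, independently of BeautifulSoup, the Python 3 standard library offers html.unescape. The sketch below assumes Python 3 and is not BeautifulSoup's own conversion; the variable names are hypothetical.

```python
import html

raw_title = "XHTML is not much different from the &quot;HTML 4.01&quot; standard"
raw_body = "i love&#247; you&times;"

# html.unescape resolves both named entities and numeric character references.
clean_title = html.unescape(raw_title)
clean_body = html.unescape(raw_body)

print(clean_title)
print(clean_body)
```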
3. Extract links
In the example above, the code changes to:


##@ Small WuYi: 
##BeautifulSoup Example: extract links  
#coding: utf8 
import BeautifulSoup 
a=open('test1.html','r') 
htmlline=a.read() 
a.close() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
name=soup.find('a').string 
links=soup.find('a')['href'] 
print name+':'+links

The running result is:
i miss you:http://pypi.python.org/pypi
4. Extract images
The same example is used again, this time to download the baidu image.
The code is:


##@ Small WuYi: http://www.cnblogs.com/xiaowuyi
#coding: utf8 
import BeautifulSoup,urllib 
def getimage(addr):# Extract the image and store it in the current directory  
    u = urllib.urlopen(addr) 
    data = u.read() 
    filename=addr.split('/')[-1] 
    f=open(filename,'wb') 
    f.write(data) 
    f.close() 
    print filename+' finished!' 
a=open('test1.html','r') 
htmlline=a.read() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('gb2312'),convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES) 
links=soup.find('img')['src'] 
getimage(links)

Both link extraction and image extraction rely on the find method, whose signature is:
find(name, attrs, recursive, text, **kwargs)
findAll returns every matching item, while find returns only the first. Note that findAll returns a list.
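The difference between "first match" and "all matches" can be illustrated without BeautifulSoup. The following is a standard-library analogue written for Python 3, not BeautifulSoup's API: the functions find and find_all here are hypothetical helpers that return attribute dictionaries rather than tag objects.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect the attributes of every start tag with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            self.matches.append(dict(attrs))


def find_all(markup, name):
    """Return a list with the attributes of every matching tag."""
    collector = _TagCollector(name)
    collector.feed(markup)
    collector.close()
    return collector.matches


def find(markup, name):
    """Return the attributes of the first matching tag, or None."""
    matches = find_all(markup, name)
    return matches[0] if matches else None


markup = '<a href="/one">1</a><a href="/two">2</a>'
print(find(markup, "a"))
print(find_all(markup, "a"))
print(find(markup, "img"))
```

As with findAll, the list form is always safe to iterate over, while the single-result form has to be checked for None before indexing.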
5. A practical example
Get every link address on the renren home page. The code is as follows:


##@ Small WuYi: 
##BeautifulSoup Example: get the links on renren's home page  
#coding: utf8 
import BeautifulSoup,urllib 
linkname='' 
htmlline=urllib.urlopen('http://www.renren.com/').read() 
soup=BeautifulSoup.BeautifulStoneSoup(htmlline.decode('utf-8')) 
links=soup.findAll('a') 
for i in links: 
    ## Check whether this <a> tag has an href attribute. 
    if i.get('href') is not None: 
        linkname=i.string 
        linkaddr=i['href'] 
        if linkname is None:# i.string is None when the tag has no single text child 
            print linkaddr 
        else: 
            print linkname+':'+linkaddr

Operation results:
Sharing: http://share.renren.com
Application :http://app.renren.com
Public homepage :http://page.renren.com
Everyone lives :http://life.renren.com
Everyone group: http://xiaozu.renren.com/
Same name :http://name.renren.com
All high school: http://school.renren.com/allpages.html
University of wikipedia: http://school.renren.com/daxue/
Everyone is hot: http://life.renren.com/hot
Renren xiaozhan :http://zhan.renren.com/
Shopping for everyone :http://j.renren.com/
All the school recruit: http://xiaozhao.renren.com/
http://www.renren.com
Registration: http://wwv.renren.com/xn.do? Ss = 10113 & rt = 27
Login: http://www.renren.com/
Help: http://support.renren.com/helpcenter
Give us advice: http://support.renren.com/link/suggest
More: #
Javascript: closeError ();
Open email for confirmation :#
Reenter :javascript:closeError();
Javascript: closeStop ();
Customer service: http://help.renren.com/#http://help.renren.com/support/contomvice? Pid = 2 & selection = {couId: 193, proId: 342, cityId: 1000375}
Javascript: closeLock ();
Immediately unlock: http://safe.renren.com/relive.do
Forget your password? : http://safe.renren.com/findPass.do.
Forget your password? : http://safe.renren.com/findPass.do.
Change one: javascript: refreshCode_login ();
MSN: #
360: https://openapi.360.cn/oauth2/authorize? client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default
Physical: https://oauth.api.189.cn/emp/oauth2/authorize? app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tyLoginCallBack
# birthday
Can't get another one? Javascript: refreshCode ();
Javascript:;
Javascript:;
Immediately registered: http://reg.renren.com/xn6245.do? Ss = 10113 & rt = 27
About: http://www.renren.com/siteinfo/about
Open platform :http://dev.renren.com
Renren games :http://wan.renren.com
Public home page: http://page.renren.com/register/regGuide/
Cell phone everyone: http://mobile.renren.com/mobilelink.do? PSF = 40002
Group: http://www.nuomi.com
All happy net: http://www.jiexi.com
Marketing services :http://ads.renren.com
Recruitment: http://job.renren-inc.com/
The help of the service: http://support.renren.com/helpcenter
Privacy: http://www.renren.com/siteinfo/privacy
Beijing ICP certificate no. 090254: http://www.miibeian.gov.cn/
The Internet drug information service certificate: http://a.xnimg.cn/n/core/res/certificate.jpg

