Using pyquery, a Python library for parsing HTML

2020-04-02 13:24:59
OfStack

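For example, suppose a page contains the following HTML, and we want the text of every element with the class 'pl':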


<div id="info">
<span><span class='pl'> The director </span>: <a href="/celebrity/1047989/" rel="v:directedBy"> Tom tykwer </a> / <a href="/celebrity/1161012/" rel="v:directedBy"> Lana wachowski </a> / <a href="/celebrity/1013899/" rel="v:directedBy"> Andy wachowski </a></span><br/>
<span><span class='pl'> The writers </span>: <a href="/celebrity/1047989/"> Tom tykwer </a> / <a href="/celebrity/1013899/"> Andy wachowski </a> / <a href="/celebrity/1161012/"> Lana wachowski </a></span><br/>
<span><span class='pl'> starring </span>: <a href="/celebrity/1054450/" rel="v:starring"> Tom Hanks </a> / <a href="/celebrity/1054415/" rel="v:starring"> Halle berry </a> / <a href="/celebrity/1019049/" rel="v:starring"> Jim broadbent </a> / <a href="/celebrity/1040994/" rel="v:starring"> Hugo weaving </a> / <a href="/celebrity/1053559/" rel="v:starring"> Jim sturgis </a> / <a href="/celebrity/1057004/" rel="v:starring"> Bae doona </a> / <a href="/celebrity/1025149/" rel="v:starring"> This ・ WeiXiao </a> / <a href="/celebrity/1049713/" rel="v:starring"> James darcy </a> / <a href="/celebrity/1027798/" rel="v:starring"> Zhou xun </a> / <a href="/celebrity/1019012/" rel="v:starring"> Keith David </a> / <a href="/celebrity/1201851/" rel="v:starring"> David gyasi </a> / <a href="/celebrity/1054392/" rel="v:starring"> Susan sarandon </a> / <a href="/celebrity/1003493/" rel="v:starring"> Hugh grant </a></span><br/>
<span class="pl"> type :</span> <span property="v:genre"> The plot </span> / <span property="v:genre"> Science fiction </span> / <span property="v:genre"> The suspense </span><br/>
<span class="pl"> The official website :</span> <a href="http://cloudatlas.warnerbros.com" rel="nofollow" target="_blank">cloudatlas.warnerbros.com</a><br/>
<span class="pl"> Producer country / region :</span>  Germany  /  The United States  /  Hong Kong  /  Singapore <br/>
<span class="pl"> language :</span>  English <br/>
<span class="pl"> Release date :</span> <span property="v:initialReleaseDate" content="2013-01-31( The Chinese mainland )">2013-01-31( The Chinese mainland )</span> / <span property="v:initialReleaseDate" content="2012-10-26( The United States )">2012-10-26( The United States )</span><br/>
<span class="pl"> Running time :</span> <span property="v:runtime" content="134">134 minutes ( The Chinese mainland )</span> / 172 minutes ( The United States )<br/>
<span class="pl">IMDb link :</span> <a href="http://www.imdb.com/title/tt1371111" target="_blank" rel="nofollow">tt1371111</a><br>
<span class="pl"> The official station :</span>
<a href="http://site.douban.com/202494/" target="_blank"> Cloud atlas </a>
</div>


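from pyquery import PyQuery as pq

# Fetch the page and select every element with class "pl"
doc = pq(url='http://movie.douban.com/subject/3530403/')
data = doc('.pl')
for i in data:
    # Iteration yields lxml elements; wrap each one to use text()
    print(pq(i).text())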

Output:


 The director 
 The writers 
 starring 
 type :
 The official website :
 Producer country / region :
 language :
 Release date :
 Running time :
IMDb link :
 The official station :

Usage

You can use the PyQuery class to load an XML or HTML document from a string, an lxml object, a file, or a URL:


>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> doc=pq("<html></html>")
>>> doc=pq(etree.fromstring("<html></html>"))
>>> doc=pq(filename=path_to_html_file)
>>> doc=pq(url='http://movie.douban.com/subject/3530403/')

You can now select elements the way you would with jQuery:


>>> doc('.pl')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]

This selects every element with the class 'pl'.

However, iterating over a selection yields plain lxml elements, so each one must be wrapped in PyQuery again before you can call methods such as text():


for para in doc('.pl'):
    para = pq(para)  # wrap the lxml element so PyQuery methods are available
    print(para.text())

Output:

 The director 
 The writers 
 starring 
 type :
 The official website :
 Producer country / region :
 language :
 Release date :
 Running time :
IMDb link :
 The official station :

The text returned here is a Unicode string; when you write it to a file, choose an encoding explicitly (for example UTF-8).
pyquery also supports some jQuery pseudo-classes that are not part of standard CSS, such as :first:


>>> doc('.pl:first')
[<span.pl>]
>>> print(doc('.pl:first').text())
 The director

Attributes
Get the attributes of an HTML element:


>>> p=pq('<p id="hello" class="hello"></p>')('p')
>>> p.attr('id')
'hello'
>>> p.attr.id
'hello'
>>> p.attr['id']
'hello'

Set attributes:


>>> p.attr.id='plop'
>>> p.attr.id
'plop'
>>> p.attr['id']='ola'
>>> p.attr.id
'ola'
>>> p.attr(id='hello',class_='hello2')
[<p#hello.hello2>]

Traversing
Filter a selection with a selector or a function:


>>> d=pq('<p id="hello" class="hello"><a/>hello</p><p id="test"><a/>world</p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
>>> d('p').filter('#test')
[<p#test>]
>>> d('p').filter(lambda i:i==1)
[<p#test>]
>>> d('p').filter(lambda i:i==0)
[<p#hello.hello>]
>>> d('p').filter(lambda i:pq(this).text()=='hello')
[<p#hello.hello>]

Select by index:


>>> d('p').eq(0)
[<p#hello.hello>]
>>> d('p').eq(1)
[<p#test>]

Select nested (descendant) elements:


>>> d('p').eq(1).find('a')
[<a>]

Return to the previous selection with end():


>>> d=pq('<p><span><em>Whoah!</em></span></p><p><em> there</em></p>')
>>> d('p').eq(1).find('em')
[<em>]
>>> d('p').eq(1).find('em').end()
[<p>]
>>> d('p').eq(1).find('em').end().text()
'there'
>>> d('p').eq(1).find('em').end().end()
[<p>, <p>]