A simple Python crawler that downloads Taobao pictures
- 2020-04-02 14:26:05
- OfStack
Here is a crawler that grabs Taobao pictures, written with nothing but if, for and while statements — quite simple, a beginner's piece.
It extracts photos of Taobao models from the page http://mm.taobao.com/json/request_top_list.htm?type=0&page= (the page number is appended to the end of the URL).
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0  # page 2 contains a personal page with no picture, which raises an IOError
while i < 15:
    url = mmurl + str(i)
    # print url  # the list-page URL
    up = urllib2.urlopen(url)  # open the list page and keep the handle
    cont = up.read()
    # print len(cont)  # page length
    ahref = '<a href="http'  # keyword marking the start of a link in the page
    target = "target"
    pa = cont.find(ahref)       # position of the link's head
    pt = cont.find(target, pa)  # position of the link's tail
    for a in range(0, 20):  # 20 is hard-coded; how could we detect the end instead?
        urlx = cont[pa + len(ahref) - 4:pt - 2]  # slice the page link out, head to tail
        if len(urlx) < 60:  # keep only links of a plausible length (note the len()!)
            urla = urlx
            print urla  # this is the model's personal URL
            # ----- now operate on the model's personal URL -----
            mup = urllib2.urlopen(urla)  # open the personal page and keep the handle
            mcont = mup.read()           # read the page into the string mcont
            imgh = "<img style="  # keyword marking the start of a picture link
            imgt = ".jpg"
            iph = mcont.find(imgh)       # position of the picture link's head
            ipt = mcont.find(imgt, iph)  # position of the picture link's tail
            for b in range(0, 10):  # hard-coded again
                mpic = mcont[iph:ipt + len(imgt)]  # raw picture link, still noisy
                iph1 = mpic.find("http")  # filter the link once more
                ipt1 = mpic.find(imgt)    # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                # some "URLs" look like http:ss.png><dfsdf>.jpg;
                # a threshold of 100 would reject valid links, so use 150
                if len(picx) < 150:  # test len(picx), not picx, or nothing prints
                    pica = picx
                    print pica
                    # download pica; the three loop counters keep file names unique
                    # (note: "pic\tb" would contain a TAB, so escape the backslash)
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                iph = mcont.find(imgh, iph + len(imgh))  # advance to the next picture link
                ipt = mcont.find(imgt, iph)
            # ----- picture links inside the personal URL all extracted -----
        pa = cont.find(ahref, pa + len(ahref))  # from the previous head, find the next head
        pt = cont.find(target, pa)              # and keep looking for the next tail
    i += 1
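The loop counts of 20 links and 10 pictures are hard-coded because plain find() cannot tell when the page runs out of matches. A regular expression collects every match in one pass. Here is a minimal Python 3 sketch of the same extraction idea — the HTML snippet and patterns are illustrative assumptions, not the live Taobao markup:

```python
import re

# Illustrative HTML standing in for the downloaded page content.
html = (
    '<a href="http://example.com/model1" target="_blank">m1</a>'
    '<img style="x" src="http://example.com/p1.jpg">'
    '<a href="http://example.com/model2" target="_blank">m2</a>'
    '<img style="x" src="http://example.com/p2.jpg">'
)

# Every personal-page link at once -- no hard-coded loop count needed.
links = re.findall(r'<a href="(http[^"]+)" target', html)
print(links)  # ['http://example.com/model1', 'http://example.com/model2']

# Every .jpg picture URL, likewise.
pics = re.findall(r'src="(http[^"]+?\.jpg)"', html)
print(pics)   # ['http://example.com/p1.jpg', 'http://example.com/p2.jpg']
```

Iterating over findall()'s result list replaces both the hard-coded range() loops and the manual head/tail bookkeeping.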
Quite simple, isn't it? With small modifications you can use it to crawl other content as well...
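One subtle pitfall worth noting in the save path: in a string literal like "pic\tb...", Python reads \t as a tab character, not as pic\ followed by tb. Escaping the backslash, or building the path with os.path.join, avoids the problem — the file names below are made up for illustration:

```python
import os

bad = "pic\tb0.jpg"   # \t is a TAB, so this is really "pic<TAB>b0.jpg"
print("\t" in bad)    # True

good = "pic\\tb0.jpg"                      # escaped backslash: pic\tb0.jpg
portable = os.path.join("pic", "tb0.jpg")  # lets the OS pick the separator
print(good, portable)
```

os.path.join also keeps the script working on non-Windows systems, where the path separator is / rather than \.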