A simple Python crawler that scrapes Taobao pictures

  • 2020-04-02 14:26:05
  • OfStack

I wrote a crawler to grab Taobao pictures. It uses nothing but if, for, and while, so it is a fairly simple, beginner-level piece of work.

It extracts photos of Taobao models from the listing page http://mm.taobao.com/json/request_top_list.htm?type=0&page=.
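The whole extraction below is done with plain str.find() and slicing, with no HTML parser. A minimal Python 3 sketch of that core technique (the sample HTML string here is invented for illustration, not real Taobao markup):

```python
# Extract the first href from a chunk of HTML using only find() and slicing,
# the same technique the crawler below relies on.
html = '<p>intro</p><a href="http://example.com/model1" target="_blank">m1</a>'

head = '<a href="'  # marker that precedes the URL
tail = '"'          # the quote that closes the URL

start = html.find(head) + len(head)  # index just past the marker
end = html.find(tail, start)         # the closing quote after it
link = html[start:end]
print(link)  # http://example.com/model1
```

This works as long as the markers are unique enough; the real page needs extra length checks (as the crawler below applies) because find() happily matches noise.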


# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0  # page 2 links to a personal page without pictures, which raises an IOError
while i < 15:
    url = mmurl + str(i)
    # print url  # print the listing-page URL
    up = urllib2.urlopen(url)  # open the page and keep the handle
    cont = up.read()
    # print len(cont)  # page length
    ahref = '<a href="http'  # keyword marking the start of a link in the page
    target = "target"
    pa = cont.find(ahref)       # locate the start of the link
    pt = cont.find(target, pa)  # locate the end of the link
    for a in range(0, 20):  # hard-coded to 20; how would one detect the real end?
        urlx = cont[pa + len(ahref) - 4:pt - 2]  # slice the link out, head to tail
        if len(urlx) < 60:  # keep only links of a plausible length (note len()!)
            urla = urlx
            print urla  # this is the model's personal URL
            # ---- now work on the model's personal page ----
            mup = urllib2.urlopen(urla)  # open the personal page, keep the handle
            mcont = mup.read()           # read the page into the string mcont
            imgh = "<img style="  # keyword marking the start of an image link
            imgt = ".jpg"
            iph = mcont.find(imgh)       # locate the start of the image link
            ipt = mcont.find(imgt, iph)  # locate the end of the image link
            for b in range(0, 10):  # hard-coded again
                mpic = mcont[iph:ipt + len(imgt)]  # raw image link, still noisy
                iph1 = mpic.find("http")  # filter the link a second time
                ipt1 = mpic.find(imgt)    # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                if len(picx) < 150:  # some URLs look like "http:ss.png><dfsdf>.jpg"; a limit of 100 rejects valid ones
                    pica = picx  # test len(picx) < 150, not picx itself, or nothing prints
                    print pica
                    # download pica; the loop counters i, a, b keep file names unique
                    # (the "pic" directory must already exist; "pic\tb" would embed a tab character)
                    urllib.urlretrieve(pica, "pic/tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                iph = mcont.find(imgh, iph + len(imgh))  # advance to the next image link
                ipt = mcont.find(imgt, iph)
            # ---- done extracting image links from this personal page ----
        pa = cont.find(ahref, pa + len(ahref))  # advance from the old start to the next link
        pt = cont.find(target, pa)              # and find the matching end
    i += 1
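The comments above ask how to avoid the hard-coded range(0, 20) and range(0, 10). One common answer is to loop until find() returns -1, which signals there are no more matches. A hedged Python 3 sketch of that pattern (the helper name and sample markup are invented for illustration):

```python
def find_all_between(text, head, tail):
    """Collect every substring between head and tail markers,
    stopping when find() returns -1 instead of using a fixed count."""
    results = []
    pos = text.find(head)
    while pos != -1:
        start = pos + len(head)
        end = text.find(tail, start)
        if end == -1:  # no closing marker left: stop
            break
        results.append(text[start:end])
        pos = text.find(head, end)  # search for the next match after this one
    return results

# Illustrative sample markup, not real Taobao HTML.
page = '<img src="a.jpg"><img src="b.jpg"><img src="c.jpg">'
print(find_all_between(page, '<img src="', '"'))
# ['a.jpg', 'b.jpg', 'c.jpg']
```

With a helper like this, both inner loops of the crawler could iterate over exactly the matches the page contains, however many there are.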

Isn't that simple? With a few small changes you can adapt it to crawl other content.

