A simple Python crawler that downloads Taobao pictures
- 2020-04-02 14:26:05
- OfStack
Here is a crawler that grabs Taobao pictures, written with nothing but if, for and while statements — quite simple, a beginner's piece.
It extracts photos of Taobao models from the page http://mm.taobao.com/json/request_top_list.htm?type=0&page= (the page number is appended to the end of the URL).
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0  # page 2 contains a personal page with no picture, which raises an IOError
while i < 15:
    url = mmurl + str(i)
    # print url  # the list-page URL
    up = urllib2.urlopen(url)  # open the list page and keep the handle
    cont = up.read()
    # print len(cont)  # page length
    ahref = '<a href="http'  # keyword marking the start of a link in the page
    target = "target"
    pa = cont.find(ahref)       # position of the link's head
    pt = cont.find(target, pa)  # position of the link's tail
    for a in range(0, 20):  # 20 is hard-coded; how could we detect the end instead?
        urlx = cont[pa + len(ahref) - 4:pt - 2]  # slice the page link out, head to tail
        if len(urlx) < 60:  # keep only links of a plausible length (note the len()!)
            urla = urlx
            print urla  # this is the model's personal URL
            # ----- now operate on the model's personal URL -----
            mup = urllib2.urlopen(urla)  # open the personal page and keep the handle
            mcont = mup.read()           # read the page into the string mcont
            imgh = "<img style="  # keyword marking the start of a picture link
            imgt = ".jpg"
            iph = mcont.find(imgh)       # position of the picture link's head
            ipt = mcont.find(imgt, iph)  # position of the picture link's tail
            for b in range(0, 10):  # hard-coded again
                mpic = mcont[iph:ipt + len(imgt)]  # raw picture link, still noisy
                iph1 = mpic.find("http")  # filter the link once more
                ipt1 = mpic.find(imgt)    # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                # some "URLs" look like http:ss.png><dfsdf>.jpg;
                # a threshold of 100 would reject valid links, so use 150
                if len(picx) < 150:  # test len(picx), not picx, or nothing prints
                    pica = picx
                    print pica
                    # download pica; the three loop counters keep file names unique
                    # (note: "pic\tb" would contain a TAB, so escape the backslash)
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                iph = mcont.find(imgh, iph + len(imgh))  # advance to the next picture link
                ipt = mcont.find(imgt, iph)
            # ----- picture links inside the personal URL all extracted -----
        pa = cont.find(ahref, pa + len(ahref))  # from the previous head, find the next head
        pt = cont.find(target, pa)              # and keep looking for the next tail
    i += 1
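The loop counts of 20 links and 10 pictures are hard-coded because plain find() cannot tell when the page runs out of matches. A regular expression collects every match in one pass. Here is a minimal Python 3 sketch of the same extraction idea — the HTML snippet and patterns are illustrative assumptions, not the live Taobao markup:

```python
import re

# Illustrative HTML standing in for the downloaded page content.
html = (
    '<a href="http://example.com/model1" target="_blank">m1</a>'
    '<img style="x" src="http://example.com/p1.jpg">'
    '<a href="http://example.com/model2" target="_blank">m2</a>'
    '<img style="x" src="http://example.com/p2.jpg">'
)

# Every personal-page link at once -- no hard-coded loop count needed.
links = re.findall(r'<a href="(http[^"]+)" target', html)
print(links)  # ['http://example.com/model1', 'http://example.com/model2']

# Every .jpg picture URL, likewise.
pics = re.findall(r'src="(http[^"]+?\.jpg)"', html)
print(pics)   # ['http://example.com/p1.jpg', 'http://example.com/p2.jpg']
```

Iterating over findall()'s result list replaces both the hard-coded range() loops and the manual head/tail bookkeeping.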
Quite simple, isn't it? With small modifications you can use it to crawl other content as well...
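One subtle pitfall worth noting in the save path: in a string literal like "pic\tb...", Python reads \t as a tab character, not as pic\ followed by tb. Escaping the backslash, or building the path with os.path.join, avoids the problem — the file names below are made up for illustration:

```python
import os

bad = "pic\tb0.jpg"   # \t is a TAB, so this is really "pic<TAB>b0.jpg"
print("\t" in bad)    # True

good = "pic\\tb0.jpg"                      # escaped backslash: pic\tb0.jpg
portable = os.path.join("pic", "tb0.jpg")  # lets the OS pick the separator
print(good, portable)
```

os.path.join also keeps the script working on non-Windows systems, where the path separator is / rather than \.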