Python crawler framework Scrapy in practice: crawling JD Mall (advanced)

  • 2020-05-30 20:33:16
  • OfStack


The previous article covered how to obtain the links and parameters; for details, see the general article on crawling JD Mall with Python. This article explains in detail how to crawl JD Mall with the Python crawler framework Scrapy.


1. scrapy.Request is used here. By default, Scrapy issues requests for the URLs in start_urls through the start_requests method; if you want to change that default behavior, you must override this method.

The code is as follows:

def start_requests(self):
    for i in range(1, 101):
        # JD numbers the first half of each listing page with an odd value
        page = i * 2 - 1
        # the line building `url` from `page` was lost from the original post
        # meta passes data to the callback; read it there via response.meta['search_page']
        yield scrapy.Request(url, meta={'search_page': page + 1},
                             callback=self.parse_url)
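As a quick sanity check, the odd-page arithmetic used in start_requests can be verified in plain Python:

```python
# JD splits each visible listing page into two halves: the first half uses an
# odd page number; the second (ajax-loaded) half uses the next even number.
pages = [i * 2 - 1 for i in range(1, 101)]   # 1, 3, 5, ..., 199

print(pages[0], pages[-1])                   # first and last odd page numbers
print(len(pages))                            # 100 requests in total
print(all(p % 2 == 1 for p in pages))        # every value is odd

# the value stored in meta is the even "second half" page number
second_half = [p + 1 for p in pages]         # 2, 4, ..., 200
print(second_half[:3])
```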

The next step is parsing the page. As shown above, the callback is parse_url, so the page is parsed in that function. As mentioned earlier, this URL only returns the first half of the page's results; to get the second half you have to issue another request. One tip here: when you parse out a list, don't take element [0] right away; check it with an if statement first, because if you get back an empty list [], taking [0] directly raises an error. This is simply a way to avoid that error.
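The empty-list tip can be seen with a minimal example: extract() returns a plain Python list, and indexing an empty list raises IndexError, so the truthiness check guards against it (the toy lists below stand in for real extract() results):

```python
# extract() returns a list; it is empty when the xpath matched nothing
img_url_src = []                                   # simulates a selector with no match
img_url_delay = ['https://example.com/lazy.jpg']   # simulates a match

item = {}
if img_url_src:            # False for an empty list, so [0] is never reached
    item['img_url'] = img_url_src[0]
if img_url_delay:          # non-empty list, safe to take the first element
    item['img_url'] = img_url_delay[0]

print(item)

# indexing an empty list directly would raise IndexError
try:
    _ = [][0]
except IndexError:
    print("IndexError: taking [0] of an empty extract() result fails")
```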

The code is as follows:

def parse_url(self, response):
    if response.status == 200:  # make sure the request succeeded
        # print(response.url)
        pids = set()  # collects the data-pid values; they are joined later to build the ajax request url
        all_goods = response.xpath("//div[@id='J_goodsList']/ul/li")  # grab the container of all products, then walk each one
        try:
            for goods in all_goods:  # parse one product per iteration
                # from scrapy.shell import inspect_response; inspect_response(response, self)  # debugging helper (garbled in the original)
                items = JdSpiderItem()  # the item holding the fields to scrape
                img_url_src = goods.xpath("div/div[1]/a/img/@src").extract()  # may be [], so do not take [0] here
                img_url_delay = goods.xpath(
                    "div/div[1]/a/img/@data-lazy-img").extract()  # lazy-loaded image, may also be []
                price = goods.xpath("div/div[3]/strong/i/text()").extract()  # the price
                cloths_name = goods.xpath("div/div[4]/a/em/text()").extract()
                shop_id = goods.xpath("div/div[7]/@data-shopid").extract()
                cloths_url = goods.xpath("div/div[1]/a/@href").extract()
                person_number = goods.xpath("div/div[5]/strong/a/text()").extract()
                pid = goods.xpath("@data-pid").extract()
                # product_id = goods.xpath("@data-sku").extract()
                if pid:
                    pids.add(pid[0])  # remember the pid for the later ajax request
                if img_url_src:  # only index [0] when the list is non-empty
                    print(img_url_src[0])
                    items['img_url'] = img_url_src[0]
                if img_url_delay:  # if only the lazy-loaded image is present, use its url instead
                    print(img_url_delay[0])
                    items['img_url'] = img_url_delay[0]
                if price:
                    items['price'] = price[0]
                if cloths_name:
                    items['cloths_name'] = cloths_name[0]
                if shop_id:
                    items['shop_id'] = shop_id[0]
                    # the base url was stripped from the original post
                    shop_url = "" + str(shop_id[0]) + ".html"
                    items['shop_url'] = shop_url
                if cloths_url:
                    items['cloths_url'] = cloths_url[0]
                if person_number:
                    items['person_number'] = person_number[0]
                # if product_id:
                #     yield scrapy.Request(url=self.comments_url.format(str(product_id[0]), str(self.count)),
                #                          callback=self.comments)
                # a scrapy.Request yielded here would fire its callback once per product parsed
                yield items
        except Exception:
            print("ERROR")
        # request again: the ajax-loaded data has to be requested here, because only after
        # collecting all the pids can the url be composed; next_half_parse does the parsing
        yield scrapy.Request(url=self.search_url.format(str(response.meta['search_page']), ",".join(pids)),
                             callback=self.next_half_parse)
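The final request is built from all the collected pids; the joining step looks like this (the URL template below is a hypothetical stand-in, since the spider's real `search_url` value is not shown in this excerpt):

```python
# hypothetical template standing in for the spider's self.search_url;
# the real template is defined elsewhere in the spider
search_url = "https://search.example.com/s_new.php?page={0}&show_items={1}"

pids = {'101', '202', '303'}   # data-pid values collected in parse_url
search_page = 2                # the even page number passed via meta

# sorted() makes the order deterministic; a set has no fixed iteration order
ajax_url = search_url.format(str(search_page), ",".join(sorted(pids)))
print(ajax_url)
```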

2. As can be seen from the last part of the code above, the final request parses the ajax-loaded half of the page, with next_half_parse as its callback. If the items from the first half needed to be completed in that callback, they would have to be passed along through meta; but since next_half_parse yields its own complete set of items, that isn't needed here.

The code is as follows:

# parse the asynchronously (ajax) loaded half of the page
def next_half_parse(self, response):
    if response.status == 200:
        print(response.url)
        items = JdSpiderItem()
        # from scrapy.shell import inspect_response; inspect_response(response, self)  # debugging helper (garbled in the original)
        # note: the xpath extractions that define `lis`, `cloths_url`, `img_url_1`,
        # `img_url_2`, `cloths_name`, `price`, `shop_id` and `person_number` were
        # lost from the original post; the if-bodies below are filled in by analogy
        # with parse_url above
        try:
            for li in lis:
                if cloths_url:
                    print(cloths_url[0])
                    items['cloths_url'] = cloths_url[0]
                if img_url_1:
                    print(img_url_1[0])
                    items['img_url'] = img_url_1[0]
                if img_url_2:
                    print(img_url_2[0])
                    items['img_url'] = img_url_2[0]
                if cloths_name:
                    items['cloths_name'] = cloths_name[0]
                if price:
                    items['price'] = price[0]
                if shop_id:
                    items['shop_url'] = "" + str(shop_id[0]) + ".html"  # base url stripped from the original
                if person_number:
                    items['person_number'] = person_number[0]
                yield items  # the data is complete at this point, so the items can be yielded
        except Exception:
            print("ERROR")

3. Of course, there is also setting up a request pool, MySQL storage, and IP proxies, all of which I covered in my previous blog posts, so I won't go into detail here.

Friends who want to see the source code can click here or download it locally.


People often complain that when a crawler breaks midway they have to start over. Why start from scratch? Here is how to avoid it, along with a few other useful settings (all go in the Scrapy configuration file):

  • JOBDIR=file_name — file_name names the directory where Scrapy persists the crawl state, so an interrupted crawl can be resumed.
  • DOWNLOAD_DELAY = 2 — sets the interval between requests, to prevent a ban.
  • RANDOMIZE_DOWNLOAD_DELAY = True — randomizes the delay between 0.5 and 1.5 times the set DOWNLOAD_DELAY, which guards against bans more effectively; it is usually used together with DOWNLOAD_DELAY.
  • ROBOTSTXT_OBEY = False — do not follow robots.txt. The default is True.
  • CONCURRENT_REQUESTS — sets the maximum number of concurrent requests; the default is 16, and you can raise it a bit according to your machine's configuration to speed up the crawl.
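Collected in one place, those options would look like this in settings.py (the values are the ones mentioned above; treat them as a starting point, not tuned numbers):

```python
# settings.py (fragment)

# resume an interrupted crawl: request queue and state are persisted here
JOBDIR = 'crawl_state'

# wait between requests to reduce the chance of being banned
DOWNLOAD_DELAY = 2

# randomize the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True

# do not obey robots.txt (Scrapy's default is True)
ROBOTSTXT_OBEY = False

# maximum number of concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS = 16
```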

