Sample code for Python data fetching and analysis (Python + MongoDB)

  • 2020-06-19 10:58:12
  • OfStack

This article walks through a Python data-scraping and analysis example and shares the code as follows:

Modules used: requests, lxml, pymongo, time, BeautifulSoup
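
The snippets below use a MongoDB handle db[table], a base address url and the modules above without ever showing how they are set up. A minimal setup sketch, assuming a local MongoDB instance and placeholder database, collection and site names (none of which come from the original article), could be:


import time
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Assumed setup, not shown in the article: a local MongoDB instance and
# placeholder names for the database, the collection used as db[table],
# and the base address of the site being crawled.
client = MongoClient('mongodb://localhost:27017/')
db = client['products']      # hypothetical database name
table = 'items'              # hypothetical collection name
url = 'http://example.com'   # hypothetical base address used by step() and step2()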

First, get the category-page URLs for all of the products:


def step():
 try:
  headers = {
    . 
   }
  r = requests.get(url, headers=headers, timeout=30)  # url: the site's base/category index address
  html = r.content
  soup = BeautifulSoup(html, "lxml")
  blocks = soup.find_all( Regular expression )  # category blocks (selector elided in the original)
  for i in blocks:
   links = i.find_all('a')
   for j in links:
     step1url = url + j['href']  # build the absolute category URL
     print(step1url)
     step2(step1url)
 except Exception as e:
  print(e)

While walking the categories, we need to determine whether the address we visit is a product page or yet another category page (so the code checks whether the page contains the marker element used in the if judgment):


def step2(step1url):
 try:
  headers = {
    . 
   }
  r = requests.get(step1url, headers=headers, timeout=30)
  html = r.content
  soup = BeautifulSoup(html, "lxml")
  a = soup.find('div', id='divTbl')  # marker element: present only on category pages
  if a:
   tds = soup.find_all('td', class_='S-ITabs')
   for i in tds:
    classifyurl = i.find_all('a')
    for j in classifyurl:
      step2url = url + j['href']  # url: the site's base address, as in step()
      #print(step2url)
      step3(step2url)  # step3: continue down the category tree (not shown)
  else:
   postdata(step1url)

If the if judgment is true, the sub-category URLs are fetched and we are back at step 1; otherwise the postdata function is executed to capture the product addresses on that page.
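
The postdata function referenced above is never listed in the article. A minimal sketch of that glue, assuming it only downloads the product-listing page, parses it with lxml so the xpath calls below have a document to work on, and hands everything to producturl, might be:


import lxml.html
import requests

def postdata(step1url):
 # Hypothetical wiring, not part of the original article: fetch the
 # product-listing page, parse it into an lxml document and pass it on
 # to producturl() below, which runs the xpath queries.
 r = requests.get(step1url, timeout=30)
 doc = lxml.html.document_fromstring(r.content)
 producturl(step1url, doc)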


def producturl(url, doc):
 # doc: the lxml document for the listing page (see the postdata sketch above)
 try:
  p1url = doc.xpath( Regular expression )
  for i in range(1, len(p1url) + 1):
   p2url = doc.xpath( Regular expression )
   if len(p2url) > 0:
    product_url = url + p2url[0].get('href')
    count = db[table].count_documents({'url': product_url})
    if count <= 0:
      sn = getNewsn()  # next sequential id (helper not shown in the article)
      db[table].insert_one({"sn": sn, "url": product_url})
      print(str(sn) + ' inserted successfully')
    else:
      print('url exists')
 except Exception as e:
  print(e)

Here the product addresses are captured and stored in MongoDB, with sn serving as the new id for each address.

Next, we read the addresses back out of MongoDB through the new sn index, visit each one, parse and capture the product data, and update the records in the database.
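
A minimal dispatch loop for this step, assuming the collection holds the {sn, url} documents inserted above and that parser (defined below) does the per-product work, could look like this:


# Hypothetical driver loop, not shown in the article: read every stored
# {sn, url} pair back out of MongoDB and hand it to parser() below.
for record in db[table].find({}, {'sn': 1, 'url': 1}):
 parser(record['sn'], record['url'])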

BeautifulSoup is the most heavily used module here, but it is difficult to use BeautifulSoup for value data that only exists in the page's js. For that data I recommend xpath instead, which requires parsing the page with lxml first (for example with lxml.html.document_fromstring()).

You must also be careful when capturing value data with xpath! If you'd like to know more about xpath, leave a comment below and I'll get back to you as soon as possible!
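
As an illustration of that lxml/xpath route (the URL and the pattern below are placeholders, not taken from the original site), pulling a value that is embedded in an inline script block might look like this:


import re
import lxml.html
import requests

# Illustrative only: values written into an inline <script> block are awkward
# to reach with BeautifulSoup's tag navigation, so parse the page with lxml
# and search the script text directly. URL and pattern are hypothetical.
r = requests.get('http://example.com/product', timeout=30)
doc = lxml.html.document_fromstring(r.content)
for s in doc.xpath('//script/text()'):  # all inline script bodies
 m = re.search(r'"price"\s*:\s*"([^"]+)"', s)
 if m:
  print(m.group(1))
  break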


def parser(sn,url):
 try:
  headers = {
    . 
   }
  r = requests.get(url, headers=headers, timeout=30)
  html = r.content
  soup = BeautifulSoup(html, "lxml")
  dt = {}
  #partno
  a = soup.find("meta", itemprop="mpn")
  if a:
   dt['partno'] = a['content']
  #manufacturer
  b = soup.find("meta", itemprop="manufacturer")
  if b:
   dt['manufacturer'] = b['content']
  #description
  c = soup.find("span", itemprop="description")
  if c:
   dt['description'] = c.get_text().strip()
  #price table: each row is a quantity break -> unit price
  price = soup.find("table", class_="table table-condensed occalc_pa_table")
  if price:
   cost = {}
   for i in price.find_all('tr'):
    if len(i) > 1:
     td = i.find_all('td')
     key = td[0].get_text().strip().replace(',', '')
     val = td[1].get_text().replace(u'\u20ac', '').strip()  # strip the euro sign
     if key and val:
      cost[key] = val
   if cost:
    dt['cost'] = cost
    dt['currency'] = 'EUR'

  #quantity
  d = soup.find("input", id="ItemQuantity")
  if d:
   dt['quantity'] = d['value']
  #specs: pair up the <dt> labels with the <dd> values
  e = soup.find("div", class_="row parameter-container")
  if e:
   key1 = []
   val1 = []
   for k in e.find_all('dt'):
    key = k.get_text().strip().strip('.')
    if key:
     key1.append(key)
   for i in e.find_all('dd'):
    val = i.get_text().strip()
    if val:
     val1.append(val)
   specs = dict(zip(key1, val1))
   if specs:
    dt['specs'] = specs
    print(dt)
  if dt:
   db[table].update_one({'sn': sn}, {'$set': dt})
   print(str(sn) + ' updated successfully')
   time.sleep(3)
  else:
   error(str(sn) + '\t' + url)
 except Exception:
  error(str(sn) + '\t' + url)
  print('No data!')

Finally, run all of the functions together: the value data is parsed, processed and stored in the database.

