Sample code for Python data fetching and analysis (Python + MongoDB)
- 2020-06-19 10:58:12
- OfStack
This article introduces Python data scraping and analysis, sharing the code with you as follows:
Modules used: requests, lxml, pymongo, time, BeautifulSoup
First, fetch the category pages for all products:
def step():
    try:
        headers = {
            .
        }
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        # Collect the category links (the selector was elided in the original)
        links = soup.find_all( Regular expression )
        for i in links:
            url2 = i.find_all('a')
            for j in url2:
                step1url = url + j['href']
                print(step1url)
                step2(step1url)
    except Exception as e:
        print(e)
While walking the categories, we need to determine whether the address we visit is a product page or yet another category page (so we check whether the page contains the marker that the if judgment looks for):
def step2(step1url):
    try:
        headers = {
            .
        }
        r = requests.get(step1url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        # The divTbl block only appears on pages that are still category lists
        a = soup.find('div', id='divTbl')
        if a:
            tds = soup.find_all('td', class_='S-ITabs')
            for i in tds:
                classifyurl = i.find_all('a')
                for j in classifyurl:
                    step2url = url + j['href']
                    #print(step2url)
                    step3(step2url)
        else:
            postdata(step1url)
    except Exception as e:
        print(e)
When the if judgment is true, we collect the next level of category URLs (back to step 1); otherwise we execute the postdata function, which captures the product addresses on the page:
def producturl(url):
    try:
        # doc is the page parsed with lxml; both xpath expressions
        # were elided in the original
        p1url = doc.xpath( Regular expression )
        for i in range(1, len(p1url) + 1):
            p2url = doc.xpath( Regular expression )
            if len(p2url) > 0:
                producturl = url + p2url[0].get('href')
                count = db[table].count_documents({'url': producturl})
                if count <= 0:
                    sn = getNewsn()
                    db[table].insert_one({"sn": sn, "url": producturl})
                    print(str(sn) + ' inserted successfully')
                else:
                    print('url exist')
    except Exception as e:
        print(e)
Here the product addresses are obtained and stored in MongoDB, with sn serving as a new id for each address.
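The dedupe-and-assign logic in producturl() can be sketched in pure Python, with the MongoDB collection simulated by a dict keyed by url; the names store_product and store are illustrative assumptions, and the length-based counter stands in for the (unshown) getNewsn function:

```python
def store_product(store, url):
    # Insert the url with a fresh sn only if it is not already present,
    # mirroring the count / insert branch in producturl().
    if url in store:
        return store[url]      # url exists: keep the sn assigned earlier
    sn = len(store) + 1        # stand-in for getNewsn()
    store[url] = sn
    return sn

store = {}
print(store_product(store, 'http://example.com/p/1'))  # 1
print(store_product(store, 'http://example.com/p/2'))  # 2
print(store_product(store, 'http://example.com/p/1'))  # 1 (already stored)
```

The real code gets the same effect from a count query followed by an insert; a unique index on url would make that race-free.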
Next, we look up each stored address in MongoDB by its sn index, visit it, parse and capture the product data, and update the data in the database.
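That fetch-by-sn loop might look like the sketch below; the collection is simulated with a list of dicts, and run_parser and the stand-in parse callback are assumed names, not part of the original code:

```python
def run_parser(records, parse):
    # Visit every stored record in sn order and parse its product page,
    # the way the real loop would walk the MongoDB collection.
    for rec in sorted(records, key=lambda r: r['sn']):
        parse(rec['sn'], rec['url'])

# Usage with a stand-in parser that just records what it was called with:
seen = []
run_parser(
    [{'sn': 2, 'url': 'http://example.com/b'},
     {'sn': 1, 'url': 'http://example.com/a'}],
    lambda sn, url: seen.append((sn, url)),
)
print(seen)  # [(1, 'http://example.com/a'), (2, 'http://example.com/b')]
```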
BeautifulSoup is the most commonly used module, but it is hard to use for value data rendered by js. For data in js I therefore recommend xpath, which requires parsing the page with lxml's document_fromstring() method first.
Be careful when capturing value data with xpath! If you'd like to know more about xpath, leave a comment below and I'll get back to you as soon as possible!
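A minimal sketch of that parse-then-xpath flow, using a tiny inline fragment instead of a fetched page (the markup and class names here are made up for illustration, not taken from the scraped site):

```python
from lxml import html

# A small stand-in for a downloaded product page:
page = '''
<div class="specs">
  <span class="key">Resistance</span><span class="val">10 kOhm</span>
  <span class="key">Tolerance</span><span class="val">1 %</span>
</div>
'''

# document_fromstring builds a full document tree even from a fragment;
# xpath can then pull out the text nodes directly.
doc = html.document_fromstring(page)
keys = doc.xpath("//span[@class='key']/text()")
vals = doc.xpath("//span[@class='val']/text()")
print(dict(zip(keys, vals)))  # {'Resistance': '10 kOhm', 'Tolerance': '1 %'}
```

With a real page you would pass r.content to document_fromstring instead of the inline string.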
def parser(sn, url):
    try:
        headers = {
            .
        }
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        dt = {}
        # partno
        a = soup.find("meta", itemprop="mpn")
        if a:
            dt['partno'] = a['content']
        # manufacturer
        b = soup.find("meta", itemprop="manufacturer")
        if b:
            dt['manufacturer'] = b['content']
        # description
        c = soup.find("span", itemprop="description")
        if c:
            dt['description'] = c.get_text().strip()
        # price: each row maps a quantity break to a unit price
        price = soup.find("table", class_="table table-condensed occalc_pa_table")
        if price:
            cost = {}
            for i in price.find_all('tr'):
                if len(i) > 1:
                    td = i.find_all('td')
                    key = td[0].get_text().strip().replace(',', '')
                    val = td[1].get_text().replace(u'\u20ac', '').strip()
                    if key and val:
                        cost[key] = val
            if cost:
                dt['cost'] = cost
                dt['currency'] = 'EUR'
        # quantity
        d = soup.find("input", id="ItemQuantity")
        if d:
            dt['quantity'] = d['value']
        # specs: pair up the <dt> labels with the <dd> values
        spec_div = soup.find("div", class_="row parameter-container")
        if spec_div:
            key1 = []
            val1 = []
            for k in spec_div.find_all('dt'):
                key = k.get_text().strip().strip('.')
                if key:
                    key1.append(key)
            for i in spec_div.find_all('dd'):
                val = i.get_text().strip()
                if val:
                    val1.append(val)
            specs = dict(zip(key1, val1))
            if specs:
                dt['specs'] = specs
        print(dt)
        if dt:
            db[table].update_one({'sn': sn}, {'$set': dt})
            print(str(sn) + ' updated successfully')
            time.sleep(3)
        else:
            error(str(sn) + '\t' + url)
    except Exception as e:
        error(str(sn) + '\t' + url)
        print("No data!")
Finally, with all the programs running, the value data is parsed, processed, and stored in the database.