Learn Python selenium automated Web crawler

  • 2020-07-21 09:00:56
  • OfStack

To get straight to the point: this article shows how Python selenium automatically controls the browser to grab data from web pages, including clicking buttons, jumping between pages, typing into a search box, storing values scraped from the page, and generating an automatic id in mongodb.

1. First, a brief introduction to Python selenium, an automated testing tool used to drive the browser and operate on web pages. Combined with BeautifulSoup for crawling it works seamlessly, and it gets around some of the unusual verification checks on foreign sites. For image captchas I have written captcha-cracking source code with a success rate of about 85%.

For details, you can ask in QQ group 607021567 (this is not advertising; a lot of Python resources are shared there, plus some big-data knowledge such as hadoop).

2. beautifulsoup needs no detailed introduction here; go straight to the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (BeautifulSoup official documentation)

3. Automatic generation of an id for mongodb. Every document stored in mongodb gets a built-in _id, but mongodb's _id is awkward for humans to read (and trivial for machines), so when storing data I prefer to generate my own new id to take charge of each record!

If you use mongodb in Python, you need to import the module: from pymongo import MongoClient, ASCENDING, DESCENDING. Managing the id is then your own responsibility!
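As a quick, minimal sketch of the idea (the collection name demo is made up purely for illustration; pymongo 2.x style calls are used to match the rest of this article):

from pymongo import MongoClient

mconn = MongoClient("mongodb://localhost")
db = mconn.test

# mongodb gives every document a built-in _id such as ObjectId('5f1650c8e2b7a93e4c8b4567'),
# which is hard for a human to read or quote by hand
auto_id = db.demo.insert({"name": "widget"})  # pymongo 2.x insert returns that ObjectId
print auto_id

# storing our own small incrementing number alongside the data is much friendlier
db.demo.insert({"ids": 1, "name": "widget"})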

Next, let's start the program and go straight to the example, step by step:

Importing the modules:


from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient,ASCENDING, DESCENDING
import time
import re

Each module has already been explained; re and requests were covered before, and they are core, you can't do without them!

First, a small example: automatically simulating a search on Taobao (source below):

First, let's go over selenium's element-locating methods (a short usage sketch follows the list):


find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
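
Here is a small hedged sketch of how a couple of these locators are called; the id 'kw' is a made-up placeholder, while the class name is the Taobao search box used in the example below:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.taobao.com/')

# locating an element by id and typing into it ('kw' is only an illustration)
# box = driver.find_element_by_id('kw')
# box.send_keys('Strong Man')

# the Taobao search box from the example below, located two different ways
box = driver.find_element_by_class_name('search-combobox-input')
box = driver.find_element_by_xpath('//input[@class="search-combobox-input"]')
box.send_keys('Strong Man')

driver.quit()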

Source:


from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient,ASCENDING, DESCENDING
import time
import re
def TaoBao():
 try:
  Taobaourl = 'https://www.taobao.com/'
  driver = webdriver.Chrome()
  driver.get(Taobaourl)
  time.sleep(5)  # always pause for a moment, otherwise your program is likely to be detected as a spider
  text = 'Strong Man'  # the text to type into the search box
  # send_keys returns None, so the click has to be a separate call on the search button
  driver.find_element_by_xpath('//input[@class="search-combobox-input"]').send_keys(text)
  driver.find_element_by_xpath('//button[@class="btn-search tb-bg"]').click()
  driver.quit()
 except Exception as e:
  print e
if __name__ == '__main__':
 TaoBao()

You can copy this directly and run it as is! I only used the xpath method because it is the most practical. The highlighted text (orange, assuming I am not color blind) is the element located on the page; you can find it yourself!

Next comes the combination with BeautifulSoup. So far we only see the opened web page and have no source code for it; for that you need the method "variable_name.page_source", which will make your dream come true, you know.


ht = driver.page_source
# print ht  # you can print it out here and take a look
soup = BeautifulSoup(ht,'html.parser')
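
As an aside, instead of a fixed time.sleep() before reading page_source, selenium's explicit waits can pause only until a known element shows up. This is just a hedged alternative sketch, not part of the original script; the class name in the condition is the Taobao search box from the example above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.taobao.com/')

# wait at most 10 seconds for the search box to appear before grabbing the source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'search-combobox-input'))
)
ht = driver.page_source
driver.quit()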

What follows is some BeautifulSoup syntax for working with the resulting data structures and collections; the final example at the end contains the detailed fetching operations!

Alright, let's cover the simplest positional grab:


soup = BeautifulSoup(ht,'html.parser')
a = soup.find('table',id="ctl00_ContentMain_SearchResultsGrid_grid")
if a:  # always add this check; the page you visit may not contain this element, and the program would otherwise stop

For the class attribute you must write class_, remember that! (A small example follows.)
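For instance, with the row class that appears in the full source further down:

soup = BeautifulSoup(ht,'html.parser')
# 'class' is a Python keyword, so BeautifulSoup takes the class_ keyword argument instead
rows = soup.find_all('tr', class_="SearchResultsRowOdd")
# the attrs dict spelling is equivalent
rows = soup.find_all('tr', attrs={"class": "SearchResultsRowOdd"})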

Ha ha ha! Now for mongodb in detail. First you need the import: from pymongo import MongoClient, ASCENDING, DESCENDING

As for the Python syntax for mongodb, you need to define the database and the collection name as global variables, along with a connection to your own machine:


if __name__ == '__main__':
 global db  # global database handle
 global table  # global collection name
 table = 'mouser_product'
 mconn = MongoClient("mongodb://localhost")  # mongodb address
 db = mconn.test
 db.authenticate('test','test')  # username and password
 TaoBao()

After defining these, we need our method that hands out a new id for each piece of data:


db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
dic = db.sn.find({"_id": table}).limit(1)
return dic[0].get("currentIdValue")

This method is generic, so just remember the mongodb syntax! Because there is a return value here, this is a method body; you don't need to worry too much about how it is implemented, just understand that it is called in the process of storing the data.


count = db[table].find({'data': data}).count()  # check whether the record is already in the database
if count <= 0:  # it is not there yet
 ids = getNewsn()  # ids is our newly defined id; it starts at 1 and grows from there
 db[table].insert({"ids": ids, "data": data})

In this way our data goes straight into the mongodb database. This also explains why I like mongodb so much for big data: it is small and fast!

Finally, a full example source:


from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient,ASCENDING, DESCENDING
import time
import re
def parser():
 try:
  f = open('sitemap.txt','r')
  for i in f.readlines():
   sorturl = i.strip()
   driver = webdriver.Firefox()
   driver.get(sorturl)
   time.sleep(50)
   ht = driver.page_source
   #pageurl(ht)
   soup = BeautifulSoup(ht,'html.parser')
   a = soup.find('a',class_="first-last")
   if a:
    pagenum = int(a.get_text().strip())
    print pagenum
    for page in xrange(1,pagenum):
     element = driver.find_element_by_xpath('//a[@id="ctl00_ContentMain_PagerTop_%s"]' %page)
     element.click()
     html = driver.page_source  # page_source belongs to the driver, not to the clicked element
     pageurl(html)
     time.sleep(50)
   driver.quit()  # close the browser only after every page for this url has been visited
 except Exception as e:
  print e
def pageurl(ht):
 try:
  soup = BeautifulSoup(ht,'html.parser')
  a = soup.find('table',id="ctl00_ContentMain_SearchResultsGrid_grid")
  if a:
   tr = a.find_all('tr',class_="SearchResultsRowOdd")
   if tr:
    for i in tr:
     td = i.find_all('td')
     if td:
      url = td[2].find('a')
      if url:
       producturl = url['href']  # prepend the site's base url here if the href is relative
       print producturl
       count = db[table].find({"url":producturl}).count()
       if count<=0:
        sn = getNewsn()
        db[table].insert({"sn":sn,"url":producturl})
        print str(sn) + ' inserted successfully'
        time.sleep(3)
       else:
        print 'exists url'
   tr1 = a.find_all('tr',class_="SearchResultsRowEven")
   if tr1:
    for i in tr1:
     td = i.find_all('td')
     if td:
      url = td[2].find('a')
      if url:
       producturl = url['href']  # prepend the site's base url here if the href is relative
       print producturl
       count = db[table].find({"url":producturl}).count()
       if count<=0:
        sn = getNewsn()
        db[table].insert({"sn":sn,"url":producturl})
        print str(sn) + ' inserted successfully'
        time.sleep(3)
       else:
        print 'exists url'
       #time.sleep(5)
 except Exception as e:
  print e
def getNewsn():
 db.sn.find_and_modify({"_id": table}, update={"$inc": {'currentIdValue': 1}}, upsert=True)
 dic = db.sn.find({"_id": table}).limit(1)
 return dic[0].get("currentIdValue")
if __name__ == '__main__': 
 global db     
 global table
 table = 'mous_product'
 mconn=MongoClient("mongodb://localhost")
 db=mconn.test
 db.authenticate('test','test')
 parser()

This chunk of code gets past a foreign site's tedious verification interface; I am honestly speechless about it! Cracking it just takes practice! This is the complete source code, nothing deleted, all written by hand!

