Crawling JD.com book review data with Python

  • 2020-04-02 14:02:13
  • OfStack

  JD.com book review pages carry a wealth of information, including the purchase date, title, author, and positive, neutral, and negative reviews. This example crawls the purchase dates using Python together with MySQL. The program is small, only about 100 lines, and I have added comments to explain each step:

from selenium import webdriver
from bs4 import BeautifulSoup
import re
import sys  # needed for sys.exit() when the database connection fails
import win32com.client
import threading, time
import MySQLdb

def mydebug():
    # Close the browser and stop the program
    driver.quit()
    exit(0)

def catchDate(s):
    """Page data extraction"""
    soup = BeautifulSoup(s, "html.parser")
    z = []
    global nowtimes

    m = soup.find_all("div", class_="date-buy")
    for obj in m:
        try:
            tmp = obj.find("br").contents
        except Exception as e:
            # No <br> inside this div, so skip it
            continue
        # contents is a list; keep it only if it is non-empty
        if tmp:
            z.append(tmp)
            nowtimes += 1
    return z

def getTimes(n, t):
    """Get current progress"""
    return "current progress: " + str(int(100 * n / t)) + "%"


# ----------------- | program start | -----------------
# Determine the category of books (category id -> category name)
cate = {"3273": "history", "3279": "psychology", "3276": "political and military", "3275": "Chinese ancient books", "3274": "philosophy of religion", "3277": "law", "3280": "culture", "3281": "social sciences"}

# Resume an interrupted crawl: enter the book id and page number to continue from
num1 = input("bookid: ")
num2 = int(input("pagenumber: "))

# Total number of review-page links to generate: 17355 * 20 = 347100
totaltimes = 347100.0
nowtimes = 0

# Open a webdriver browser object (PhantomJS and Chrome are alternatives)
# driver = webdriver.PhantomJS()
driver = webdriver.Ie(r'C:\Python27\Scripts\IEDriverServer')
# driver = webdriver.Chrome(r'C:\Python27\Scripts\chromedriver')

# Read the review pages stored in MySQL and crawl them
# Connect to the database
try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='jd')
except Exception as e:
    print(e)
    sys.exit()

# Get a cursor object
cursor = conn.cursor()
sql = "SELECT * FROM booknew ORDER BY pagenumber DESC"
cursor.execute(sql)
alldata = cursor.fetchall()

flag = 0
flag2 = 0

# If rows were returned, loop over them and crawl each review page,
# e.g. http://club.jd.com/review/10178500-1-154.html
if alldata:
    for rec in alldata:
        # rec[0] - bookid, rec[1] - cateid, rec[2] - pagenumber
        # Skip rows until the book to resume from is reached
        if rec[0] != str(num1) and flag == 0:
            continue
        else:
            flag = 1
        for p in range(num2, rec[2]):
            # After the first resumed book, later books start from page 1
            if flag2 == 0:
                num2 = 0
                flag2 = 1
            p += 1
            link = "http://club.jd.com/review/" + rec[0] + "-1-" + str(p) + ".html"
            # Fetch the web page
            driver.get(link)
            html = driver.page_source
            # Extract the purchase dates
            buydate = catchDate(html)
            # Write to the database
            for z in buydate:
                sql = "INSERT INTO LJJ (id, cateid, bookid, date) VALUES (NULL, '" + rec[0] + "', '" + rec[1] + "', '" + z[0] + "');"
                try:
                    cursor.execute(sql)
                except Exception as e:
                    print(e)
            conn.commit()
        print(getTimes(nowtimes, totaltimes))

driver.quit()
cursor.close()
conn.close()
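
The review URL pattern used in the loop (for example http://club.jd.com/review/10178500-1-154.html) and the resume logic can be pulled out into small helpers that are easy to check without opening a browser. The following is only a minimal sketch: `review_url` and `pages_to_fetch` are hypothetical names that do not appear in the original program, and the sketch assumes page numbers start at 1.

```python
def review_url(bookid, page):
    """Build a review-page URL following the pattern used in the listing."""
    return "http://club.jd.com/review/" + str(bookid) + "-1-" + str(page) + ".html"


def pages_to_fetch(last_done, total_pages):
    """Return the page numbers still to crawl after page `last_done`.

    Equivalent to the listing's range(num2, rec[2]) followed by p += 1.
    """
    return list(range(last_done + 1, total_pages + 1))


if __name__ == "__main__":
    # Resume book 10178500 after page 152 of 154
    for page in pages_to_fetch(152, 154):
        print(review_url("10178500", page))
```

Keeping the URL and page-range logic in plain functions makes it possible to verify the resume behaviour before starting a long crawl.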

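
The INSERT statement in the listing is assembled by string concatenation, which fails if a value contains a quote character. DB-API drivers such as MySQLdb also accept a statement with `%s` placeholders plus a separate tuple of values passed to `cursor.execute`. The sketch below assumes the same table and column names as the listing; `build_insert` is a hypothetical helper, not part of the original program.

```python
def build_insert(cateid, bookid, purchase_date):
    """Return a parameterized INSERT statement and its values.

    The table and column names are copied from the listing; the driver
    fills in the %s placeholders and handles quoting itself.
    """
    sql = "INSERT INTO LJJ (id, cateid, bookid, date) VALUES (NULL, %s, %s, %s)"
    params = (str(cateid), str(bookid), str(purchase_date))
    return sql, params


# Usage with an open MySQLdb cursor (not executed here):
#     sql, params = build_insert(rec[1], rec[0], z[0])
#     cursor.execute(sql, params)
```

Passing the values separately also makes the mapping between columns and values explicit, which is harder to see in a long concatenated string.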
