Crawling JD.com book review data with Python
- 2020-04-02 14:02:13
- OfStack
JD.com book reviews contain a wealth of information, including the purchase date, title, author, and positive, neutral, and negative reviews. Taking the purchase date as an example, this article implements a crawler with Python and MySQL. The program is small, only about 100 lines, and the comments in the code explain each step:
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import sys
import win32com.client
import threading, time
import MySQLdb
def mydebug():
    driver.quit()
    exit(0)
def catchDate(s):
    """Page data extraction"""
    soup = BeautifulSoup(s)
    z = []
    global nowtimes
    m = soup.findAll("div", class_="date-buy")
    for obj in m:
        try:
            tmp = obj.find("br").contents
        except Exception as e:
            continue
        # tmp is a list, so test for emptiness instead of comparing with ""
        if tmp:
            z.append(tmp)
            nowtimes += 1
    return z
def getTimes(n, t):
    """Get current progress"""
    return "current progress: " + str(int(100 * n / t)) + "%"
# ------------------| program start |------------------
# Book categories
cate = {"3273": "history", "3279": "psychology", "3276": "politics and military", "3275": "Chinese ancient books", "3274": "philosophy and religion", "3277": "law", "3280": "culture", "3281": "social sciences"}
# Resume crawling from a given book ID and page number
num1 = input("bookid: ")
num2 = input("pagenumber: ")
# Generate links for the books: 17355 * 20 = 347,100 fetches in total
totaltimes = 347100.0
nowtimes = 0
# Open the webdriver browser object (PhantomJS, IE, or Chrome)
# driver = webdriver.PhantomJS()
driver = webdriver.Ie(r'C:\Python27\Scripts\IEDriverServer')
# driver = webdriver.Chrome(r'C:\Python27\Scripts\chromedriver')
# Read the review pages recorded in MySQL and fetch them
# Connect to the database
try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='jd')
except Exception as e:
    print(e)
    sys.exit()
# Get a cursor object
cursor = conn.cursor()
sql = "SELECT * FROM booknew ORDER BY pagenumber DESC"
cursor.execute(sql)
alldata = cursor.fetchall()
flag = 0
flag2 = 0
# If rows were returned, loop over them; example page: http://club.jd.com/review/10178500-1-154.html
if alldata:
    for rec in alldata:
        # rec[0]: bookid, rec[1]: cateid, rec[2]: pagenumber
        if rec[0] != str(num1) and flag == 0:
            # Skip books until the resume point is reached
            continue
        else:
            flag = 1
        for p in range(num2, rec[2]):
            if flag2 == 0:
                # Only the first book resumes mid-way; later books start at page 1
                num2 = 0
                flag2 = 1
            p += 1
            link = "http://club.jd.com/review/" + rec[0] + "-1-" + str(p) + ".html"
            # Fetch the web page
            driver.get(link)
            html = driver.page_source
            # Extract the purchase dates
            buydate = catchDate(html)
            # Write to the database
            for z in buydate:
                # Let the driver quote the values; rec[1] is cateid and rec[0] is bookid
                sql = "INSERT INTO ljj (id, cateid, bookid, date) VALUES (NULL, %s, %s, %s)"
                try:
                    cursor.execute(sql, (rec[1], rec[0], str(z[0])))
                except Exception as e:
                    print(e)
            conn.commit()
            print(getTimes(nowtimes, totaltimes))
driver.quit()
cursor.close()
conn.close()
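
The database write at the end of the loop can be tried on its own, without a browser or a MySQL server. The following is only a minimal sketch under stated assumptions: it is written for Python 3, it uses the standard-library sqlite3 module as a stand-in for MySQLdb, the helper name save_dates is hypothetical, and the sample dates are made up. The table layout mirrors the ljj columns used in the program. Note that sqlite3 uses ? as its placeholder, while MySQLdb uses %s.

```python
import sqlite3


def save_dates(conn, bookid, cateid, dates):
    """Insert one row per purchase date (hypothetical helper).

    The values are passed as parameters, so the driver handles quoting
    and a stray quote in the scraped text cannot break the statement.
    """
    rows = [(cateid, bookid, d) for d in dates]
    conn.executemany(
        "INSERT INTO ljj (cateid, bookid, date) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


# In-memory database standing in for the MySQL 'jd' database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ljj ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "cateid TEXT, bookid TEXT, date TEXT)"
)

# Made-up sample data; the second value contains a quote on purpose.
inserted = save_dates(conn, "10178500", "3273", ["2014-05-01", "it's 2014-05-02"])
stored = conn.execute("SELECT cateid, bookid, date FROM ljj ORDER BY id").fetchall()
print(inserted, stored)
```

Binding values through placeholders instead of concatenating them into the SQL string keeps scraped text from being interpreted as SQL, which matters here because the dates come straight from a web page.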