Python example: scraping data from a car website, parsing the HTML, and saving it to Excel
- 2020-04-02 13:18:57
- OfStack
1. Address of an automobile website
2. Inspecting the site in Firefox showed that its data is not served as JSON but as a plain HTML page
3. Use the PyQuery class from the pyquery library to parse the HTML
Page layout:
[Image: screenshot of the page]
import urllib2  # Python 2 standard library
from pyquery import PyQuery as pq

def get_dealer_info(self):
    """Obtain dealer information."""
    # CSS path to the table rows that hold the data, copied with Firefox's "copy CSS path" feature
    css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr '
    # Read the page
    page = urllib2.urlopen(self.entry_url).read()
    # The phone numbers in the page are separated by <br/> line breaks, which causes a problem when scraping:
    # if the text inside a tag contains <br/>, .text only returns the part before the br and the rest is lost
    # (lxml keeps the remainder in the br element's tail). Replacing each break with '&' first keeps the value whole.
    page = page.replace('<br />', '&')
    page = page.replace('<br/>', '&')
    # Parse the page with PyQuery (imported above as pq)
    d = pq(page)
    # List of dealers that is handed to the saver at the end
    dealer_list = []
    # Province of the most recent 6-column row, reused by the 5-column rows that follow it
    strp = u''
    for dealer_div in d(css_select):
        # Each match is a tr; the actual data is in the td tags inside it
        p = dealer_div.findall('td')
        # p is the collection of all td cells in this tr
        dealer = {}
        # This dictionary holds the information for one dealer before it is appended to the list
        # The branches below filter the rows: some rows are not in the format the final data needs
        # and are dropped. Adapt this block to your own requirements.
        if len(p) == 1:
            print('@')  # debug marker: single-cell row skipped
        elif len(p) == 6:
            # Full row: remember the province for the shorter rows that follow
            strp = p[0].text.strip()
            dealer[Constant.PROVINCE] = strp
            dealer[Constant.CITY] = p[1].text.strip()
            dealer[Constant.NAME] = p[2].text.strip()
            dealer[Constant.ADDRESSTYPE] = p[3].text.strip()
            dealer[Constant.ADDRESS] = p[4].text.strip()
            dealer[Constant.TELPHONE] = p[5].text.strip()
            dealer_list.append(dealer)
        elif len(p) == 5:
            # Row without a province cell; skip it if it is the table header
            # (compare against the header cell's text as it appears on the page)
            if p[0].text.strip() != u'provinces':
                dealer[Constant.PROVINCE] = strp
                dealer[Constant.CITY] = p[0].text.strip()
                dealer[Constant.NAME] = p[1].text.strip()
                dealer[Constant.ADDRESSTYPE] = p[2].text.strip()
                dealer[Constant.ADDRESS] = p[3].text.strip()
                dealer[Constant.TELPHONE] = p[4].text.strip()
                dealer_list.append(dealer)
        elif len(p) == 3:
            print('@@')  # debug marker: three-cell row skipped
    print('@@@')  # debug marker: all rows processed
    self.saver.add(dealer_list)
    self.saver.commit()
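The line-break problem that the replace() calls work around can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 and using the standard library's xml.etree.ElementTree as a stand-in for the lxml elements that pyquery yields (both expose the same text and tail attributes). The cell content is hypothetical and not taken from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical table cell: two phone numbers separated by a line break.
RAW_CELL = '<td>010-1111<br/>010-2222</td>'

cell = ET.fromstring(RAW_CELL)

# .text stops at the first child element, so the second number is missing.
before_break = cell.text
# The text after the break is stored as the "tail" of the br element.
after_break = cell.find('br').tail

# Workaround in the spirit of the article: replace the break before parsing.
# '&amp;' is used here because a bare '&' is not valid in strict XML;
# it is decoded back to '&' by the parser.
fixed = ET.fromstring(RAW_CELL.replace('<br/>', '&amp;'))

# Alternative that leaves the markup untouched: join all text fragments.
joined = '&'.join(cell.itertext())

print(before_break, after_break, fixed.text, joined)
```

Either approach ends up with both numbers in one string; replacing the break before parsing is simply the route the article takes.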
4. Finally, the code ran successfully, and the scraped data was stored in an Excel file
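The saver used in the code is not shown in the article, so the following is only a rough sketch of the last two stages under stated assumptions: it uses Python 3, takes rows that are already lists of cell strings, carries the province forward from 6-cell rows to the 5-cell rows after them, and writes CSV (which Excel can open) instead of a native Excel file. The field names, the function names normalise_rows and write_csv, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical column names; the article's Constant class is not shown.
FIELDS = ['province', 'city', 'name', 'address_type', 'address', 'telephone']

def normalise_rows(rows):
    """Turn lists of cell strings into dicts, carrying the province forward."""
    dealers = []
    province = ''
    for cells in rows:
        if len(cells) == 6:
            # Full row: remember the province for the shorter rows that follow.
            province = cells[0]
            values = list(cells)
        elif len(cells) == 5:
            # Province cell is missing: reuse the last one seen.
            values = [province] + list(cells)
        else:
            # Any other shape is not a dealer row and is dropped.
            continue
        dealers.append(dict(zip(FIELDS, values)))
    return dealers

def write_csv(dealers, handle):
    """Write the dealer dicts as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(dealers)

# Made-up sample data, only to exercise the two functions.
rows = [
    ['Province A', 'City A', 'Dealer A', 'Type 1', 'Road 1', '000-0000001'],
    ['City B', 'Dealer B', 'Type 1', 'Road 2', '000-0000002'],
    ['note row'],
]
dealers = normalise_rows(rows)
buffer = io.StringIO()
write_csv(dealers, buffer)
print(buffer.getvalue())
```

To produce a file rather than an in-memory string, the same write_csv call can be given a handle from open('dealers.csv', 'w', newline='', encoding='utf-8-sig'). The utf-8-sig encoding adds a byte order mark, which generally helps Excel detect UTF-8 text. Header-row filtering, which the article does by comparing the first cell's text, is left out of this sketch.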