Python example: scraping data from a car website by parsing HTML and saving it to Excel

  • 2020-04-02 13:18:57
  • OfStack

1. Address of an automobile website

2. Inspecting the site in Firefox showed that its data is not served as JSON but as a plain HTML page

3. Use the PyQuery class from the pyquery library to parse the HTML

Page style:

[Screenshot of the page: //files.jb51.net/file_images/article/201312/20131204142751.jpg]


import urllib2  # Python 2 standard library (urllib.request in Python 3)
from pyquery import PyQuery as pq

def get_dealer_info(self):
        """Obtain dealer information."""
        css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr '
        # CSS path copied automatically in Firefox; it locates the table rows that hold the data
        page = urllib2.urlopen(self.entry_url).read()
        # Read page 
        page = page.replace('<br />','&')
        page = page.replace('<br/>','&')
        # Phone numbers on the page are separated by <br/> line breaks, which causes a problem when scraping.
        # The problem: if the text inside a tag contains <br/>, .text returns only the part before the <br/> and the part after it is lost (it belongs to the br element's tail). Replacing the breaks with '&' first keeps the whole value in one piece.
        d = pq(page)
        # Parse the page with PyQuery; pq is PyQuery, imported with: from pyquery import PyQuery as pq
        dealer_list = []
        # List of dealer records that is handed to the saver at the end
        for dealer_div in d(css_select):
            # Each match is a tr element; the actual data is in its td cells
            p = dealer_div.findall('td')
            # p is the list of all td elements inside this tr
            dealer = {}
            # Dictionary holding one dealer's information; it is appended to the list
            if len(p)==1:
                # The if branches filter the rows: some rows do not have the format required for the final data and must be skipped, so adapt this block to your own requirements
                print '@'
            elif len(p)==6 :
                # Remember the province: the 5-cell rows that follow belong to it but omit this cell
                strp = p[0].text.strip()

                dealer[Constant.PROVINCE] = strp
                dealer[Constant.CITY] = p[1].text.strip()
                dealer[Constant.NAME] = p[2].text.strip()
                dealer[Constant.ADDRESSTYPE] = p[3].text.strip()
                dealer[Constant.ADDRESS] = p[4].text.strip()
                dealer[Constant.TELPHONE] = p[5].text.strip()
                dealer_list.append(dealer)  
            elif len(p)==5:
                # Skip the header row, whose first cell holds the province column title
                if p[0].text.strip() != u'provinces':
                    dealer[Constant.PROVINCE] = strp
                    dealer[Constant.CITY] = p[0].text.strip()
                    dealer[Constant.NAME] = p[1].text.strip()
                    dealer[Constant.ADDRESSTYPE] = p[2].text.strip()
                    dealer[Constant.ADDRESS] = p[3].text.strip()
                    dealer[Constant.TELPHONE] = p[4].text.strip()
                    dealer_list.append(dealer)
            elif len(p)==3:
                print '@@'
        print '@@@'
        self.saver.add(dealer_list)
        self.saver.commit()
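
The method above depends on a class, a Constant module and a saver that are not shown, so it cannot be run on its own. As a rough, self-contained miniature of the same steps, the sketch below uses only the Python 3 standard library. It makes several assumptions: xml.etree.ElementTree stands in for pyquery (pyquery is built on lxml, whose elements follow the same text/tail model), the table fragment and dictionary keys are invented for illustration, and ';' replaces the article's '&' separator because ElementTree's strict XML parser rejects a bare '&'.

```python
import xml.etree.ElementTree as ET

# Separator substituted for <br/>. The article uses '&'; ';' is used here
# only because the strict XML parser in this sketch rejects a bare '&'.
SEP = ';'

# Hypothetical table fragment: the first row of a province has 6 cells,
# later rows of the same province omit the province cell (5 cells).
SAMPLE = (
    '<table>'
    '<tr><td>Dealer list</td></tr>'
    '<tr><td>Hebei</td><td>Shijiazhuang</td><td>Dealer A</td><td>4S</td>'
    '<td>1 Example Rd</td><td>0311-1111<br/>0311-2222</td></tr>'
    '<tr><td>Baoding</td><td>Dealer B</td><td>4S</td>'
    '<td>2 Example Rd</td><td>0312-3333</td></tr>'
    '</table>'
)

# The <br/> problem in isolation: .text stops at the first child element,
# and the text after <br/> is stored in that element's .tail.
raw_td = ET.fromstring('<td>0311-1111<br/>0311-2222</td>')
before_br = raw_td.text      # text in front of <br/>
after_br = raw_td[0].tail    # text behind <br/>


def cell_text(td):
    """Return the stripped text of a cell, or '' for an empty cell."""
    return (td.text or '').strip()


def parse_dealers(html):
    """Turn the table rows into a list of dealer dictionaries."""
    # Replace the line breaks first so each phone cell is a single text node.
    html = html.replace('<br />', SEP).replace('<br/>', SEP)
    table = ET.fromstring(html)
    dealers = []
    province = None  # carried over from the last 6-cell row
    for tr in table.iter('tr'):
        cells = [cell_text(td) for td in tr.findall('td')]
        if len(cells) == 6:
            province = cells[0]
            rest = cells[1:]
        elif len(cells) == 5 and province is not None:
            rest = cells
        else:
            continue  # title rows and other rows that do not fit the format
        city, name, address_type, address, phone = rest
        dealers.append({
            'province': province, 'city': city, 'name': name,
            'address_type': address_type, 'address': address, 'phone': phone,
        })
    return dealers


dealers = parse_dealers(SAMPLE)
for dealer in dealers:
    print(dealer['province'], dealer['city'], dealer['phone'])
```

Carrying the province in a local variable plays the same role as strp in the method above: rows that omit the province cell inherit it from the most recent full row.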

4. Finally, the code ran successfully: the data was retrieved and stored in an Excel file
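
The saver object that writes the Excel file is not shown in this article. As a stand-in, and not the author's implementation, the following Python 3 sketch writes dealer records to CSV with the standard csv module, a format that Excel can open. The field names, the write_dealers helper and the sample record are assumptions made for illustration.

```python
import csv
import io

# Hypothetical column order; the article's Constant names are not shown.
FIELDS = ['province', 'city', 'name', 'address_type', 'address', 'phone']


def write_dealers(dealers, fileobj):
    """Write a header row and one row per dealer to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(dealers)


# Invented sample record, shaped like the dictionaries built while parsing.
sample_dealers = [{
    'province': 'Hebei', 'city': 'Shijiazhuang', 'name': 'Dealer A',
    'address_type': '4S', 'address': '1 Example Rd',
    'phone': '0311-1111;0311-2222',
}]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_dealers(sample_dealers, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a real file, the same function can be given a file opened with open('dealers.csv', 'w', newline='', encoding='utf-8-sig'); the utf-8-sig encoding adds a byte order mark, which helps Excel recognize UTF-8 text.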

