Using Python to parse an XML file into HTML files

  • 2020-06-19 10:51:15
  • OfStack

As the title says, this script parses an XML file into HTML files. My environment is Python 2.7 on Windows, and I use the Wing IDE 6.0 development tool.

1. First, I wrote a simple XML file to serve as the source file for parsing

Here are the contents of this file, website.xml:


<website>
<page name="index" title="fuckyou">
	<h1>welcome to</h1>
	<p>this is a moment</p>
<ul>
<li><a href="shouting.html" rel="external nofollow" >Shouting</a></li>
</ul>
</page>
<page name="shouting" title="mother">
<h1>My name is likeyou</h1>
</page>
</website>

Explanation: each page element corresponds to one HTML file. There are two page elements, so they are resolved into two HTML files, index.html and shouting.html. The <a> link in index.html points to shouting.html, so following it displays the contents of that file.

2. The Python parsing code


#!D:\Python27\python.exe
# -*- coding: utf-8 -*-
from xml.sax import parse
from xml.sax.handler import ContentHandler


class PageCreate(ContentHandler):
    # True while the parser is inside a <page> element
    pagethrough = False

    def startElement(self, name, attrs):
        if name == 'page':
            # A new page starts: open <name>.html and write the HTML header
            self.pagethrough = True
            self.out = open(attrs['name'] + '.html', 'w')
            self.out.write('<html>\n<head>\n')
            self.out.write('<title>%s</title>\n' % attrs['title'])
            self.out.write('</head>\n<body>\n')
        elif self.pagethrough:
            # Any other tag inside a page is copied through with its attributes.
            # str is the attribute name here (it shadows the built-in in this loop).
            self.out.write('<' + name)
            for str, val in attrs.items():
                self.out.write(' %s="%s"' % (str, val))
            self.out.write('>')

    def endElement(self, name):
        if name == 'page':
            # The page is finished: write the HTML footer and close the file
            self.out.write('</body>\n</html>')
            self.pagethrough = False
            self.out.close()
        elif self.pagethrough:
            # Closing tag of an element inside the current page
            self.out.write('</' + name + '>')

    def characters(self, content):
        # Text between tags is copied only while inside a page
        if self.pagethrough:
            self.out.write(content)


parse('D:\\pyproject\\file\\website.xml', PageCreate())

Code explanation:

The code calls the parse function from xml.sax and passes it an instance of a handler class, PageCreate, which inherits from ContentHandler and overrides the startElement, endElement, and characters methods.

startElement is called whenever the parser finds an opening tag in the XML file, such as <a> or <h1>. The pagethrough flag records whether the element belongs to the current page, because xml.sax only reports tags; it does not care whether you are inside a page. The <html>, <head>, and <body> tags do not come from the XML; the handler writes them itself when a page begins.

Note that attrs stores the attributes of the tag. For example, for a <page> element with name="shouting" or name="index", attrs holds that name attribute, and the values shouting and index are used as the file names of the HTML files. In the same way, the href=... attribute inside <a> is read from attrs, stored in the str and val variables as the attribute name and value, and written to the file via write.
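To see the order in which these callbacks fire, the following is a minimal sketch that records the events instead of writing HTML. The class name EventLogger and the sample string are made up for illustration and are not part of the original script; parseString is used so that no file on disk is needed.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical helper: records each SAX callback instead of writing HTML."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the trace short
        if content.strip():
            self.events.append(('text', content))


# Illustrative input, not the article's website.xml
SAMPLE = b'<page name="index" title="demo"><h1>welcome to</h1></page>'

logger = EventLogger()
parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

The trace starts with the opening page tag and its attributes and ends with the closing page tag, which is the pairing the handler relies on to open and close each output file. Note that a parser may deliver one run of text in several characters calls, so a handler should not assume one call per text node.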

endElement is called when the parser reaches a closing tag such as </h1>. If the closing tag is </page>, the page is finished, so the handler writes the closing </body> and </html> tags and closes the file. Otherwise, it writes the closing tag of an element inside the current page.

characters receives the text found between an opening tag and its closing tag and writes it to the file of the current page.
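One detail worth knowing: the parser hands over text and attribute values with entities already decoded, so an &amp; in the XML arrives as a plain &. The script writes these values back unchanged, which is fine for the sample file but could produce invalid HTML for other input. A possible refinement, sketched here with the standard xml.sax.saxutils helpers, escapes them before writing. The function names open_tag and text_node are hypothetical and do not appear in the original code.

```python
from xml.sax.saxutils import escape, quoteattr


def open_tag(name, attrs):
    """Hypothetical helper: build an opening tag with safely quoted attribute values."""
    parts = ['<' + name]
    for key, val in attrs.items():
        # quoteattr adds the surrounding quotes and escapes special characters
        parts.append(' %s=%s' % (key, quoteattr(val)))
    parts.append('>')
    return ''.join(parts)


def text_node(content):
    """Escape &, < and > so that parsed text is safe to write into HTML."""
    return escape(content)


# Illustrative values, not taken from website.xml
print(open_tag('a', {'href': 'list.html?a=1&b=2'}))
print(text_node('Tom & Jerry <3'))
```

In the handler, startElement could write the result of open_tag(name, attrs) and characters could write text_node(content) in place of the raw values.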

Finally, after running the Python code, two HTML files, shouting.html and index.html, are generated. Because the script opens them with relative names, they appear in the script's working directory.
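To try the idea without the D:\ path, here is a rough, self-contained sketch that parses an in-memory XML string and writes the pages into a temporary directory. The class name PageWriter, the out_dir parameter, and the shortened sample are assumptions made for this illustration; the logic follows the same start, end, and characters pattern as the script above.

```python
import os
import tempfile
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class PageWriter(ContentHandler):
    """Sketch of the same handler idea, writing pages into a chosen directory."""

    def __init__(self, out_dir):
        ContentHandler.__init__(self)
        self.out_dir = out_dir  # hypothetical parameter, not in the original
        self.out = None         # open file for the current page, if any

    def startElement(self, name, attrs):
        if name == 'page':
            path = os.path.join(self.out_dir, attrs['name'] + '.html')
            self.out = open(path, 'w')
            self.out.write('<html>\n<head>\n<title>%s</title>\n</head>\n<body>\n'
                           % attrs['title'])
        elif self.out is not None:
            pairs = ''.join(' %s="%s"' % (k, v) for k, v in attrs.items())
            self.out.write('<%s%s>' % (name, pairs))

    def endElement(self, name):
        if name == 'page':
            self.out.write('</body>\n</html>')
            self.out.close()
            self.out = None
        elif self.out is not None:
            self.out.write('</%s>' % name)

    def characters(self, content):
        if self.out is not None:
            self.out.write(content)


# Shortened, illustrative version of website.xml
SAMPLE = b"""<website>
<page name="index" title="home"><h1>welcome to</h1></page>
<page name="shouting" title="shout"><h1>My name is likeyou</h1></page>
</website>"""

out_dir = tempfile.mkdtemp()
parseString(SAMPLE, PageWriter(out_dir))
generated = sorted(os.listdir(out_dir))
print(generated)
```

Using the open file itself as the "inside a page" marker replaces the separate pagethrough flag; either approach works, and this one simply keeps the two pieces of state from drifting apart.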

