Learn more about the XML tools in Python

2020-05-09 18:45:43
OfStack

Module: xmllib

xmllib is a non-validating, low-level parser. An application programmer using xmllib subclasses the XMLParser class and provides methods for handling document elements, such as specific or generic tags, or character entities. The use of xmllib has not changed from Python 1.5x to Python 2.0+; in most cases, the better option is to use SAX, which is also stream-oriented and is more standard across languages and developers.

The examples in this article are the same as in the previous column: they include a DTD called quotations.dtd and a document for that DTD, sample.xml (see resources for an archive of the files mentioned in this article). The following code shows the first few lines of each quotation in sample.xml and generates very simple ASCII indicators for unknown tags and entities. The parsed text is processed as a continuous stream, and any accumulators used are the responsibility of the programmer (such as a string for the character data (#PCDATA) inside a tag, or a list or dictionary of the tags encountered).
Listing 1: try_xmllib.py


import xmllib, string

class QuotationParser(xmllib.XMLParser):
    """Crude xmllib extractor for quotations.dtd document"""

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.thisquote = ''             # quotation accumulator

    def handle_data(self, data):
        self.thisquote = self.thisquote + data

    def syntax_error(self, message):
        pass

    def start_quotations(self, attrs):  # top level tag
        print '--- Begin Document ---'

    def start_quotation(self, attrs):
        print 'QUOTATION:'

    def end_quotation(self):
        print string.join(string.split(self.thisquote[:230]))+'...',
        print '('+str(len(self.thisquote))+' bytes)\n'
        self.thisquote = ''

    def unknown_starttag(self, tag, attrs):
        self.thisquote = self.thisquote + '{'

    def unknown_endtag(self, tag):
        self.thisquote = self.thisquote + '}'

    def unknown_charref(self, ref):
        self.thisquote = self.thisquote + '?'

    def unknown_entityref(self, ref):
        self.thisquote = self.thisquote + '#'

if __name__ == '__main__':
    parser = QuotationParser()
    for c in open("sample.xml").read():
        parser.feed(c)
    parser.close()

Validation

One reason you might want to look to the future of standard XML support is that you need validation along with parsing. Unfortunately, the standard Python 2.0 XML package does not include a validating parser.

xmlproc is a Python-native parser that performs nearly complete validation. If you need a validating parser, xmlproc is currently the only option for Python. In addition, xmlproc provides a variety of advanced and test interfaces that other parsers do not.

Select a parser

If you decide to use the Simple API for XML (SAX) -- which you should for anything complex, since most of the other tools are built on it -- much of the work of sorting out parsers is done for you. The xml.sax module contains a facility for automatically selecting the "best" parser. In a standard Python 2.0 installation, the only parser available to choose is expat, a fast extension written in C. However, it is also possible to install another parser under $PYTHONLIB/xml/parsers so that it can be selected. Setting up the parser is simple:
Listing 2: Python statement for selecting the best parser


import xml.sax
parser = xml.sax.make_parser()

You can also select a specific parser by passing parameters, but for portability, and for upward compatibility with future parsers, the best approach is simply to use make_parser().
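As a hedged illustration of "passing parameters": make_parser() accepts a list of module names to try before its defaults. The sketch below is written for current Python 3 rather than the Python 2.0 discussed in this article, and it names xml.sax.expatreader (the standard library's expat driver) only because that module is certain to be present; a separately installed parser module would be named the same way.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader

# Portable form: let xml.sax pick whatever parser is available.
default_parser = xml.sax.make_parser()

# Explicit form: modules in the list are tried first, in order.
# "xml.sax.expatreader" is the stdlib expat driver, so this only
# illustrates the parameter; it does not select a different parser.
explicit_parser = xml.sax.make_parser(["xml.sax.expatreader"])

print(type(default_parser).__name__, type(explicit_parser).__name__)
```

Both calls return an object with the same reader interface, which is why code written against make_parser() stays portable.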

You can also import xml.parsers.expat directly. If you do, you gain some special features that the SAX interface does not offer. In that sense, xml.parsers.expat is somewhat "lower level" than SAX. But SAX is very standard and well suited to stream-oriented processing; most of the time the SAX level is just right. In general, the difference in raw speed is small, because the parser returned by make_parser() already gets the performance that expat provides.
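For comparison, here is a minimal sketch of using expat directly, written for current Python 3. The one-line document and the handler function names are hypothetical; the buffer_text switch is one example of an expat-specific setting that is not part of the SAX interface.

```python
import xml.parsers.expat

events = []  # (event, payload...) tuples in document order

def on_start(name, attrs):
    events.append(("start", name, dict(attrs)))

def on_end(name):
    events.append(("end", name))

def on_text(data):
    if data.strip():
        events.append(("text", data))

p = xml.parsers.expat.ParserCreate()
p.buffer_text = True              # expat-specific: deliver text in one chunk
p.StartElementHandler = on_start
p.EndElementHandler = on_end
p.CharacterDataHandler = on_text

# Hypothetical one-quotation document, not the article's sample.xml.
p.Parse('<quotation source="anon">Hello</quotation>', True)

print(events)
```

Handlers are plain callables assigned to attributes of the parser object, rather than methods of a registered handler class as in SAX.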

What is SAX

For background, here is a good answer to the question of what SAX is:

SAX, the Simple API for XML, is a common parser interface for XML parsers. It allows application authors to write applications that use XML parsers but are independent of which parser is actually used. (Think of it as JDBC for XML.) (Lars Marius Garshol, SAX for Python)

SAX -- like the APIs of the parser modules it supports -- is basically a sequential processor of an XML document. You use it much as in the xmllib example, but at a more abstract level. Instead of a parser class, the application programmer defines a handler class that can be registered with whichever parser is used. SAX defines four interfaces (each with several methods): DocumentHandler (named ContentHandler in Python 2.0's xml.sax.handler), DTDHandler, EntityResolver, and ErrorHandler. Creating a parser also attaches default implementations of these interfaces unless you override them. The following code performs the same task as the xmllib example:
Listing 3: try_sax.py


"Simple SAX example, updated for Python 2.0+"
import string
import xml.sax
from xml.sax.handler import *

class QuotationHandler(ContentHandler):
    """Crude extractor for quotations.dtd compliant XML document"""

    def __init__(self):
        self.in_quote = 0
        self.thisquote = ''

    def startDocument(self):
        print '--- Begin Document ---'

    def startElement(self, name, attrs):
        if name == 'quotation':
            print 'QUOTATION:'
            self.in_quote = 1
        else:
            self.thisquote = self.thisquote + '{'

    def endElement(self, name):
        if name == 'quotation':
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''
            self.in_quote = 0
        else:
            self.thisquote = self.thisquote + '}'

    def characters(self, ch):
        if self.in_quote:
            self.thisquote = self.thisquote + ch

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = QuotationHandler()
    parser.setContentHandler(handler)
    parser.parse("sample.xml")

In contrast to xmllib, there are two small things to notice in the example above: the .parse() method handles an entire stream or string, so you do not need to write a loop for the parser; and .parse() is flexible enough to accept a file name, a file object, or various file-like objects (for example, those with a .read() method).
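To illustrate the second point, the following sketch passes an in-memory file-like object to .parse() instead of a file name. It is a Python 3 variation on the handler idea, not the author's listing: the class name QuotationCollector, the SAMPLE document, and the choice to collect results in a list rather than print them are all assumptions made for the example, and braces are only recorded for tags inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuotationCollector(ContentHandler):
    """Collect the text of each <quotation>, marking nested tags with braces."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.thisquote = ""
        self.quotes = []

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:
            self.thisquote += "{"

    def endElement(self, name):
        if name == "quotation":
            # Collapse runs of whitespace into single spaces.
            self.quotes.append(" ".join(self.thisquote.split()))
            self.thisquote = ""
            self.in_quote = False
        elif self.in_quote:
            self.thisquote += "}"

    def characters(self, ch):
        if self.in_quote:
            self.thisquote += ch

# Hypothetical stand-in for sample.xml (no DTD reference).
SAMPLE = b"<quotations><quotation>To <em>be</em>\n or not</quotation></quotations>"

handler = QuotationCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))   # a file-like object instead of a file name

print(handler.quotes)
```

Because the handler is separate from the parser, the same QuotationCollector instance could be registered with any SAX parser without changes.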

Package: DOM

DOM is a high-level tree representation of an XML document. The model is not specific to Python; it is a general XML model (see resources for further information). Python's DOM package is built on SAX and is included in Python 2.0's standard XML support. Full code samples are not included in this article due to space limitations, but an excellent overall description is given in the XML-SIG's "Python/XML HOWTO":

The document object model specifies a tree representation for XML documents. The top-level document instance is the root of the tree, which has only one child, the top-level element instance. This element has child nodes representing content and child elements, which can also have children, and so on. The defined functions allow you to traverse the result tree at will, access element and attribute values, insert and delete nodes, and convert the tree back to XML.

DOM can be used to modify XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself and convert it to XML; producing XML output this way is more flexible than simply writing <tag1>...</tag1> to a file.

The syntax for using the xml.dom module has changed somewhat since the earlier articles. The DOM implementation that comes with Python 2.0 is called xml.dom.minidom and provides a lightweight, small version of the DOM. Naturally, some experimental features of the full XML-SIG DOM are not included in xml.dom.minidom, but you are unlikely to notice their absence.
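The HOWTO's description of building a tree by hand, modifying it, and converting it back to XML can be sketched with xml.dom.minidom as follows. This is a small illustration for current Python 3; the element names follow the quotations example, while the "source" attribute and the text values are made up.

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("quotations")       # top-level element
doc.appendChild(root)

# Add two <quotation> children; the "source" attribute is made up.
for source, text in [("a", "First"), ("b", "Second")]:
    quote = doc.createElement("quotation")
    quote.setAttribute("source", source)
    quote.appendChild(doc.createTextNode(text))
    root.appendChild(quote)

# Modify the tree: delete the first quotation again.
root.removeChild(root.firstChild)

xml_text = root.toxml()                      # convert the tree back to XML
print(xml_text)
```

Serializing from the tree means the nesting and the closing tags are generated for you, which is the flexibility the HOWTO refers to.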

Creating a DOM object is simple; you only need:
Listing 4: Creating a Python DOM object from an XML file


from xml.dom.minidom import parse, parseString
dom1 = parse('mydata.xml')    # parse an XML file by name

Working with DOM objects is fairly straightforward OOP-style work. However, you will often come across many list-like attributes at levels of the hierarchy that are not easy to tell apart at a glance (other than by looping through them). For example, here is a common snippet of Python DOM code:
Listing 5: Iterating through a Python DOM node object


for node in dom_node.childNodes:
    if node.nodeName == '#text':        # PCDATA is a kind of node,
        PCDATA = node.nodeValue         # but not a new subtag
    elif node.nodeName == 'spam':
        spam_node_list.append(node)     # Create list of <spam> nodes
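The snippet above assumes that dom_node and spam_node_list already exist. A self-contained variant, written for current Python 3 with a hypothetical in-memory document, might look like this; the <menu> and <eggs> tags are invented for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical document; <eggs> stands in for any tag we ignore.
dom = parseString("<menu>plain text<spam>1</spam><eggs/><spam>2</spam></menu>")
dom_node = dom.documentElement

pcdata = []
spam_node_list = []
for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:      # the nodes whose nodeName is '#text'
        pcdata.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_node_list.append(node)

spam_values = [n.firstChild.nodeValue for n in spam_node_list]
print(pcdata, spam_values)
```

Text and child elements arrive mixed together in childNodes, which is why the loop has to test each node to tell them apart.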

A more detailed example of DOM is available in the standard Python documentation. My earlier example of using DOM objects (see resources) still points in the right direction, but some method and attribute names have changed since that article was published, so please refer to the Python documentation.

Module: pyxie

The pyxie module is built on top of Python's standard XML support and provides additional high-level interfaces to XML documents. pyxie does two basic things: it converts XML documents into a line-based format that is easier to parse, and it provides a way to treat XML documents as operable trees. The line-based PYX format used by pyxie is language independent, and tools for it are available in several languages. In short, the PYX representation of a document is easier to process than its XML representation with common line-based text processing tools such as grep, sed, awk, bash, and perl, or with standard Python modules such as string and re. Depending on what you need to do, switching from XML to PYX may save a lot of work.
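To give a feel for what a line-based representation looks like, here is a rough sketch that emits PYX-style lines from SAX events in current Python 3. It does not use pyxie itself; the PyxWriter class is hypothetical, and the line prefixes ("(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag) reflect my understanding of the PYX notation, so check them against the pyxie documentation before relying on them.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Emit one PYX-style line per parsing event (illustrative, not pyxie)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for key in attrs.getNames():
            self.lines.append("A%s %s" % (key, attrs.getValue(key)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, ch):
        # Keep every record on one line by escaping newlines.
        self.lines.append("-" + ch.replace("\n", "\\n"))

writer = PyxWriter()
xml.sax.parseString(b'<quotation source="anon">Hello</quotation>', writer)
print("\n".join(writer.lines))

# Line-oriented filtering, in the spirit of grep: keep only character data.
text_only = [line[1:] for line in writer.lines if line.startswith("-")]
```

Once every event is a single line with a one-character prefix, extracting the text of a document is a matter of filtering lines rather than parsing markup.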

pyxie's concept of processing XML documents as trees is similar to the idea in DOM. Since the DOM standard is widely supported in many programming languages, most programmers will use the DOM standard instead of pyxie if the tree representation of the XML document is required.

More modules: xml_pickle and xml_objectify

I have developed my own high-level modules for handling XML, called xml_pickle and xml_objectify. I have written about these modules at length elsewhere (see resources), so there is no need to go into them here. These modules are useful when you "think in Python" rather than "think in XML". In particular, xml_objectify hides almost all traces of XML from the programmer, allowing you to work entirely with "native" Python objects in your program. The actual XML data format is abstracted away almost to the point of invisibility. Similarly, xml_pickle lets the Python programmer start with "native" Python objects, whose data can come from any source, and then serialize them into an XML format that other users may need later.