Example analysis of common methods for Python access to XML

  • 2020-05-27 06:01:05
  • OfStack

This article illustrates a common way for Python to access XML. I will share it with you for your reference as follows:

Currently, Python 3.2 has access to XML in the following four ways:

1.Expat
2.DOM
3.SAX
4.ElementTree

Take xml below as the basis for discussion


<?xml version="1.0" encoding="utf-8"?>
<Schools>
  <School Name="XiDian">
    <Class Id="030612">
      <Student Name="salomon">
        <Scores>
          <Math>98</Math>
          <English>85</English>
          <physics>89</physics>
        </Scores>
      </Student>
      <Student Name="Jupiter">
        <Scores>
          <Math>74</Math>
          <English>83</English>
          <physics>69</physics>
        </Scores>
      </Student>
    </Class>
    <Class Id="030611">
      <Student Name="Venus">
        <Scores>
          <Math>98</Math>
          <English>85</English>
          <physics>89</physics>
        </Scores>
      </Student>
      <Student Name="Mars">
        <Scores>
          <Math>74</Math>
          <English>83</English>
          <physics>69</physics>
        </Scores>
      </Student>
    </Class>
  </School>
</Schools>

Expat

Expat is a stream-oriented parser. You register the parser callback (or handler) function, and then start searching for its documentation. When the parser recognizes the specified location of the file, it calls the corresponding handler for that section (if you have already registered one). The file is sent to the parser, which splits it into segments and loads it into memory. So expat can parse those huge files.

SAX

SAX is a sequential access XML parser API, a parser that implements SAX (i.e. "SAX Parser") in the form of a streaming parser with event-driven API. The callback function is defined by the consumer and is called when an event occurs. Events are raised when any 1XML feature is encountered, and again when they are encountered at the end. The XML attribute is also part 1 of the event data passed to the element. Single directional when SAX is processed; Parsed data cannot be read again without starting again.

DOM

DOM parser in any processing prior to the start, it is necessary to put the tree in memory, so the memory usage DOM parser completely according to the size of the input data (relatively speaking, SAX parser memory contents, is only based on the maximum depth of XML files (XML maximum depth of the tree) and a single 1 XML XML attribute to store the biggest information) on the project.

DOM is implemented in python3.2 in two ways:

1.xml.minidom is a basic implementation.
2.xml.pulldom only builds subtrees that are accessed when needed.


'''
Created on 2012-5-25
@author: salomon
'''
import xml.dom.minidom as minidom
dom = minidom.parse("E:\\test.xml")
root = dom.getElementsByTagName("Schools") #The function getElementsByTagName returns NodeList.
print(root.length)
for node in root: 
  print("Root element is %s . " %node.tagName)#  Format the output, and C Series languages are quite different. 
  schools = node.getElementsByTagName("School")
  for school in schools:
    print(school.nodeName)
    print(school.tagName)
    print(school.getAttribute("Name"))
    print(school.attributes["Name"].value)
    classes = school.getElementsByTagName("Class")
    print("There are %d classes in school %s" %(classes.length, school.getAttribute("Name")))
    for mclass in classes:
      print(mclass.getAttribute("Id"))
      for student in mclass.getElementsByTagName("Student"):
        print(student.attributes["Name"].value)
        print(student.getElementsByTagName("English")[0].nodeValue) # Why is that? 
        print(student.getElementsByTagName("English")[0].childNodes[0].nodeValue)
        student.getElementsByTagName("English")[0].childNodes[0].nodeValue = 75
f = open('new.xml', 'w', encoding = 'utf-8')
dom.writexml(f,encoding = 'utf-8')
f.close()

ElementTree

At present, the information of ElementTree found is relatively small, and its working mechanism is not known at present. The data shows that ElementTree is close to a lightweight DOM, but all Element nodes of ElementTree work in a 1-way manner. It is very similar to XpathNavigator in C#.


'''
Created on 2012-5-25
@author: salomon
'''
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse("E:\\test.xml")
root = tree.getroot()
print(root.tag)
print(root[0].tag)
print(root[0].attrib)
schools = root.getchildren() 
for school in schools:
  print(school.get("Name"))
  classes = school.findall("Class")
  for mclass in classes:
    print(mclass.items())
    print(mclass.keys())
    print(mclass.attrib["Id"])
    math = mclass.find("Student").find("Scores").find("Math")
    print(math.text)
    math.set("teacher", "bada")
tree.write("new.xml")

Compare:

For all of the above, Expat and SAX parse XML the same way, but don't know how the performance compares. DOM consumes memory compared to the above two parsers and is relatively slow to process files due to access time. If the file is too large to load into memory, the DOM parser will not work, but DOM is the only option for some kinds of XML validation that require access to the entire file, or for some XML processing that only requires access to the entire file.

Note:

It is important to note that these technologies for accessing XML are not unique to Python, and Python was borrowed from or directly imported from other languages. For example, Expat is a development library developed in the C language to parse XML documents. While SAX was originally developed by DavidMegginson in java language, DOM can access and modify the content and structure of a document in a platform-independent and language-independent manner. It can be applied to any programming language.

For comparison, I would also like to list 1 the way C# accesses XML documents:

1. XmlDocument based on DOM
2. Streaming file based XmlReader and XmlWriter (it is different from the streaming file implementation of SAX, which is an event-driven model).
3. Linq to Xml

Two models for streaming files: XmlReader/XMLWriter VS SAX

The flow model iterates one node in the XML document at a time, which is suitable for processing larger documents and consumes less memory space. There are two variants of the flow model -- the "push" model and the "pull" model.

Push model is also known as SAX, SAX is a kind of event-driven model, that is to say, it USES push model to raise an event every time it finds a node, and we have to write a program to handle these events, which is very inflexible and troublesome.

. NET used in the implementation of the scheme is based on a "pull" model and "pull" model in traverse the document they interested in the part of documents from a reader, do not need to trigger events, allowing us to programmatically access to documents, it greatly improves the flexibility, on the performance of the "pull" model can be selectively processing nodes, and each SAX found 1 node will notify the client, thus, use "pull" model can improve the overall efficiency of Application.

PS: here are a few more online tools for you to use about xml:

Online XML/JSON interconversion tool:
http://tools.ofstack.com/code/xmljson

Online formatting XML/ online compression XML:
http://tools.ofstack.com/code/xmlformat

XML online compression/formatting tool:
http://tools.ofstack.com/code/xml_format_compress

XML code online formatting beautification tool:
http://tools.ofstack.com/code/xmlcodeformat

More about Python related topics: interested readers to view this site "Python xml data operation skill summary", "Python data structure and algorithm tutorial", "Python Socket programming skills summary", "Python function using techniques", "Python string skills summary", "Python introduction and advanced tutorial" and "Python file and directory skills summary"

I hope this article is helpful to you Python programming.


Related articles: