Working with XML files in Python: a detailed guide

2020-04-02 13:44:31
OfStack

There are many articles about reading XML with Python, but most of them just paste an XML file and then paste the code that processes it. That is not much help to beginners. I hope this article teaches you more directly how to read XML files with Python.

First, what is XML?

XML (Extensible Markup Language) is a markup language that can be used to tag data and define data types. It is also a source language that lets users define their own markup languages.

A sample XML file:



<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption> test </caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>

Structurally, it looks a lot like the HTML we use every day, but the two are designed for different purposes. HTML (HyperText Markup Language) is designed to display data, with a focus on how the data looks. XML is designed to transmit and store data, with a focus on the content of the data.

So it has the following characteristics:

It is made up of tag pairs: <aa></aa>

Tags can have attributes: <aa id='123'></aa>

Tag pairs can contain data: <aa>ABC</aa>

Tags can contain subtags, which gives the document a hierarchy, as <login> contains <caption> and <item> in the sample above.
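These characteristics can be checked from Python. The following is only a minimal sketch: the helper name is_well_formed and the sample strings are made up for illustration. It relies on xml.dom.minidom.parseString raising ExpatError when the text is not well-formed XML.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Hypothetical helper: True if text parses as XML, False otherwise."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


samples = [
    "<aa></aa>",            # a tag pair
    "<aa id='123'></aa>",   # a tag with an attribute
    "<aa>ABC</aa>",         # data inside a tag pair
    "<aa><bb></bb></aa>",   # a subtag inside a tag
    "<aa></Aa>",            # names differ in case, so this is not a pair
    "<aa>",                 # the closing tag is missing
]
results = [is_well_formed(sample) for sample in samples]
print(results)
```

The last two samples are rejected because tag names are case-sensitive and every opening tag needs a matching closing tag.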

Second, get the tag properties

Here is how to read this file in Python. The examples assume the sample above is saved as abc.xml.



# coding=utf-8
import xml.dom.minidom

# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element object
root = dom.documentElement

print(root.nodeName)      # catalog
print(root.nodeValue)     # None
print(root.nodeType)      # 1
print(root.ELEMENT_NODE)  # 1

The xml.dom.minidom module is used to process XML files, so it is imported first.

xml.dom.minidom.parse() opens an XML file, parses it, and returns a document object, which is stored in the dom variable.

documentElement gets the document element (the root tag) of the dom object, and the result is assigned to root.

Each node has nodeName, nodeValue, and nodeType properties.

nodeName is the name of the node.

nodeValue is the value of the node. It is None for element nodes and is mainly useful for text nodes.

nodeType is the type of the node. catalog is of type ELEMENT_NODE.

The following node types are defined:

'ATTRIBUTE_NODE'
'CDATA_SECTION_NODE'
'COMMENT_NODE'
'DOCUMENT_FRAGMENT_NODE'
'DOCUMENT_NODE'
'DOCUMENT_TYPE_NODE'
'ELEMENT_NODE'
'ENTITY_NODE'
'ENTITY_REFERENCE_NODE'
'NOTATION_NODE'
'PROCESSING_INSTRUCTION_NODE'
'TEXT_NODE'
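To see a few of these node types side by side, here is a small sketch. The one-line document and the variable names are assumptions made for illustration; it is parsed from a string, so abc.xml is not needed.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical one-line document, used only to illustrate node types.
dom = xml.dom.minidom.parseString(
    "<catalog><maxid>4</maxid><!-- note --></catalog>")

root = dom.documentElement   # the <catalog> element
maxid = root.firstChild      # the <maxid> element
text = maxid.firstChild      # the text node holding "4"
comment = root.lastChild     # the comment node

for node in (dom, root, maxid, text, comment):
    print(node.nodeName, node.nodeType, repr(node.nodeValue))
```

Only the text and comment nodes carry a nodeValue here; for the document and the elements it is None.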

Third, get the child tags

Now let's get the child tags of catalog by their tag names.




If you know the name of a child element, you can get it with the getElementsByTagName method:



# coding=utf-8
import xml.dom.minidom

# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element object
root = dom.documentElement

# getElementsByTagName returns a list; take its first entry
bb = root.getElementsByTagName('maxid')
b = bb[0]
print(b.nodeName)  # maxid

bb = root.getElementsByTagName('login')
b = bb[0]
print(b.nodeName)  # login

How to distinguish tags with the same tag name:

In the same XML file, <caption> and <item> each appear more than once. How do you tell them apart?



# coding=utf-8
import xml.dom.minidom

# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element object
root = dom.documentElement

# Pick the third <caption> in document order
bb = root.getElementsByTagName('caption')
b = bb[2]
print(b.nodeName)  # caption

# Pick the second <item> in document order
bb = root.getElementsByTagName('item')
b = bb[1]
print(b.nodeName)  # item

root.getElementsByTagName('caption') returns all the tags named caption as a list. bb[0] is the first tag in that list and bb[2] is the third.

Fourth, get the tag attribute values
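Indexing works, but it helps to know that getElementsByTagName searches every level below the node, not only its direct children. The sketch below embeds the sample file as a string so that it runs on its own; the variable names are illustrative. It shows two other ways to tell same-named tags apart: filtering the direct children, and looking at the parent tag.

```python
import xml.dom.minidom

# The sample file from above, embedded so the sketch is self-contained.
XML = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption> test </caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""

root = xml.dom.minidom.parseString(XML).documentElement

# Every <item> at any depth, in document order.
all_items = root.getElementsByTagName("item")

# Only the <item> tags that are direct children of <catalog>.
direct_items = [node for node in root.childNodes
                if node.nodeType == node.ELEMENT_NODE
                and node.tagName == "item"]

# Tell same-named tags apart by the name of their parent tag.
parents = [item.parentNode.nodeName for item in all_items]

print(len(all_items), len(direct_items), parents)
```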

In the same XML file, the <login> and <item> tags have attributes. How do you get their values?



# coding=utf-8
import xml.dom.minidom

# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element object
root = dom.documentElement

itemlist = root.getElementsByTagName('login')
item = itemlist[0]

un = item.getAttribute("username")
print(un)  # pytest

pd = item.getAttribute("passwd")
print(pd)  # 123456

ii = root.getElementsByTagName('item')

# The <item> nested inside <login> comes first in document order
i1 = ii[0]
i = i1.getAttribute("id")
print(i)  # 4

i2 = ii[1]
i = i2.getAttribute("id")
print(i)  # 2

The getAttribute method returns the value of the named attribute of an element.

Fifth, get the data between tag pairs

In the same XML file, there is data between the <caption> tag pairs. How do you get that data?

There are several ways to get the data between tag pairs.

Method 1:
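When it is not known in advance which attributes a tag has, minidom can also test for an attribute or list all of them. A short sketch, using only the <login> tag from the sample (embedded as a string); the attribute name email is made up to show what happens when an attribute is absent.

```python
import xml.dom.minidom

# The <login> tag from the sample, embedded so the sketch is self-contained.
dom = xml.dom.minidom.parseString(
    '<login username="pytest" passwd="123456"/>')
login = dom.documentElement

has_user = login.hasAttribute("username")    # is the attribute present?
missing = login.getAttribute("email")        # absent attribute: empty string
all_attrs = dict(login.attributes.items())   # every attribute as a dict

print(has_user, repr(missing), all_attrs)
```

Because getAttribute returns an empty string for a missing attribute, hasAttribute is the way to tell "missing" apart from "present but empty".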



# coding=utf-8
import xml.dom.minidom

# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element object
root = dom.documentElement

# All <caption> tags in document order
cc = dom.getElementsByTagName('caption')

c1 = cc[0]
print(c1.firstChild.data)  # Python

c2 = cc[1]
print(c2.firstChild.data)  # " test " (the surrounding spaces are kept)

c3 = cc[2]
print(c3.firstChild.data)  # Zope

The firstChild property returns the first child node of the selected node (here, the text node inside the tag), and .data returns that node's data.
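firstChild.data assumes that the tag actually contains text. For an empty tag, firstChild is None and the attribute access fails. One possible way around this is a small helper; get_text below is a hypothetical name, not part of minidom, and the two-caption document is made up for the example.

```python
import xml.dom.minidom


def get_text(element):
    """Hypothetical helper: join the direct text nodes and strip whitespace."""
    parts = [node.data for node in element.childNodes
             if node.nodeType == node.TEXT_NODE]
    return "".join(parts).strip()


dom = xml.dom.minidom.parseString(
    "<catalog><caption> test </caption><caption></caption></catalog>")
captions = dom.getElementsByTagName("caption")

first = get_text(captions[0])   # surrounding spaces removed
empty = get_text(captions[1])   # no text node, so the result is ""

print(repr(first), repr(empty))
```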

Method 2:



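# coding=utf-8
from xml.etree import ElementTree as ET

per = ET.parse('abc.xml')

# Child tags of every <item> under <login>
p = per.findall('./login/item')
for oneper in p:
    # Iterating over an element yields its child tags
    # (getchildren() was removed in Python 3.9)
    for child in oneper:
        print(child.tag, ':', child.text)

# Child tags of every <item> directly under the root
p = per.findall('./item')
for oneper in p:
    for child in oneper:
        print(child.tag, ':', child.text)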

Method 2 is a bit more complicated, and it uses a different module from the previous examples. findall specifies the level of tags under which the traversal starts.

Iterating over an element (which is what the getchildren method did before it was removed in Python 3.9) yields all of its child tags in document order. The loop then prints each tag name (child.tag) and its data (child.text).

Strictly speaking, Method 2 is not a way of reading one particular tag's data. Its core function is to iterate over all the child tags under a given level of tags.
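ElementTree can also read a specific tag's data or attribute directly. The following sketch again embeds the sample file as a string so it runs without abc.xml; the variable names are illustrative. It uses findtext, find, get, and iter from xml.etree.ElementTree.

```python
from xml.etree import ElementTree as ET

# The sample file from above, embedded so the sketch is self-contained.
XML = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd="123456">
        <caption>Python</caption>
        <item id="4">
            <caption> test </caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""

root = ET.fromstring(XML)

max_id = root.findtext("maxid")        # text of a direct child tag
login = root.find("login")             # first direct child named login
username = login.get("username")       # attribute value

# iter() walks every level below root, in document order.
captions = [c.text.strip() for c in root.iter("caption")]
item_ids = [item.get("id") for item in root.iter("item")]

print(max_id, username, captions, item_ids)
```

Here find and findtext look only at direct children (unless given a longer path), while iter searches the whole subtree, much like getElementsByTagName in minidom.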

PS: here are a few online XML tools for reference:

Online XML/JSON conversion tool:
http://tools.jb51.net/code/xmljson

Online XML formatting/compression tool:
http://tools.jb51.net/code/xmlformat

Online XML compression/formatting tool:
http://tools.jb51.net/code/xml_format_compress