Parsing XML in Java with XPath and DOM4J

2020-04-01 02:51:24
OfStack

1 Four ways to parse XML files

There are four classic ways to parse XML files. The two basic ones are SAX and DOM: SAX parses the document as a stream of events, while DOM parses it into a document tree. JDOM was created on top of these to reduce the amount of code that DOM and SAX require. Following the 80/20 (Pareto) principle, it covers the most common tasks, such as parsing and creating documents, with far less code. Underneath, however, JDOM still uses SAX (most commonly), DOM, or Xalan documents. The fourth is DOM4J, an open source Java XML API with excellent performance, rich features, and an interface that is very easy to use. More and more Java software uses DOM4J to read and write XML, including Sun's JAXM. Detailed introductions to all four approaches are easy to find with a web search.
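The difference between the two basic approaches can be sketched with the parsers that ship with the JDK. This is only a minimal illustration: the class name SaxVsDomSketch and the small sample document are assumptions made for this example, not part of the original article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDomSketch {
    // Hypothetical sample document, used only for this illustration.
    static final String XML = "<books><book>SAX</book><book>DOM</book></books>";

    // SAX: the parser pushes events to a handler; no tree is kept in memory.
    static List<String> elementNamesWithSax(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes attrs) {
                            names.add(qName); // one event per opening tag
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    // DOM: the whole document is loaded into a tree that is queried afterwards.
    static List<String> bookTextsWithDom(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList books = doc.getElementsByTagName("book");
            for (int i = 0; i < books.getLength(); i++) {
                texts.add(books.item(i).getTextContent());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(elementNamesWithSax(XML)); // prints [books, book, book]
        System.out.println(bookTextsWithDom(XML));    // prints [SAX, DOM]
    }
}
```

SAX suits large files that only need one pass, while DOM is more convenient when the document has to be navigated or queried repeatedly.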

2 Brief introduction to XPath

XPath is a language for finding information in XML documents. It is used to navigate through the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions, so an understanding of XPath is the basis for many advanced XML applications. XPath plays a role similar to SQL for databases or jQuery for web pages: it makes it easy for developers to grab what they need from a document. DOM4J also supports XPath.
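To give a feel for the syntax, here is a small sketch that evaluates a few XPath expressions with the javax.xml.xpath API from the JDK. The class name XPathBasicsSketch and the library/book document are made up for this example, and the document deliberately has no namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathBasicsSketch {
    // Hypothetical document without namespaces, used only for this illustration.
    static final String XML =
            "<library>"
          + "<book id=\"b1\"><title>First</title></book>"
          + "<book id=\"b2\"><title>Second</title></book>"
          + "</library>";

    // Evaluates an XPath expression and returns its string value.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Absolute path with a positional predicate
        System.out.println(evaluate(XML, "/library/book[2]/title"));   // prints Second
        // Search anywhere in the document, filtered by attribute value
        System.out.println(evaluate(XML, "//book[@id='b1']/title"));   // prints First
        // XPath functions can be evaluated as well
        System.out.println(evaluate(XML, "count(//book)"));            // prints 2
    }
}
```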

3 Using XPath with DOM4J

To parse XML documents with XPath in DOM4J, you first need to add two jars to your project:

dom4j-1.6.1.jar: the dom4j package, download address: http://sourceforge.net/projects/dom4j/;

jaxen-xx.xx.jar: if this package is not added, an exception is usually thrown (java.lang.NoClassDefFoundError: org/jaxen/JaxenException), download address: http://www.jaxen.org/releases.html.
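If it is unclear whether both jars actually ended up on the classpath, a quick check like the following can help before running any XPath query. This is a hedged sketch: the class ClasspathCheckSketch and its helper are hypothetical, and only the two class names come from the libraries mentioned above.

```java
class ClasspathCheckSketch {
    // Returns true if the named class can be found by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only test for presence
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("dom4j present: " + isOnClasspath("org.dom4j.io.SAXReader"));
        // The class named in the NoClassDefFoundError quoted above
        System.out.println("jaxen present: " + isOnClasspath("org.jaxen.JaxenException"));
    }
}
```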

3.1 Namespace interference

When working with XML files converted from Excel files or other formats, you often find that an XPath query returns no results. This is usually due to a namespace. Taking the following XML file as an example, a simple query with XPath = "//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]" usually finds nothing. The cause is the default namespace declared on the root element (xmlns="urn:schemas-microsoft-com:office:spreadsheet"): every unprefixed element in the file belongs to that namespace, while an unprefixed name in an XPath expression only matches elements that are in no namespace.



<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
  <Worksheet ss:Name="Sheet1">
    <Table ss:ExpandedColumnCount="81" ss:ExpandedRowCount="687" x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="52.5" ss:DefaultRowHeight="15.5625">
      <Row ss:AutoFitHeight="0">
        <Cell>
          <Data ss:Type="String">The rat that knocks the code</Data>
        </Cell>
      </Row>
      <Row ss:AutoFitHeight="0">
        <Cell>
          <Data ss:Type="String">Sunny</Data>
        </Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
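The effect is easy to reproduce. The sketch below uses the JDK's namespace-aware DOM parser and javax.xml.xpath instead of DOM4J, purely so that it runs without extra jars; the class name NamespaceInterferenceSketch and the trimmed-down workbook are assumptions for this example. The same XPath 1.0 rule about unprefixed names is what makes the plain query fail.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceInterferenceSketch {
    // Trimmed-down workbook with the same default namespace as the file above.
    static final String XML =
            "<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\">"
          + "<Worksheet><Table>"
          + "<Row><Cell><Data>Sunny</Data></Cell></Row>"
          + "</Table></Worksheet></Workbook>";

    // Counts the nodes selected by an XPath expression on a namespace-aware DOM.
    static int countMatches(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match elements in no namespace: nothing is found
        System.out.println(countMatches(XML,
                "//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]")); // prints 0
        // local-name() ignores the namespace, so the element is found
        System.out.println(countMatches(XML,
                "//*[local-name()='Row'][1]/*[local-name()='Cell'][1]/*[local-name()='Data'][1]")); // prints 1
    }
}
```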

3.2 Parsing XML files with namespaces using XPath

The first method (the read1() function): use the local-name() and namespace-uri() functions that are part of XPath to specify the node name and namespace you want. The resulting XPath expressions are cumbersome to write.

The second method (the read2() function): set the namespace on the XPath object with the setNamespaceURIs() function.

The third method (the read3() function): set the namespace on the DocumentFactory with the setXPathNamespaceURIs() function. With the second and third methods, the XPath expressions are relatively easy to write.

The fourth method (the read4() function): the same as the third, but with a different XPath expression (see the program for details). Its main purpose is to test whether the completeness of the XPath expression affects retrieval efficiency.

(The four methods above all parse the XML file with DOM4J and XPath.)

The fifth method (the read5() function): parse the XML file with DOM combined with XPath, mainly to check for performance differences.

Nothing says it like code, so straight to the code!



package XPath;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.XPath;
import org.dom4j.io.SAXReader;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * Parses XML with XPath, using DOM4J and DOM.
 */
public class TestDom4jXpath {

    public static void main(String[] args) {
        read1();
        read2();
        read3();
        read4(); // same as read3(), but with a different XPath expression
        read5();
    }

    public static void read1() {
        /*
         * Use local-name() and namespace-uri() in XPath.
         */
        try {
            long startTime = System.currentTimeMillis();
            SAXReader reader = new SAXReader();
            // Resource paths use '/' as the separator, not '\'
            InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
            Document doc = reader.read(in);
            // Fully qualified alternative (the last step is cut off in the original listing):
            // String xpath = "//*[local-name()='Workbook' and namespace-uri()='urn:schemas-microsoft-com:office:spreadsheet']"
            //         + "/*[local-name()='Worksheet']"
            //         + "/*[local-name()='Table']"
            //         + "/*[local-name()='Row'][4]"
            //         + "/*[local-name()='Cell'][3]"
            //         + "
            String xpath = "//*[local-name()='Row'][4]/*[local-name()='Cell'][3]/*[local-name()='Data'][1]";
            System.err.println("===== use local-name() and namespace-uri() in XPath ====");
            System.err.println("XPath : " + xpath);
            // The result list is untyped here, so each item is cast to Element
            List<?> list = doc.selectNodes(xpath);
            for (Object o : list) {
                Element e = (Element) o;
                String show = e.getStringValue();
                System.out.println("show=" + show);
            }
            // Report the elapsed time once, after all matches have been printed
            long endTime = System.currentTimeMillis();
            System.out.println("Program running time: " + (endTime - startTime) + "ms");
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    public static void read2() {
        /*
         * Set the namespace on the XPath object (setNamespaceURIs).
         */
        try {
            long startTime = System.currentTimeMillis();
            // "Workbook" serves as the namespace prefix inside the XPath expression
            Map<String, String> map = new HashMap<String, String>();
            map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
            SAXReader reader = new SAXReader();
            // Resource paths use '/' as the separator, not '\'
            InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
            Document doc = reader.read(in);
            String xpath = "//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
            System.err.println("===== use setNamespaceURIs() to set XPath namespace ====");
            System.err.println("XPath : " + xpath);
            XPath x = doc.createXPath(xpath);
            x.setNamespaceURIs(map);
            // The result list is untyped here, so each item is cast to Element
            List<?> list = x.selectNodes(doc);
            for (Object o : list) {
                Element e = (Element) o;
                String show = e.getStringValue();
                System.out.println("show=" + show);
            }
            // Report the elapsed time once, after all matches have been printed
            long endTime = System.currentTimeMillis();
            System.out.println("Program running time: " + (endTime - startTime) + "ms");
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    public static void read3() {
        /*
         * Set the namespace on the DocumentFactory (setXPathNamespaceURIs).
         */
        try {
            long startTime = System.currentTimeMillis();
            // "Workbook" serves as the namespace prefix inside the XPath expression
            Map<String, String> map = new HashMap<String, String>();
            map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
            SAXReader reader = new SAXReader();
            // Resource paths use '/' as the separator, not '\'
            InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
            // Register the namespace before reading, so every XPath on this document sees it
            reader.getDocumentFactory().setXPathNamespaceURIs(map);
            Document doc = reader.read(in);
            String xpath = "//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
            System.err.println("===== use setXPathNamespaceURIs() to set DocumentFactory() namespace ====");
            System.err.println("XPath : " + xpath);
            // The result list is untyped here, so each item is cast to Element
            List<?> list = doc.selectNodes(xpath);
            for (Object o : list) {
                Element e = (Element) o;
                String show = e.getStringValue();
                System.out.println("show=" + show);
            }
            // Report the elapsed time once, after all matches have been printed
            long endTime = System.currentTimeMillis();
            System.out.println("Program running time: " + (endTime - startTime) + "ms");
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    public static void read4() {
        /*
         * Same as the read3() method, but with a different (more complete) XPath expression.
         */
        try {
            long startTime = System.currentTimeMillis();
            // "Workbook" serves as the namespace prefix inside the XPath expression
            Map<String, String> map = new HashMap<String, String>();
            map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
            SAXReader reader = new SAXReader();
            // Resource paths use '/' as the separator, not '\'
            InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
            reader.getDocumentFactory().setXPathNamespaceURIs(map);
            Document doc = reader.read(in);
            String xpath = "//Workbook:Worksheet/Workbook:Table/Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
            System.err.println("===== use setXPathNamespaceURIs() to set DocumentFactory() namespace ====");
            System.err.println("XPath : " + xpath);
            // The result list is untyped here, so each item is cast to Element
            List<?> list = doc.selectNodes(xpath);
            for (Object o : list) {
                Element e = (Element) o;
                String show = e.getStringValue();
                System.out.println("show=" + show);
            }
            // Report the elapsed time once, after all matches have been printed
            long endTime = System.currentTimeMillis();
            System.out.println("Program running time: " + (endTime - startTime) + "ms");
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    public static void read5() {
        /*
         * DOM and XPath.
         */
        try {
            long startTime = System.currentTimeMillis();
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Namespace awareness is switched off so that unprefixed names can be used in the XPath
            dbf.setNamespaceAware(false);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            // Resource paths use '/' as the separator, not '\'
            InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
            org.w3c.dom.Document doc = builder.parse(in);
            XPathFactory factory = XPathFactory.newInstance();
            javax.xml.xpath.XPath x = factory.newXPath();
            // Select the first Data element in the third Cell of the fourth Row
            String xpath = "//Workbook/Worksheet/Table/Row[4]/Cell[3]/Data[1]";
            System.err.println("===== DOM XPath ====");
            System.err.println("XPath : " + xpath);
            XPathExpression expr = x.compile(xpath);
            // NODESET is the return type that corresponds to a NodeList
            NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() is null for element nodes, so read the text content instead
                System.out.println("show=" + nodes.item(i).getTextContent());
            }
            // Report the elapsed time once, after all matches have been printed
            long endTime = System.currentTimeMillis();
            System.out.println("Program running time: " + (endTime - startTime) + "ms");
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
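The fifth method sidesteps the namespace by switching namespace awareness off. If the DOM parser has to stay namespace-aware, one possible alternative is to register a prefix through a javax.xml.namespace.NamespaceContext, which plays a role comparable to setNamespaceURIs() in the second method. The following is only a sketch under that assumption: the class name NamespaceContextSketch, the trimmed-down workbook, and the choice of "ss" as the XPath prefix are made up for this example and are not taken from the program above.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceContextSketch {
    static final String SS = "urn:schemas-microsoft-com:office:spreadsheet";

    // Trimmed-down workbook that uses the spreadsheet namespace as its default.
    static final String XML =
            "<Workbook xmlns=\"" + SS + "\">"
          + "<Worksheet><Table>"
          + "<Row><Cell><Data>Sunny</Data></Cell></Row>"
          + "</Table></Worksheet></Workbook>";

    // Binds the prefix "ss" (chosen for this sketch) to the spreadsheet namespace.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "ss".equals(prefix) ? SS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return SS.equals(namespaceURI) ? "ss" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return SS.equals(namespaceURI)
                    ? Collections.singletonList("ss").iterator()
                    : Collections.<String>emptyList().iterator();
        }
    };

    // Returns the text of the first Data element in the first Cell of the first Row.
    static String firstCellText(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // keep namespaces, unlike read5()
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            return xpath.evaluate("//ss:Row[1]/ss:Cell[1]/ss:Data[1]", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstCellText(XML)); // prints Sunny
    }
}
```

The prefix used in the XPath expression does not have to match any prefix in the file; only the namespace URI it is bound to matters.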
