Using libxml2 in C and C++ to efficiently output large XML files

  • 2020-06-01 10:22:33
  • OfStack

preface

libxml2 is an XML parser written in C. It was originally developed for the GNOME project and is free, open-source software released under the MIT License. Besides the C API, bindings exist for C++, PHP, Pascal, Ruby, Tcl, and other languages, and the library runs on Windows, Linux, Solaris, Mac OS X, and other platforms. It is quite powerful and should meet the needs of most users.

libxml2 common data types

xmlChar is the character type in libxml2; every character and string in the library is based on it.

xmlChar * is the corresponding pointer type. Many functions return an xmlChar * that points to dynamically allocated memory, so remember to release it with xmlFree after use; otherwise the memory leaks. For example:


xmlChar *name = xmlNodeGetContent(CurNode);
if (name != NULL) {
    /* data.name must be large enough to hold the content */
    strcpy(data.name, (const char *)name);
    xmlFree(name);
}

Other common types and macros:

xmlDoc, xmlDocPtr // document struct and pointer to it
xmlNode, xmlNodePtr // node struct and pointer to it
xmlAttr, xmlAttrPtr // node attribute struct and pointer to it
xmlNs, xmlNsPtr // node namespace struct and pointer to it
BAD_CAST // a macro that casts its argument to xmlChar *
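The free-after-use rule above is easy to miss on an early return. In C++, one way to make it automatic is to hand the pointer to a std::unique_ptr with a custom deleter. The sketch below deliberately does not link against libxml2: fake_get_content and fake_free are hypothetical stand-ins for a call such as xmlNodeGetContent and for xmlFree, and CStringPtr and copy_name are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Counts how many buffers the stand-in deleter has released (demo only).
static int g_free_calls = 0;

// Hypothetical stand-in for a library call that returns a heap-allocated string.
unsigned char* fake_get_content(const char* text) {
    assert(text != nullptr);
    const std::size_t n = std::strlen(text) + 1;
    unsigned char* p = static_cast<unsigned char*>(std::malloc(n));
    if (p != nullptr) std::memcpy(p, text, n);
    return p;
}

// Hypothetical stand-in for xmlFree.
void fake_free(void* p) {
    ++g_free_calls;
    std::free(p);
}

// Deleter type so that unique_ptr releases the buffer with the matching free function.
struct FakeFreeDeleter {
    void operator()(unsigned char* p) const { fake_free(p); }
};
using CStringPtr = std::unique_ptr<unsigned char, FakeFreeDeleter>;

// Copies the content into a std::string; the buffer is released on every exit path.
std::string copy_name(const char* source) {
    CStringPtr name(fake_get_content(source));
    if (!name) return std::string();
    return std::string(reinterpret_cast<const char*>(name.get()));
}
```

With real libxml2 code the same shape would presumably use xmlChar as the element type and call xmlFree inside the deleter.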

scenario

1. libxml2 is the de facto standard C/C++ library for reading and writing XML, and it ships by default on Linux and macOS. Windows has its own MSXML, so libxml2 is not included there by default.

2. expat, a SAX-style XML reading library, is also a good choice, but it does not support writing.

3. The usual way to write XML with such a library is to build an entire DOM tree and then serialize it to XML text, using either the library's own write functions or the standard I/O functions. This has a major drawback: building the DOM tree makes memory usage surge, serializing the tree to an in-memory buffer makes it surge again, and only then is the buffer written to a file.
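To make the cost described in the last point concrete, the sketch below contrasts the two strategies using only the C++ standard library, not libxml2. render_item, write_whole, write_chunked, and outputs_match are hypothetical names, and the returned figure only counts bytes of serialized text held at once, which is a rough proxy for real memory use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Serializes one hypothetical <one> element; stands in for dumping a single node.
std::string render_item(int index) {
    assert(index >= 0);
    return "<one><node1>item " + std::to_string(index) + "</node1></one>";
}

// Whole-document strategy: accumulate all text in memory, then write it once.
// Returns the size of the fully accumulated text.
std::size_t write_whole(std::ostream& out, int count) {
    std::string all = "<ROOT>";
    for (int i = 0; i < count; ++i) all += render_item(i);
    all += "</ROOT>";
    out << all;
    return all.size();
}

// Chunked strategy: write each element as soon as it is serialized, then drop it.
// Returns the size of the largest single element that was held.
std::size_t write_chunked(std::ostream& out, int count) {
    std::size_t peak = 0;
    out << "<ROOT>";
    for (int i = 0; i < count; ++i) {
        const std::string item = render_item(i);  // only one element is alive here
        peak = std::max(peak, item.size());
        out << item;
    }
    out << "</ROOT>";
    return peak;
}

// Runs both strategies on in-memory streams and reports whether the bytes match.
bool outputs_match(int count) {
    std::ostringstream whole, chunked;
    write_whole(whole, count);
    write_chunked(chunked, count);
    return whole.str() == chunked.str();
}
```

Both functions produce identical bytes, but the chunked version's held text stays at the size of one element no matter how many elements are written.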

instructions

1. Storing a DOM structure wastes memory when there is a large amount of data, because the tree also has to hold the parent-child relationships of elements, text values, attribute values, and so on. It is better to output the data element by element and free each element's memory right after it is written, so that memory is used as efficiently as possible.

2. Outputting element by element also makes the best use of system resources such as I/O: each piece can be sent through whatever output function fits, for example one that is subject to permission restrictions, or one that writes to the user interface.
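The second point can be pictured as writing through a small "sink" callback, so that the code producing elements does not care where the bytes go. The following is a minimal sketch in standard C++, not libxml2 API: Sink, escape_text, emit_elements, and file_sink are hypothetical names, and only the five predefined XML entities are escaped.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// A sink receives each serialized chunk; it may write to a file, a socket, a UI, etc.
using Sink = std::function<void(const std::string&)>;

// Replaces the five predefined XML special characters with entity references.
std::string escape_text(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Emits <ROOT>, one <one> element per value, then </ROOT>, chunk by chunk.
void emit_elements(const std::vector<std::string>& values, const Sink& sink) {
    assert(sink);  // an empty std::function would throw when called
    sink("<ROOT>");
    for (const std::string& v : values) {
        sink("<one>" + escape_text(v) + "</one>");
    }
    sink("</ROOT>");
}

// Example sink that forwards each chunk to a C FILE* with fwrite.
Sink file_sink(std::FILE* file) {
    return [file](const std::string& chunk) {
        std::fwrite(chunk.data(), 1, chunk.size(), file);
    };
}
```

Swapping file_sink for a lambda that appends to a string, or one that updates a progress display, changes the destination without touching emit_elements.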

example

The following example uses libxml2 on Windows, built with MinGW, and opens the output file with _wfopen so that Unicode file paths are supported.


#include <libxml/parser.h>
#include <libxml/tree.h>
#include <cstdio>
#include <cstring>
#include <memory>

void TestStandardIOForXml()
{
    xmlNodePtr one_node = NULL, node = NULL, node1 = NULL; /* node pointers */

    // The document holds no nodes; it only supplies the XML declaration
    // and serves as context for xmlNodeDump.
    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    if (!doc)
        return;
    std::shared_ptr<xmlDoc> sp_doc(doc, [](xmlDocPtr d) {
        xmlFreeDoc(d);
    });

    // _wfopen accepts a wide-character (Unicode) path on Windows.
    FILE* file = _wfopen(L"test.xml", L"wb");
    if (!file)
        return;
    std::shared_ptr<FILE> sp_file(file, [](FILE* f) {
        fclose(f);
    });

    // Write the XML declaration by dumping the empty document.
    xmlChar* doc_buf = NULL;
    int size = 0;
    xmlDocDumpMemoryEnc(doc, &doc_buf, &size, "UTF-8");
    std::shared_ptr<xmlChar> sp_xc(doc_buf, [](xmlChar* p) {
        if (p)
            xmlFree(p);
    });
    if (doc_buf && size > 0)
        fwrite(doc_buf, 1, (size_t)size, file);

    // Reusable buffer that receives one serialized element at a time.
    xmlBufferPtr buf = xmlBufferCreate();
    if (!buf)
        return;
    std::shared_ptr<xmlBuffer> sp_buf(buf, [](xmlBufferPtr b) {
        xmlBufferFree(b);
    });

    const char* kRootBegin = "<ROOT>";
    fwrite(kRootBegin, 1, strlen(kRootBegin), file);
    for (int i = 0; i < 10; ++i) {
        // Build one standalone <one> element with its children.
        one_node = xmlNewNode(NULL, BAD_CAST "one");
        xmlNewChild(one_node, NULL, BAD_CAST "node1",
                    BAD_CAST "content of node 1");
        xmlNewChild(one_node, NULL, BAD_CAST "node2", NULL);
        node = xmlNewChild(one_node, NULL, BAD_CAST "node3",
                           BAD_CAST "this node has attributes");
        xmlNewProp(node, BAD_CAST "attribute", BAD_CAST "yes");
        xmlNewProp(node, BAD_CAST "foo", BAD_CAST "bar");

        node = xmlNewNode(NULL, BAD_CAST "node4");
        node1 = xmlNewText(BAD_CAST "other way to create content (which is also a node)");
        xmlAddChild(node, node1);
        xmlAddChild(one_node, node);

        // Serialize only this element and write it to the file.
        if (xmlNodeDump(buf, doc, one_node, 1, 1) >= 0)
            fwrite(xmlBufferContent(buf), 1, (size_t)xmlBufferLength(buf), file);

        // Free the element and reset the buffer before the next iteration.
        xmlUnlinkNode(one_node);
        xmlFreeNode(one_node);
        xmlBufferEmpty(buf);
    }

    const char* kRootEnd = "</ROOT>";
    fwrite(kRootEnd, 1, strlen(kRootEnd), file);
}

Output file:


<?xml version="1.0" encoding="UTF-8"?>
<ROOT><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one><one>
 <node1>content of node 1</node1>
 <node2/>
 <node3 attribute="yes" foo="bar">this node has attributes</node3>
 <node4>other way to create content (which is also a node)</node4>
 </one></ROOT>
