A detailed guide to simple_html_dom, a PHP class library for parsing html

2020-07-21 07:07:05
OfStack

Download address: https://github.com/samacs/simple_html_dom

The parser does more than help us validate html documents; it also copes well with html that does not conform to the W3C standard. It uses jQuery-like element selectors to locate elements by id, class, tag and so on, and it provides functions for adding, deleting and modifying nodes in the document tree. Of course, even a powerful html DOM parser isn't perfect: you need to be very careful about memory consumption while using it. But don't worry; at the end of this article, I'll show you how to avoid consuming too much memory.
Getting started
After including the class file, there are three ways to load a document with this class:
Load the html document from url
Load the html document from the string
Load the html document from the file


<?php
//  Create a new DOM instance 
$html = new simple_html_dom();

//  Load from a url 
$html->load_file('https://www.ofstack.com');

//  Load from a string 
$html->load('<html><body>Demo of loading an html document from a string</body></html>');

//  Load from a file 
$html->load_file('path/file/test.html');
?>

Note that when you load the html document from a string, you have to download it from the network yourself first. It is recommended to use cURL to fetch the html document and then load it into the DOM.
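The cURL approach mentioned above could look roughly like the sketch below. The helper name fetch_html() is hypothetical (it is not part of the library), and the file name simple_html_dom.php is an assumption about where the class file was saved.

```php
<?php
// Assumption: the class file is named simple_html_dom.php and sits next to this script.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper (not part of simple_html_dom): download a page with cURL.
 * Returns the response body, or null if the request fails.
 */
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

$body = fetch_html('https://www.ofstack.com');
if ($body !== null) {
    $html = new simple_html_dom();
    $html->load($body);
    // ... query $html with find() here ...
}
?>
```

Fetching the page separately keeps timeouts and redirects under your control, and the parser is only created when the download actually succeeded.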
Finding html elements
You can use the find function to look up elements in an html document. It returns an array of element objects (or a single object when an index is given), and we access these objects through the functions and properties of the html DOM parsing class. Here are some examples:


<?php

//  Find all hyperlink elements in the html document 
$a = $html->find('a');

//  Find the (N)th hyperlink in the document (zero based); returns null if it is not found 
$a = $html->find('a', 0);

//  Find the div element whose id is main 
$main = $html->find('div[id=main]',0);

//  Find all div elements that have an id attribute 
$divs = $html->find('div[id]');

//  Find all elements that have an id attribute 
$divs = $html->find('[id]');
?>

You can also locate elements with jQuery-style selectors:


<?php
//  Find the element with id=container 
$ret = $html->find('#container');

//  Find all elements with class=foo 
$ret = $html->find('.foo');

//  Find several kinds of html tags at once 
$ret = $html->find('a, img');

//  Selectors can be combined in the same way 
$ret = $html->find('a[title], img[title]');
?>

The parser also supports searching for descendant elements:


<?php

//  Find all li items inside ul lists 
$ret = $html->find('ul li');

//  Find the li items with class=selected inside ul lists 
$ret = $html->find('ul li.selected');

?>

If you find selectors awkward for this, the built-in functions make it easy to reach the parent, child and adjacent elements of an element:


<?php
//  Return the parent element 
$e->parent();

//  Return an array of child elements 
$e->children();

//  Return the child element at the given index 
$e->children(0);

//  Return the first child element 
$e->first_child();

//  Return the last child element 
$e->last_child();

//  Return the previous adjacent element 
$e->prev_sibling();

//  Return the next adjacent element 
$e->next_sibling();
?>

Working with element attributes
Attribute selectors support a few simple, regular-expression-like matching operators:
[attribute] - selects html elements that have the specified attribute
[attribute=value] - selects html elements whose attribute equals the specified value
[attribute!=value] - selects html elements whose attribute does not equal the specified value
[attribute^=value] - selects html elements whose attribute starts with the specified value
[attribute$=value] - selects html elements whose attribute ends with the specified value
[attribute*=value] - selects html elements whose attribute contains the specified value
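As a small illustration of the operators listed above, the sketch below runs them against a made-up snippet of three anchors. The markup and variable names are invented for the example, and the file name simple_html_dom.php is an assumption.

```php
<?php
// Assumption: the class file is named simple_html_dom.php.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load(
    '<a href="https://example.com/a.pdf" class="doc">A</a>' .
    '<a href="/local/page.html">B</a>' .
    '<a name="top">C</a>'
);

$withHref = $html->find('a[href]');        // anchors that have an href attribute
$absolute = $html->find('a[href^=https]'); // href starts with "https"
$pdfs     = $html->find('a[href$=pdf]');   // href ends with "pdf"
$local    = $html->find('a[href*=local]'); // href contains "local"

// Print how many anchors each selector matched.
echo count($withHref), ' ', count($absolute), ' ', count($pdfs), ' ', count($local), "\n";
?>
```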
Accessing element attributes in the parser
In the DOM, element attributes are exposed as properties of the element object:


<?php
//  Here the href attribute of the anchor $a is assigned to the $link variable 
$link = $a->href;
?>

Or:


<?php
$link = $html->find('a',0)->href;
?>

Each object has four basic properties:
tag - returns the tag name
innertext - returns the innerHTML
outertext - returns the outerHTML
plaintext - returns the text inside the html tag
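To make the difference between the four properties concrete, here is a minimal sketch on an invented one-line document. The markup is made up for the example and the file name simple_html_dom.php is an assumption.

```php
<?php
// Assumption: the class file is named simple_html_dom.php.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load('<div id="main"><b>Hello</b> world</div>');

$div = $html->find('div[id=main]', 0);

echo $div->tag, "\n";       // the tag name of the element
echo $div->innertext, "\n"; // the markup between the opening and closing tags
echo $div->outertext, "\n"; // the element including its own tags
echo $div->plaintext, "\n"; // the text content with the tags stripped
?>
```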
Editing elements in the parser
Editing an element's attributes works much like reading them:


<?php
//  Assign a new value to the href attribute of the anchor $a 
$a->href = 'https://www.ofstack.com';

//  Remove the href attribute 
$a->href = null;

//  Check whether the href attribute exists 
if(isset($a->href)) {
    // code 
}
?>

The parser has no dedicated methods for adding or removing elements, but the same effect can be achieved by rewriting outertext as follows:


<?php
//  Wrap an element 
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

//  Remove an element 
$e->outertext = '';

//  Append an element after it 
$e->outertext = $e->outertext . '<div>foo</div>';

//  Insert an element before it 
$e->outertext = '<div>foo</div>' . $e->outertext;
?>
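The following sketch applies the wrapping trick to an invented paragraph and prints the result. It assumes the class file is named simple_html_dom.php. As far as I understand the library, assigning to outertext only changes the generated output; the element tree is not re-parsed, so a later find() would not see the new wrapper unless the output is loaded again.

```php
<?php
// Assumption: the class file is named simple_html_dom.php.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load('<p id="x">text</p>');

// Wrap the paragraph in a div by rewriting its outer markup.
$p = $html->find('p', 0);
$p->outertext = '<div class="wrap">' . $p->outertext . '</div>';

// Converting the document to a string yields the modified markup.
$result = (string) $html;
echo $result, "\n";
?>
```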

Saving the modified html DOM document is also very simple:


<?php
$doc = $html;
//  Output the document (the object is converted to its html string) 
echo $doc;
?>

How do I prevent the parser from consuming too much memory
At the beginning of this article, I mentioned that the Simple HTML DOM parser can consume too much memory. If a php script takes up too much memory, it can cause a series of serious problems, such as a site that stops responding. The solution is simple: after the parser has loaded an html document and you have finished using it, remember to clean up the object. Of course, don't take this too far. If you only load two or three documents, it makes little difference whether you clean up or not. But when you load 5, 10 or more documents, you should definitely free the memory each time you finish with one:


<?php
//  Free the memory held by the parser once you are done with the document 
$html->clear();
unset($html);
?>
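When several documents are processed in a row, the cleanup can be done once per iteration. The sketch below uses two invented html strings in place of real pages and assumes the class file is named simple_html_dom.php.

```php
<?php
// Assumption: the class file is named simple_html_dom.php.
require_once 'simple_html_dom.php';

// Hypothetical documents standing in for pages that were fetched elsewhere.
$documents = [
    '<a href="/one">1</a>',
    '<a href="/two">2</a>',
];

$links = [];
foreach ($documents as $source) {
    $html = new simple_html_dom();
    $html->load($source);

    // Copy out the plain values we need before the parser is released.
    foreach ($html->find('a') as $a) {
        $links[] = $a->href;
    }

    // Release the parser's internal references so PHP can reclaim the memory.
    $html->clear();
    unset($html);
}

print_r($links);
?>
```

Copying the values into a plain array before calling clear() matters, because element objects obtained from find() should not be used after the parser has been cleared.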