Jsoup HTML parsing examples and document methods

  • 2020-04-01 02:24:48
  • OfStack

Parsing and traversing an HTML document

How to parse an HTML document:


String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

The parser does its best to produce a clean parse result from the HTML you provide, whether or not that HTML is well-formed. For example, it can handle:

1. Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
2. Implicit tags (e.g. a bare <td>Table data</td> is wrapped into <table><tr><td>...)
3. Creating a reliable document structure (the html element contains a head and a body, with only the appropriate elements in the head)
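
The first and third points can be checked with a small sketch. The class and method names below (LenientParseDemo, countParagraphs, hasHeadAndBody) are made up for illustration; only Jsoup.parse, select, head() and body() are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseDemo {
    // Counts the <p> elements jsoup produces from the given HTML.
    public static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    // Reports whether the parsed document has both a head and a body,
    // even when the input contained neither.
    public static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String messy = "<p>Lorem <p>Ipsum"; // two unclosed paragraphs
        System.out.println(countParagraphs(messy));
        System.out.println(hasHeadAndBody(messy));
    }
}
```

Both paragraphs are closed by the parser, so the selector finds two separate p elements.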

An object model for a document

1. The document consists of multiple Elements and TextNodes (and other auxiliary nodes).
2. The inheritance chain is as follows: Document extends Element, which extends Node. TextNode also extends Node.
3. An Element contains a list of child nodes and has one parent Element. It also provides a filtered list containing only its child elements.
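
As a rough illustration of this model, the sketch below walks the child nodes of the first p element and labels each one by type. The class name NodeModelDemo and the helper describeChildren are hypothetical; childNodes(), children() and tagName() are jsoup methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeModelDemo {
    // Labels each direct child node of the first <p> as "text" or "element:<tag>".
    public static String describeChildren(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        if (p == null) {
            return ""; // no paragraph in the input
        }
        StringBuilder sb = new StringBuilder();
        for (Node child : p.childNodes()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            if (child instanceof TextNode) {
                sb.append("text");
            } else if (child instanceof Element) {
                sb.append("element:").append(((Element) child).tagName());
            } else {
                sb.append("other");
            }
        }
        return sb.toString();
    }

    // Counts only the child elements, ignoring text nodes.
    public static int countChildElements(String html) {
        Element p = Jsoup.parse(html).select("p").first();
        return p == null ? 0 : p.children().size();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b> world</p>";
        System.out.println(describeChildren(html));
        System.out.println(countChildElements(html));
    }
}
```

childNodes() returns every node, including text, while children() is the filtered list of elements only.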

Load a Document from a URL

The problem
You need to fetch and parse an HTML document from a website and find the relevant data in it.

The solution
Use the Jsoup.connect(String url) method:


Document doc = Jsoup.connect("http://www.jb51.net/").get();
String title = doc.title();

Description
The connect(String url) method creates a new Connection, and get() retrieves and parses an HTML file. If an error occurs while fetching HTML from this URL, an IOException is thrown and should be handled appropriately.
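
One possible way to handle the IOException is to fall back to a default value, as in this minimal sketch. The class FetchTitleDemo, the helper fetchTitle and the fallback parameter are assumptions made for illustration, not part of jsoup.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchTitleDemo {
    // Fetches the page at url and returns its title, or the fallback
    // when the fetch fails with an IOException.
    public static String fetchTitle(String url, String fallback) {
        try {
            Document doc = Jsoup.connect(url).timeout(3000).get();
            return doc.title();
        } catch (IOException e) {
            // Network errors, unknown hosts, and HTTP error statuses end up here.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://www.jb51.net/", "(unavailable)"));
    }
}
```

Whether a fallback, a retry, or rethrowing is appropriate depends on the application.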

The Connection interface is designed for method chaining, so you can build up special requests, as follows:


Document doc = Jsoup.connect("http://www.jb51.net")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

This method only supports web URLs (the HTTP and HTTPS protocols); if you need to load from a file, use parse(File in, String charsetName) instead.

Load a document from a file

The problem
There is an HTML file on the local hard disk that needs to be parsed to extract data or make changes.

The solution
Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:


File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Description
The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, it throws an IOException, which should be handled appropriately.
The baseUri parameter is used to resolve relative URLs in the file. If you do not need that, you can pass in an empty string.
There is also a method parse(File in, String charsetName) that uses the path of the file as the baseUri. This is useful if you are working on a site stored on the local file system and the relative links in it also point to that file system.
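
To see what the baseUri does, the following sketch writes a small HTML snippet to a temporary file, parses it with a base URI, and resolves a relative link. BaseUriDemo and firstAbsoluteLink are illustrative names; the temporary file stands in for a real file such as /tmp/input.html.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Parses the HTML from a temporary file using the given base URI and
    // returns the absolute URL of the first link, or "" if there is none.
    public static String firstAbsoluteLink(String html, String baseUri) {
        File input = null;
        try {
            input = File.createTempFile("jsoup-demo", ".html");
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            Document doc = Jsoup.parse(input, "UTF-8", baseUri);
            Element link = doc.select("a[href]").first();
            return link == null ? "" : link.absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (input != null) {
                input.delete(); // clean up the temporary file
            }
        }
    }

    public static void main(String[] args) {
        String html = "<a href='/about.html'>About</a>";
        System.out.println(firstAbsoluteLink(html, "http://www.jb51.net/"));
    }
}
```

The relative href is resolved against the base URI when absUrl is called; attr("href") would still return the raw value from the file.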


Use DOM methods to traverse a document

The problem
You have an HTML document that you want to extract data from, and you know its general structure.

The solution
Once the HTML is parsed into a Document, you can work with it using DOM-like methods. Sample code:


File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Description
Element objects provide a range of DOM-like methods to find elements and to extract and manipulate their data. The details are as follows:
Finding elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

Element data
attr(String key) gets an attribute; attr(String key, String value) sets an attribute
attributes() gets all the attributes
id(), className() and classNames()
text() gets the text content; text(String value) sets the text content
html() gets the HTML inside the element; html(String value) sets the HTML inside the element
outerHtml() gets the HTML of the element, including its own tag
data() gets the data content (for example, of script and style tags)
tag() and tagName()

Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
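
A few of the methods listed above can be combined as in this sketch. The class DomMethodsDemo, its helpers, and the "content" and "list" ids are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomMethodsDemo {
    // Finds the element with id "content" and returns the href of its first link.
    public static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        if (content == null) {
            return ""; // getElementById returns null when nothing matches
        }
        Elements links = content.getElementsByTag("a");
        return links.isEmpty() ? "" : links.first().attr("href");
    }

    // Appends a new <li> with the given text to the element with id "list"
    // and returns how many child elements the list has afterwards.
    public static int appendItem(String html, String text) {
        Document doc = Jsoup.parse(html);
        Element list = doc.getElementById("list");
        if (list == null) {
            return 0;
        }
        list.appendElement("li").text(text);
        return list.children().size();
    }

    public static void main(String[] args) {
        System.out.println(firstLinkHref("<div id='content'><a href='/a'>A</a></div>"));
        System.out.println(appendItem("<ul id='list'><li>one</li></ul>", "two"));
    }
}
```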


Use the selector syntax to find elements
The problem
You want to use syntax similar to CSS or jQuery to find and manipulate elements.

The solution
Use the Element.select(String selector) and Elements.select(String selector) methods:


File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");
Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png
Element masthead = doc.select("div.masthead").first(); // first div with class="masthead"
Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class "r"

Description
jsoup elements support a selector syntax similar to CSS (or jQuery), which allows very powerful and flexible lookups.
The select method is available on Document, Element, and Elements objects. It is context-dependent, so you can filter by selecting from a specific element or by chaining select calls.
The select method returns an Elements collection, which provides a set of methods to extract and process the results.
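
The context-dependent behaviour can be sketched by chaining two select calls: the first narrows the document to some containers, the second searches only inside them. ChainedSelectDemo and countLinksInside are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ChainedSelectDemo {
    // Counts the links with an href that sit inside elements
    // matching containerSelector.
    public static int countLinksInside(String html, String containerSelector) {
        Document doc = Jsoup.parse(html);
        Elements containers = doc.select(containerSelector); // first narrow the context
        Elements links = containers.select("a[href]");       // then select within it
        return links.size();
    }

    public static void main(String[] args) {
        String html = "<div class='content'><a href='/a'>A</a><a>no href</a></div>"
                + "<div><a href='/b'>B</a></div>";
        System.out.println(countLinksInside(html, "div.content"));
        System.out.println(countLinksInside(html, "div"));
    }
}
```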

Selector overview
tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: find elements that have an attribute, e.g. [href]
[^attr]: find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements

Selector combinations
el#id: elements with an ID, e.g. div#logo
el.class: elements with a class, e.g. div.masthead
el[attr]: elements with an attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: elements that descend from an ancestor, e.g. .body p finds all p elements anywhere under an element with class "body"
parent > child: direct children of a parent element, e.g. div.content > p finds p elements; body > * finds all direct children of the body tag
siblingA + siblingB: finds a sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: groups multiple selectors and finds the unique elements that match any of them, e.g. div.masthead, div.logo
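
The difference between the combinators is easiest to see on a small document. In this sketch, CombinatorDemo, SAMPLE and count are illustrative names; only Jsoup.parse and select come from jsoup.

```java
import org.jsoup.Jsoup;

public class CombinatorDemo {
    // A tiny document: a content div with one direct and one nested paragraph,
    // followed by a heading and two sibling paragraphs.
    public static final String SAMPLE =
            "<div class='content'><p>one</p><div><p>nested</p></div></div>"
            + "<h1>Title</h1><p>after1</p><p>after2</p>";

    // Returns how many elements in the HTML match the selector.
    public static int count(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "div.content p"));   // descendants
        System.out.println(count(SAMPLE, "div.content > p")); // direct children only
        System.out.println(count(SAMPLE, "h1 + p"));          // immediately following sibling
        System.out.println(count(SAMPLE, "h1 ~ p"));          // all following siblings
        System.out.println(count(SAMPLE, "h1, p"));           // either selector
    }
}
```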

Pseudo selectors
:lt(n): find elements whose sibling index (that is, their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) matches td elements in the first three positions under their parent
:gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n): find elements whose sibling index is equal to n, e.g. form input:eq(1) matches an input inside a form whose sibling index is 1
:has(selector): find elements that contain elements matching the selector, e.g. div:has(p) matches divs that contain a p element
:not(selector): find elements that do not match the selector, e.g. div:not(.logo) matches all divs that do not have the class "logo"
:contains(text): find elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression
Note: the pseudo-selector indexes above are zero-based, that is, the first element has index 0, the second has index 1, and so on.
See the Selector API reference for more details.
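
The sketch below tries several of these pseudo selectors on a small made-up document and returns the text of each match. PseudoSelectorDemo, SAMPLE and texts are illustrative names; eachText() is an Elements method that collects the text of every matched element.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class PseudoSelectorDemo {
    // One table row with four cells, a div wrapping a paragraph, and a logo div.
    public static final String SAMPLE =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div><p>jsoup rocks</p></div><div class='logo'>Login</div>";

    // Returns the text of every element matching the selector.
    public static List<String> texts(String html, String selector) {
        return Jsoup.parse(html).select(selector).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts(SAMPLE, "td:lt(3)"));               // indexes 0, 1 and 2
        System.out.println(texts(SAMPLE, "td:eq(1)"));               // index 1 only
        System.out.println(texts(SAMPLE, "div:has(p)"));             // div that contains a p
        System.out.println(texts(SAMPLE, "div:not(.logo)"));         // div without class logo
        System.out.println(texts(SAMPLE, "p:contains(JSOUP)"));      // case-insensitive match
        System.out.println(texts(SAMPLE, "div:containsOwn(jsoup)")); // text is in the p, not the div
        System.out.println(texts(SAMPLE, "div:matches((?i)login)")); // regex on the element text
    }
}
```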


Extract attributes, text, and HTML from elements

The problem
After parsing to get a Document instance object and finding some elements, you want to get the data in those elements.

The solution
To get the value of an attribute, use the Node.attr(String key) method
To get the text of an element, use the Element.text() method
To get HTML, use the Element.html() or Node.outerHtml() method as appropriate
Example:


String html = "<p>An <a href='//www.jb51.net/'><b>www.jb51.net</b></a> link.</p>";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first(); // the first a element
String text = doc.body().text(); // "An www.jb51.net link."
String linkHref = link.attr("href"); // "//www.jb51.net/"
String linkText = link.text(); // "www.jb51.net"
String linkOuterH = link.outerHtml(); // "<a href="//www.jb51.net/"><b>www.jb51.net</b></a>"
String linkInnerH = link.html(); // "<b>www.jb51.net</b>"

Description
The methods above are the core of the element data access API. There are other methods you can use:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
These accessor methods all have corresponding setter methods to change the data.
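
A short sketch of these accessors, together with two of the modifying methods (removeClass and addClass), is shown below. AccessorDemo, summarize and swapClass are hypothetical helper names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AccessorDemo {
    // Joins id, tag name, class attribute and a hasClass check for the first link.
    public static String summarize(String html, String className) {
        Element link = Jsoup.parse(html).select("a").first();
        if (link == null) {
            return "";
        }
        return link.id() + "|" + link.tagName() + "|" + link.className()
                + "|" + link.hasClass(className);
    }

    // Replaces one class name with another on the first link and
    // returns the resulting class attribute.
    public static String swapClass(String html, String oldName, String newName) {
        Element link = Jsoup.parse(html).select("a").first();
        if (link == null) {
            return "";
        }
        link.removeClass(oldName).addClass(newName);
        return link.className();
    }

    public static void main(String[] args) {
        String html = "<a id='home' class='nav active' href='/'>Home</a>";
        System.out.println(summarize(html, "active"));
        System.out.println(swapClass(html, "active", "visited"));
    }
}
```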


Sample program: list all the links
This sample program demonstrates how to fetch a page from a URL, extract its links, images, and other pointers, and examine their URLs and text.
To run the program, specify a URL as its only argument.


package org.jsoup.examples;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");
        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }
        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(),link.attr("abs:href"), link.attr("rel"));
        }
        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}
org/jsoup/examples/ListLinks.java

