Using the open source library Jsoup to parse HTML files in Java, with examples

  • 2020-04-01 03:31:16
  • OfStack

HTML is the heart of the web, and every page you see on the Internet is HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or some other web technology. Your browser parses the HTML and renders it for you. But what if you need to parse an HTML document in a Java program and look up certain elements, tags, or attributes, or check whether a particular element exists? If you have been programming in Java for years, you have probably parsed XML with a DOM or SAX parser, but chances are you have never parsed HTML. Ironically, a Java application that does not involve servlets or other Java web technologies rarely needs to parse HTML documents. To make matters worse, the core JDK offers basic HTTP support but no general-purpose HTML parser, at least not that I know of. That is why, when it comes to parsing HTML files, many Java programmers have to Google how to get the value of an HTML tag in Java.

When I had this need, I was sure there would be some open source library for it, but I did not expect Jsoup to be this cool and fully featured. Not only does it read and parse HTML documents, it also lets you extract any element from an HTML file along with its attributes and CSS classes, and modify them. With Jsoup you can do almost anything with an HTML document. We'll see an example of how to download and parse an HTML page in Java from the Google home page or any other URL.

What is the JSoup library

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers like Chrome and Firefox do. Here are some useful features of the Jsoup library:

1. Jsoup can fetch and parse HTML from a URL, a file, or a string.
2. Jsoup can find and extract data using DOM traversal or CSS selectors.
3. Jsoup can modify HTML elements, attributes, and text.
4. Jsoup can clean user-submitted content against a safe whitelist of allowed tags and attributes to prevent XSS attacks.
5. Jsoup can output tidy HTML.
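The features listed above can be tried out with a small sketch like the one below. The class name JsoupFeatureSketch, the SAMPLE markup, and the helper method names are made up for this illustration. The cleaning step assumes Jsoup 1.14.2 or later, where the whitelist class is called Safelist; older releases expose the same idea as Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatureSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String SAMPLE =
            "<html><head><title>Demo</title></head>"
          + "<body><div id=\"main\"><a href=\"/docs\">Docs</a></div></body></html>";

    // Features 1 and 2: parse a string, then extract data with a CSS selector.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("div#main a[href]");
        return link == null ? "" : link.text();
    }

    // Feature 3: change an attribute and the text of an element.
    static String relabelLink(String html, String newText) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a");
        if (link == null) {
            return "";
        }
        link.attr("rel", "nofollow").text(newText);
        return link.outerHtml();
    }

    // Feature 4: drop every tag that is not on the allow-list.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(firstLinkText(SAMPLE));
        System.out.println(relabelLink(SAMPLE, "Documentation"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Safelist.basic() keeps simple text formatting tags and removes the rest, so the script element in the last call does not survive cleaning.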

Jsoup is designed to deal with all varieties of HTML found in the wild, from valid, well-formed documents to invalid tag soup. One of Jsoup's core strengths is its robustness.
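To get a feel for that robustness, here is a minimal sketch that feeds Jsoup two pieces of sloppy input: paragraphs that are never closed, and bare text without any html, head, or body tags. The class and method names are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class RobustnessSketch {

    // Counts <p> elements after parsing; the parser closes unclosed paragraphs.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Returns the text of the first <p>, or "" if the input has none.
    static String firstParagraphText(String html) {
        Elements paragraphs = Jsoup.parse(html).select("p");
        return paragraphs.isEmpty() ? "" : paragraphs.first().text();
    }

    // Even bare text is wrapped in a document skeleton with head and body.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String unclosed = "<p>Java <p>Scala";
        System.out.println("paragraphs : " + paragraphCount(unclosed));
        System.out.println("first      : " + firstParagraphText(unclosed));
        System.out.println("skeleton   : " + hasHeadAndBody("just some text"));
    }
}
```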

HTML parsing in Java using Jsoup

In this tutorial on Java HTML parsing, you'll see three different examples of parsing and traversing HTML in Java using Jsoup. In the first example, we parse an HTML string supplied as a Java string literal. In the second example, we download an HTML document from the web, and in the third, we load a sample HTML file called login.html for parsing. This file contains a title tag and, inside the body, a div that wraps a form. The form has input tags for the username and password, plus submit and reset buttons. It is valid HTML, that is, its tags are properly nested and its attributes properly quoted. Here is our sample HTML file:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">  
<html>  
<head>  
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">  
<title>Login Page</title>  
</head>  
<body>  
<div id="login" class="simple" >  
<form action="login.do">  
Username : <input id="username" type="text" /><br>  
Password : <input id="password" type="password" /><br>  
<input id="submit" type="submit" />  
<input id="reset" type="reset" />  
</form>  
</div>  
</body>  
</html> 
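As a quick illustration of what can be pulled out of this file, the sketch below runs a few CSS selectors against it. To keep the example self-contained it uses an inline copy of the markup instead of reading login.html from disk; the class name LoginPageQuery and its helper methods are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LoginPageQuery {

    // Inline copy of the relevant parts of login.html shown above.
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
          + "<div id=\"login\" class=\"simple\">"
          + "<form action=\"login.do\">"
          + "Username : <input id=\"username\" type=\"text\" /><br>"
          + "Password : <input id=\"password\" type=\"password\" /><br>"
          + "<input id=\"submit\" type=\"submit\" />"
          + "<input id=\"reset\" type=\"reset\" />"
          + "</form></div></body></html>";

    // Number of input tags inside the form of the login div.
    static int countInputs(String html) {
        return Jsoup.parse(html).select("div#login form input").size();
    }

    // The id of the password field, found by its type attribute.
    static String passwordFieldId(String html) {
        Document doc = Jsoup.parse(html);
        Element field = doc.selectFirst("input[type=password]");
        return field == null ? "" : field.id();
    }

    // The action attribute of the first form, or "" if there is no form.
    static String formAction(String html) {
        return Jsoup.parse(html).select("form").attr("action");
    }

    public static void main(String[] args) {
        System.out.println("inputs      : " + countInputs(LOGIN_HTML));
        System.out.println("password id : " + passwordFieldId(LOGIN_HTML));
        System.out.println("form action : " + formAction(LOGIN_HTML));
    }
}
```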

Using Jsoup to parse HTML is as simple as calling its static method Jsoup.parse() and passing it your HTML string. Jsoup provides several overloaded parse() methods that can read HTML from a String, a File, a URL, or an InputStream. If the HTML is not UTF-8 encoded, you can also specify the character encoding so that the file is read correctly. The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element, which in turn extends Node; TextNode also extends Node. As long as you pass in a non-null string, you are guaranteed a successful, sensible parse and a Document containing at least a head and a body element. Once you have the Document, you can call the appropriate methods on Document and its superclasses Element and Node to get the data you want.
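The following sketch exercises four of these overloads: parse(String), parse(String, String baseUri), parse(File, String charsetName), and parse(InputStream, String charsetName, String baseUri). The class name ParseOverloads, the sample markup, and the example.com base URI are placeholders chosen for the illustration; the helpers wrap IOException in UncheckedIOException only to keep the calls short.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOverloads {

    // Hypothetical sample markup with a relative link.
    static final String HTML =
            "<html><head><title>Overloads</title></head>"
          + "<body><a href=\"page.html\">link</a></body></html>";

    // parse(String): no base URI, so relative links cannot be resolved.
    static String titleFromString(String html) {
        return Jsoup.parse(html).title();
    }

    // parse(String, String): the base URI lets absUrl() resolve relative links.
    static String absoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.selectFirst("a").absUrl("href");
    }

    // parse(File, String): reads a file with an explicit character encoding.
    static String titleFromFile(String html) {
        try {
            File tmp = File.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
                return Jsoup.parse(tmp, "UTF-8").title();
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // parse(InputStream, String, String): stream, encoding, and base URI.
    static String titleFromStream(String html) {
        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            return Jsoup.parse(in, "UTF-8", "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleFromString(HTML));
        System.out.println(absoluteLink(HTML, "http://example.com/dir/"));
        System.out.println(titleFromFile(HTML));
        System.out.println(titleFromStream(HTML));
    }
}
```

Note that the second argument of the two-argument String overload is a base URI, not a character set, which is why passing a file name and an encoding to it does not read the file.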

Java program that parses HTML documents

Below is a complete Java program that parses an HTML string, an HTML page downloaded from the web, and an HTML file from the local file system. You can run it from Eclipse, from another IDE, or from the command line; in every case the Jsoup JAR must be on the classpath. In Eclipse it is easy: copy the code, create a new Java project, right-click the src folder, and paste. Eclipse creates the correct package and a Java source file with the matching name, so the effort is minimal. If you already have a sample Java project, it takes just one step.

The program shows three different examples of parsing and traversing HTML. In the first case, we parse a string of HTML content directly; in the second, we parse an HTML page downloaded from a URL; and in the third, we load an HTML document from the local file system and parse it. Both the first and third examples use the parse method to get a Document object, which you can query to extract any tag or attribute value. In the second example, we use the Jsoup.connect method, which creates a connection to the URL; calling get() on that connection downloads the HTML and parses it. This also returns a Document, which can be used for subsequent queries to get the value of any tag or attribute.


import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files, extract elements and manipulate data using
 * DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // Jsoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Jsoup Example 2 - Reading an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            System.out.println("Jsoup Can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            // Network failure: report it and continue with the next example
            e.printStackTrace();
        }

        // Jsoup Example 3 - Parsing an HTML file in Java
        // Wrong: parse(String, String) treats "login.html" as the HTML itself
        // and "ISO-8859-1" as the base URI:
        // Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1");
        try {
            // Right: pass a File together with its character encoding
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            title = htmlFile.title();
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // getting the class from an HTML element

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + title);
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // File missing or unreadable
            e.printStackTrace();
        }
    }
}

Output:


Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
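The URL example above relies on the default connection settings. In practice you may want to set a user agent and a timeout before fetching, which might look like the sketch below. The class name FetchSketch, the user agent string, and the 5 second timeout are arbitrary choices for this illustration, not recommendations from the Jsoup project.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class FetchSketch {

    // Builds a configured connection; no network traffic happens here.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JsoupSketch/1.0)") // made-up user agent
                .timeout(5000)            // milliseconds
                .followRedirects(true);
    }

    // Executes the GET request and returns the page title,
    // or null if the page could not be fetched.
    static String fetchTitle(String url) {
        try {
            return configure(url).get().title();
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("title : " + fetchTitle("http://google.com/"));
    }
}
```

Separating configure() from fetchTitle() keeps the settings in one place and makes them easy to inspect without touching the network.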

The good thing about Jsoup is that it is very robust. The Jsoup HTML parser makes every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed. It handles unclosed tags (for example, <p>Java <p>Scala parses to <p>Java</p> <p>Scala</p>) and implicit tags (for example, a naked <td>Java is Great</td> is wrapped into a <table><tr><td>), and it reliably creates the document structure (an html element containing a head and a body, with only appropriate elements in the head).

That is how to parse HTML in Java. Jsoup is an excellent, robust open source library that makes it fairly easy to read HTML documents, body fragments, and HTML strings, and to parse HTML content directly from the web. In this article, we learned how to get the value of a specific HTML tag in Java: in the first example we extracted the title and the h1 tag as text, and in the third example we read an attribute value from an HTML tag by extracting the CSS class of a div. Besides the powerful jQuery-style html.body().getElementsByTag("h1").text(), which you can use to extract any HTML tag, Jsoup provides convenience methods such as Document.title() and Element.className(), which give you quick access to the title and the CSS class. Hopefully you will enjoy working with Jsoup, and we'll see more examples of this API soon.

