jsoup Java HTML Parser With Examples

jsoup is an open source Java HTML parser that we can use to parse HTML and extract useful information. You can also think of jsoup as a web page scraping tool for the Java programming language.

jsoup

The jsoup API can fetch HTML from a URL, or parse it from an HTML string or an HTML file.

Some of the cool features of the jsoup API are:

  1. Scrape HTML from a URL, or read it from a String or a file.
  2. Extract data from HTML through DOM-based traversal or CSS-like selectors.
  3. Edit HTML as well as read it.
  4. It is self-contained; we don’t need any other jars to use it.

You can download the jsoup jar from its website or, if you are using Maven, add the dependency below.


<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.8.1</version>
</dependency>

Let’s look into different jsoup features one by one.

jsoup example to load HTML document from URL

We can do it with a one-liner, as shown below.


org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("https://www.journaldev.com").get();
System.out.println(doc.html()); // prints HTML data

jsoup example to parse HTML document from String

If we have the HTML data as a String, we can use the code below to parse it.


String source = "<html><head><title>Jsoup Example</title></head>"
		+ "<body><h1>Welcome to JournalDev!!</h1><br />"
		+ "</body></html>";
Document doc = Jsoup.parse(source);

jsoup example to load a document from file

If the HTML data is saved in a file, we can load it using the code below.


Document doc = Jsoup.parse(new File("data.html"), "UTF-8");
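
When a saved page contains relative links, it can help to tell jsoup where the page originally lived. Below is a minimal sketch using the three-argument Jsoup.parse(File, String, String) overload. The temp file, the class name and the base URI https://example.com/ are placeholders I made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFileBaseUriSketch {

    // Parses the file against the given base URI and returns the first link as an absolute URL.
    // Assumes the file contains at least one <a href="..."> element.
    static String firstAbsoluteLink(File file, String baseUri) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8", baseUri);
        return doc.select("a[href]").first().absUrl("href");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: a temp file holding a single relative link.
        File tmp = File.createTempFile("jsoup-sketch", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "<a href=\"/about\">About</a>".getBytes(StandardCharsets.UTF_8));

        // The relative href is resolved against the base URI.
        System.out.println(firstAbsoluteLink(tmp, "https://example.com/"));
    }
}
```

Without a base URI, absUrl("href") has nothing to resolve a relative link against and returns an empty string.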

Parsing HTML Body Fragment

One of the best features of jsoup is that if we supply an HTML body fragment, it tries hard to generate valid HTML for us, as shown in the example below.


String html = "<div><p>Test Data</p>";
Document doc1 = Jsoup.parseBodyFragment(html);
System.out.println(doc1.html());

The above code prints the following HTML.


<html>
 <head></head>
 <body>
  <div>
   <p>Test Data</p>
  </div>
 </body>
</html>

Let’s now look at different methods to extract data from HTML.

Jsoup DOM Methods

jsoup parses HTML into a Document, similar to the HTML DOM. A document consists of elements, and there are many useful methods that we can use to find them. Some of these methods in Document are:

  1. getElementById(String id)
  2. getElementsByTag(String tag)
  3. getElementsByClass(String className)
  4. getElementsByAttribute(String key)
  5. siblingElements(), firstElementSibling(), lastElementSibling() etc.
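
To see these lookup methods side by side, here is a small sketch that runs them against an inline HTML string instead of a live page. The markup, ids and class names are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDomLookupSketch {

    // Hypothetical markup used only for this illustration.
    static Document sample() {
        return Jsoup.parse("<html><body>"
                + "<div id=\"content\">"
                + "<p class=\"intro\">First</p>"
                + "<p>Second</p>"
                + "<a href=\"/home\">Home</a>"
                + "</div></body></html>");
    }

    public static void main(String[] args) {
        Document doc = sample();

        // Look up a single element by id, then search inside it by tag name.
        Element content = doc.getElementById("content");
        Elements paragraphs = content.getElementsByTag("p");
        System.out.println("paragraphs: " + paragraphs.size());

        // Look up by class name and by attribute name across the whole document.
        System.out.println("intro text: " + doc.getElementsByClass("intro").first().text());
        System.out.println("with href: " + doc.getElementsByAttribute("href").size());

        // Sibling navigation starts from an element; the element itself is not in siblingElements().
        Element firstParagraph = paragraphs.first();
        System.out.println("siblings: " + firstParagraph.siblingElements().size());
        System.out.println("last sibling tag: " + firstParagraph.lastElementSibling().tagName());
    }
}
```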

An Element has attributes and content, so there are methods for reading and changing element data too.

  1. attr(String key) to get and attr(String key, String value) to set attributes
  2. id(), className() and classNames()
  3. text() to get and text(String value) to set the text content
  4. html() to get and html(String value) to set the inner HTML content
  5. tag() and tagName()
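
The getters and setters listed above can be tried on a single element. This is only a sketch; the link markup and the replacement values are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JsoupElementDataSketch {

    // Hypothetical link element used only for this illustration.
    static Element sampleLink() {
        return Jsoup.parse("<a id=\"home\" class=\"nav main\" href=\"/index.html\">Go <b>home</b></a>")
                .getElementById("home");
    }

    public static void main(String[] args) {
        Element link = sampleLink();

        // Reading element data.
        System.out.println(link.tagName());     // a
        System.out.println(link.id());          // home
        System.out.println(link.className());   // nav main
        System.out.println(link.classNames());  // the same classes as a Set
        System.out.println(link.attr("href"));  // /index.html
        System.out.println(link.text());        // Go home
        System.out.println(link.html());        // inner HTML, including the <b> tag

        // Writing element data: the setters return the element, so calls can be chained.
        link.attr("href", "/start.html").text("Start");
        System.out.println(link.outerHtml());
    }
}
```

Note that text(String) replaces all child nodes, so the nested b tag is gone after the setter call.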

There are some methods for manipulating HTML data as well.

  1. append(String html), prepend(String html)
  2. appendText(String text), prependText(String text)
  3. appendElement(String tagName), prependElement(String tagName)
  4. html(String value)
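
Here is a short sketch of the manipulation methods applied to one container element. The markup and the inserted strings are invented; the point is the difference between the HTML variants, which parse their argument, and the text variants, which escape it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupManipulateSketch {

    // Builds a hypothetical <div id="box"> and applies each manipulation method once.
    static Element buildBox() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>Middle</p></div>");
        Element box = doc.getElementById("box");

        box.prepend("<h2>Top</h2>");             // parsed as HTML, inserted as first child
        box.append("<p>Bottom</p>");             // parsed as HTML, inserted as last child
        box.appendElement("span").text("added"); // creates an empty <span>, then sets its text
        box.prependText("Note: ");               // plain text node at the start
        box.appendText(" <end>");                // plain text; '<' and '>' are escaped on output
        return box;
    }

    public static void main(String[] args) {
        System.out.println(buildBox().outerHtml());
    }
}
```

Because appendText and prependText create text nodes, they are the safer choice when the value comes from user input.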

Below is a simple example where I am using jsoup DOM methods to parse my website home page and list all the links.


package com.journaldev.jsoup;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupExtractLinks {
	public static void main(String[] args) throws IOException {
		Document doc = Jsoup.connect("https://www.journaldev.com").get();
		Element content = doc.getElementById("content");
		Elements links = content.getElementsByTag("a");
		for (Element link : links) {
		  String linkHref = link.attr("href");
		  String linkText = link.text();
		  System.out.println("Text::"+linkText+", URL::"+linkHref);
		}
	}
}

The above program produces the following output.


Text::jQuery Popup and Tooltip Window Animation Effects, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Jobin Bennett, URL::https://www.journaldev.com/author/jobin
Text::March 7, 2015, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::jQuery, URL::https://www.journaldev.com/dev/jquery
Text::jQuery Plugins, URL::https://www.journaldev.com/dev/jquery/jquery-plugins
Text::Permalink, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Apache HttpClient Example to send GET/POST HTTP Requests, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java HttpURLConnection Example to send HTTP GET/POST Requests, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::How to integrate Google reCAPTCHA in Java Web Application, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 4, 2015, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Java EE, URL::https://www.journaldev.com/dev/java/j2ee
Text::Permalink, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::JSF Spring Hibernate Integration Example Tutorial, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 3, 2015, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Hibernate, URL::https://www.journaldev.com/dev/hibernate
Text::JSF, URL::https://www.journaldev.com/dev/jsf
Text::Spring, URL::https://www.journaldev.com/dev/spring
Text::Permalink, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::JSF Spring Integration Example Tutorial, URL::https://www.journaldev.com/7112/spring-jsf-integration
Text::Oracle Webcenter Portal Framework Application – Modifying Home Page And Login/Logout Target Pages & Deploying Your Application Into Custom Portal Managed Server Instance, URL::https://www.journaldev.com/6938/oracle-webcenter-portal-framework-application-modifying-home-page-and-loginlogout-target-pages-deploying-your-application-into-custom-portal-managed-server-instance
Text::JSF and JDBC Integration Example Tutorial, URL::https://www.journaldev.com/7068/jsf-database-example-mysql-jdbc
Text::Count the Number of Triangles in Given Picture – Programmatic Solution, URL::https://www.journaldev.com/7064/count-the-number-of-triangles-in-given-picture-programmatic-solution
Text::JSF Expression Language (EL) Example Tutorial, URL::https://www.journaldev.com/7058/jsf-expression-language-jsf-el
Text::Read all Articles →, URL::https://www.journaldev.com/page/2

Jsoup selector syntax

We can also use CSS or jQuery-like syntax to find and manipulate HTML elements. Document and Element both provide select(String cssQuery) for this.

Some quick examples are:

  1. doc.select("a"): returns all “a” tag elements from the HTML.
  2. doc.select("c|if"): finds <c:if> elements.
  3. doc.select("#id1"): returns all elements with id="id1".
  4. doc.select(".cl1"): returns all elements with class="cl1".
  5. doc.select("[href]"): returns all elements with an href attribute.

We can combine selectors too; you can find more details in the Selectors API.
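
To make the idea of combining selectors concrete, here is a sketch that counts matches for a few combined queries. The markup, ids and class names are invented for the example.

```java
import org.jsoup.Jsoup;

public class JsoupSelectorComboSketch {

    // Hypothetical markup used only for this illustration.
    static final String HTML = "<div id=\"id1\" class=\"cl1\">"
            + "<a href=\"https://example.com/a.pdf\">PDF</a>"
            + "<a href=\"/local\">Local</a>"
            + "<a name=\"top\">No href</a>"
            + "</div>"
            + "<div class=\"cl2\"><a href=\"https://example.org/\">Other</a></div>";

    // Returns how many elements in the sample markup match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.cl1 a[href]"));    // 2: links with href inside div.cl1
        System.out.println(count("#id1 > a"));           // 3: direct "a" children of #id1
        System.out.println(count("a[href^=https]"));     // 2: href starts with "https"
        System.out.println(count("a[href$=.pdf]"));      // 1: href ends with ".pdf"
        System.out.println(count("a:not([href])"));      // 1: "a" elements without href
        System.out.println(count("div.cl2 a, a[name]")); // 2: union of two selectors
    }
}
```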

Let’s now look at an example where I will fetch my Google+ author URL from my website using both DOM and Selector API.


package com.journaldev.jsoup;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupFindAuthor {
	public static void main(String[] args) throws IOException {
		//journaldev.com posts have author set as below
		//<div class="g-person" data-width="350" data-href="https://plus.google.com/u/0/118104420597648001532"
		//data-layout="landscape" data-rel="author"></div>
		findAuthorUsingDOM();
		findAuthorUsingSelector();
	}
	private static void findAuthorUsingSelector() throws IOException {
		Document doc = Jsoup.connect("https://www.journaldev.com").get();
		Elements authors = doc.select("div.g-person"); //selector combination
		for(Element author : authors){
			System.out.println("Selector:: Author Google+ URL::"+author.attr("data-href"));
		}
	}
	private static void findAuthorUsingDOM() throws IOException {
		Document doc = Jsoup.connect("https://www.journaldev.com").get();
		Elements authors = doc.getElementsByClass("g-person");
		for(Element author : authors){
			System.out.println("DOM:: Author Google+ URL::"+author.attr("data-href"));
		}
	}
}

The above program prints the following output.


DOM:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532
Selector:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532

jsoup example to modify HTML

Let’s now look at a jsoup example where I parse input HTML and manipulate it.


package com.journaldev.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupModifyHTML {
	public static final String SOURCE_HTML = "<html><head><title>Jsoup Example</title></head>"
			+ "<body><h1>Welcome to JournalDev!!</h1><br />"
			+ "<div id=\"id1\">Hello</div>"
			+ "<div class=\"class1\">Pankaj</div>"
			+ "<a href=\"https://journaldev.com\">Home</a>"
			+ "<a href=\"https://wikipedia.org\">Wikipedia</a>"
			+ "</body></html>";
	public static void main(String[] args) {
		Document doc = Jsoup.parse(SOURCE_HTML);
		System.out.println("Title="+doc.title());
		//let's add attribute target="_blank" to all the links
		doc.select("a[href]").attr("target", "_blank");
		//System.out.println(doc.html());
		//change div class="class1" to class2
		doc.select("div.class1").attr("class", "class2");
		//System.out.println(doc.html());
		//change the HTML value of first h1 element
		doc.select("h1").first().html("Welcome to JournalDev.com");
		doc.select("h1").first().append("!!");
		//System.out.println(doc.html());
		//let's make Home link bold
		doc.select("a[href]").first().html("<strong>Home</strong>");
		System.out.println(doc.html());
	}
}

Please look at the above program carefully to understand what modifications are made to the input HTML string, and compare it with the final document shown in the output below.


Title=Jsoup Example
<html>
 <head>
  <title>Jsoup Example</title>
 </head>
 <body>
  <h1>Welcome to JournalDev.com!!</h1>
  <br>
  <div id="id1">
   Hello
  </div>
  <div class="class2">
   Pankaj
  </div>
  <a href="https://journaldev.com" target="_blank"><strong>Home</strong></a>
  <a href="https://wikipedia.org" target="_blank">Wikipedia</a>
 </body>
</html>

jsoup example to parse Google Search Page and Find out Results

Before I conclude this post, here is an example where I parse the first page of Google search results and fetch all the links.


package com.journaldev.jsoup;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ParsingGoogleSearch {
	public static void main(String[] args) throws IOException {
		Document doc = Jsoup.connect("https://www.google.com/search?q=java").userAgent("Mozilla/5.0").get();
		//System.out.println(doc.html());
		Elements resultsH3 = doc.select("h3.r > a");
		for (Element result : resultsH3) {
			String linkHref = result.attr("href");
			String linkText = result.text();
			// hrefs look like "/url?q=<target>&...", so skip the 7-character "/url?q=" prefix
			System.out.println("Text::" + linkText + ", URL::" + linkHref.substring(7, linkHref.indexOf("&")));
		}
	}
}

It prints the following output.


Text::Download Free Java Software, URL::https://java.com/download
Text::java.com: Java + You, URL::https://www.java.com/
Text::Oracle Technology Network for Java Developers | Oracle ..., URL::https://www.oracle.com/technetwork/java/
Text::Java (software platform) - Wikipedia, the free encyclopedia, URL::https://en.wikipedia.org/wiki/Java_(software_platform)
Text::Java (programming language) - Wikipedia, the free encyclopedia, URL::https://en.wikipedia.org/wiki/Java_(programming_language)
Text::Java Tutorial - TutorialsPoint, URL::https://www.tutorialspoint.com/java/
Text::Welcome to JavaWorld.com, URL::https://www.javaworld.com/
Text::Java.net: Welcome, URL::https://www.java.net/
Text::News for java, URL::?q=java
Text::Javalobby | The heart of the Java developer community, URL::https://java.dzone.com/

Note that Google search results currently sit inside an h3 tag with class “r”, with an “a” tag for the link. If that structure changes in the future, for example if the h3 class name changes, the selector will stop matching and we will have to adjust it by looking at the source HTML structure.
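
The program above also slices each href at fixed offsets, which throws an exception when an href contains no “&” and gives odd results for links that are not redirects. Below is a more defensive sketch of that step in plain Java. It assumes result hrefs have the form "/url?q=<target>&...", which is only inferred from the output above; the class and method names are hypothetical.

```java
public class GoogleHrefSketch {

    // Assumed prefix of Google's redirect links, inferred from the sample output.
    private static final String PREFIX = "/url?q=";

    // Returns the target URL embedded in a redirect href, or the href unchanged
    // when it does not start with the assumed prefix.
    static String targetUrl(String href) {
        if (!href.startsWith(PREFIX)) {
            return href;
        }
        int end = href.indexOf('&', PREFIX.length());
        // No '&' means the target runs to the end of the string.
        return end == -1
                ? href.substring(PREFIX.length())
                : href.substring(PREFIX.length(), end);
    }

    public static void main(String[] args) {
        System.out.println(targetUrl("/url?q=https://www.java.com/&sa=U")); // https://www.java.com/
        System.out.println(targetUrl("/url?q=https://www.java.com/"));      // https://www.java.com/
        System.out.println(targetUrl("/search?q=java&tbm=nws"));            // /search?q=java&tbm=nws
    }
}
```

In the loop above, the call would replace the substring expression, for example targetUrl(result.attr("href")).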

That’s all for the jsoup example tutorial; I hope it helps you parse HTML data easily when required.

Reference: Official Website

By admin
