jsoup Java HTML Parser With Examples

jsoup is an open source Java HTML parser that we can use to parse HTML and extract useful information. You can also think of jsoup as a web page scraping tool for the Java programming language.

The jsoup API can be used to fetch HTML from a URL, or to parse it from an HTML string or an HTML file.

Some of the useful features of the jsoup API are:

  1. Scrape HTML from a URL, or read it from a String or a file.
  2. Extract data from HTML through DOM-based traversal or CSS-like selectors.
  3. The jsoup API can be used to edit HTML too.
  4. The jsoup API is self-contained; we don’t need any other jars to use it.

You can download the jsoup jar from its website or, if you are using Maven, add the jsoup dependency (groupId org.jsoup, artifactId jsoup) to your pom.xml.

Let’s look at the different jsoup features one by one.
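The original dependency snippet is not preserved here, so the fragment below is a reconstruction. The groupId and artifactId are the published jsoup coordinates; the version number is an assumption and should be checked against Maven Central before use.

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- Assumed version, not taken from the original post; use the current release -->
    <version>1.17.2</version>
</dependency>
```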

jsoup example to load HTML document from URL

We can do it in one line: Jsoup.connect(url).get() fetches the page and returns the parsed Document.
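A minimal sketch of loading a document from a URL might look like the following. The class name LoadFromUrl, the helper fetchTitle and the URL https://example.com/ are placeholders chosen for illustration, not taken from the original post.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadFromUrl {

    // Fetches the page at the given URL and returns the text of its <title>.
    static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url).get(); // HTTP GET and parse in one call
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL (an assumption); replace it with the page you want.
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```

Network errors surface as IOException, so real code would normally catch it or set a timeout on the connection.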

jsoup example to parse HTML document from String

If we have HTML data as a String, we can parse it with Jsoup.parse(String html).
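As a small sketch of parsing from a String, the snippet below builds a Document from an in-memory HTML string and reads two values back. The sample HTML and the class name ParseFromString are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFromString {

    // Parses an HTML string into a jsoup Document.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample</title></head>"
                + "<body><p>Hello jsoup</p></body></html>";
        Document doc = parse(html);
        System.out.println(doc.title());       // prints: Sample
        System.out.println(doc.body().text()); // prints: Hello jsoup
    }
}
```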

jsoup example to load a document from file

If the HTML data is saved in a file, we can load it with Jsoup.parse(File file, String charsetName).
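The following sketch loads a document from a file. To stay runnable anywhere it first writes a temporary file; the file contents, the UTF-8 charset and the class name LoadFromFile are assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadFromFile {

    // Reads and parses an HTML file, decoding it as UTF-8.
    static Document load(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary HTML file so the sketch has something to read.
        Path tmp = Files.createTempFile("sample", ".html");
        String html = "<html><head><title>From file</title></head><body></body></html>";
        Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));

        Document doc = load(tmp.toFile());
        System.out.println(doc.title()); // prints: From file

        Files.delete(tmp);
    }
}
```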

Parsing HTML Body Fragment

One of the best features of jsoup is that if we supply only an HTML body fragment, it tries hard to generate valid HTML for us: Jsoup.parseBodyFragment(String bodyHtml) places the fragment inside a document shell with html, head and body elements.

The above code prints the following HTML.

Let’s now look at the different methods to extract data from HTML.
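A short sketch of body fragment parsing is shown below. The input fragment, with its deliberately unclosed p tag, and the class name ParseFragment are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFragment {

    // Parses a body fragment; jsoup supplies the surrounding html/head/body shell.
    static Document parse(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        // The <p> is left unclosed on purpose; jsoup closes it while building the tree.
        Document doc = parse("<h1>Title</h1><p>Unclosed paragraph");
        System.out.println(doc.html());        // the full generated document
        System.out.println(doc.body().html()); // only the normalized fragment
    }
}
```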

jsoup DOM Methods

Following the HTML DOM model, jsoup parses HTML into a Document. A Document consists of different elements, and there are many useful methods that we can use to find them. Some of these methods, available on Document and Element, are:

  1. getElementById(String id)
  2. getElementsByTag(String tag)
  3. getElementsByClass(String className)
  4. getElementsByAttribute(String key)
  5. siblingElements(), firstElementSibling(), lastElementSibling() etc.

An Element has attributes, text and inner HTML, so there are methods for element data too.

  1. attr(String key) to get and attr(String key, String value) to set attributes
  2. id(), className() and classNames()
  3. text() to get and text(String value) to set the text content
  4. html() to get and html(String value) to set the inner HTML content
  5. tag() and tagName()

There are some methods for manipulating HTML data as well.

  1. append(String html), prepend(String html)
  2. appendText(String text), prependText(String text)
  3. appendElement(String tagName), prependElement(String tagName)
  4. html(String value)

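A simple use of these DOM methods is to parse my website home page and list all the links: getElementsByTag("a") returns the anchor elements, and attr("href") and text() read the target and label of each one.

The above program produces the following output.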
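The original listing is not available, so here is a hedged sketch of listing links with the DOM methods. The URL https://example.com/ stands in for the author’s home page, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ListLinks {

    // Returns one "text -> href" entry for every <a> element, in document order.
    static List<String> links(Document doc) {
        List<String> result = new ArrayList<>();
        Elements anchors = doc.getElementsByTag("a");
        for (Element a : anchors) {
            result.add(a.text() + " -> " + a.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL (an assumption); the original post used the author's site.
        Document doc = Jsoup.connect("https://example.com/").get();
        links(doc).forEach(System.out::println);
    }
}
```

Keeping the extraction in a method that takes a Document makes it easy to try the same logic on a string parsed with Jsoup.parse, without any network access.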

jsoup selector syntax

We can also use CSS- or jQuery-like syntax to find and manipulate HTML elements. Document and Element both provide select(String cssQuery) for this.

Some quick examples are:

  1. doc.select("a"): returns all “a” tag elements from the HTML.
  2. doc.select("c|if"): finds <c:if> elements.
  3. doc.select("#id1"): returns all elements with id="id1".
  4. doc.select(".cl1"): returns all elements with class="cl1".
  5. doc.select("[href]"): returns all elements that have an href attribute.

We can combine selectors too; you can find more details in the jsoup Selector API documentation.
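To make the selector examples concrete, the sketch below runs a few of them, plus one combined selector, against a small made-up HTML string. The sample markup and the class name SelectorExamples are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorExamples {

    // Made-up sample markup: one div with two links, and one paragraph.
    static final String HTML =
            "<div id=\"id1\" class=\"cl1\">"
          + "<a href=\"/one\">one</a><a>no href</a>"
          + "</div>"
          + "<p class=\"cl1\">text</p>";

    // Returns how many elements of the sample markup match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                 // 2: both anchors
        System.out.println(count("#id1"));              // 1: the div
        System.out.println(count(".cl1"));              // 2: the div and the paragraph
        System.out.println(count("[href]"));            // 1: only one anchor has href
        System.out.println(count("div.cl1 > a[href]")); // 1: combined selector
    }
}
```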

Let’s now look at an example where I will fetch my Google+ author URL from my website using both the DOM and the Selector API.

The above program prints the following output.
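Since the original program is missing, the following is only a sketch of the idea. It assumes the author link is an anchor marked with rel="author", which the post does not confirm, and it uses a made-up HTML string and the hypothetical URL https://example.com/author instead of the real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AuthorLink {

    // DOM API: visit every <a> and check its rel attribute by hand.
    static String viaDom(Document doc) {
        for (Element a : doc.getElementsByTag("a")) {
            if ("author".equals(a.attr("rel"))) {
                return a.attr("href");
            }
        }
        return null; // no author link found
    }

    // Selector API: let one CSS query do the filtering.
    static String viaSelector(Document doc) {
        Element a = doc.select("a[rel=author]").first();
        return a == null ? null : a.attr("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup; the real page structure is not known.
        Document doc = Jsoup.parse(
                "<a href=\"/home\">Home</a>"
              + "<a rel=\"author\" href=\"https://example.com/author\">Author</a>");
        System.out.println(viaDom(doc));
        System.out.println(viaSelector(doc));
    }
}
```

Both methods return the same href here; the selector version is shorter because the attribute test is part of the query.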

jsoup example to modify HTML

Let’s now look at a jsoup example where I will parse input HTML and manipulate it.

Please look at the above program carefully to understand what modifications are made to the input HTML string, and compare it with the final document shown in the output.
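The original program is not preserved, so the sketch below shows one possible set of modifications using the manipulation methods listed earlier. The input HTML, the id "content" and the specific edits are all assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ModifyHtml {

    // Parses the input and edits it; assumes it contains an element with id "content".
    static Document modify(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.getElementById("content");

        div.addClass("highlight");                         // add a CSS class
        div.attr("title", "Edited by jsoup");              // set an attribute
        div.appendElement("p").text("Appended paragraph"); // new last child
        div.prepend("<h2>Prepended heading</h2>");         // new first child

        doc.select("a").attr("rel", "nofollow");           // set rel on every link
        return doc;
    }

    public static void main(String[] args) {
        String input = "<div id=\"content\"><a href=\"/x\">a link</a></div>";
        System.out.println(modify(input).body().html());
    }
}
```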

jsoup example to parse Google search page and find results

Before I conclude this post, here is an example where I am parsing the first page of Google search results and fetching all the links.

It prints the following output.
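A hedged sketch of such a program follows. The selector "h3.r a" is an assumption based on the markup this post describes (an h3 with class “r” wrapping the link) and will likely not match current Google pages; the search query, user agent string and timeout are also illustrative choices.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class GoogleResults {

    // Assumed selector for result links; adjust it after inspecting the page source.
    static final String RESULT_LINKS = "h3.r a";

    // Returns "text -> href" for every element matched by the CSS query.
    static List<String> extract(Document doc, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element a : doc.select(cssQuery)) {
            out.add(a.text() + " -> " + a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String query = URLEncoder.encode("jsoup tutorial", "UTF-8");
        Document doc = Jsoup.connect("https://www.google.com/search?q=" + query)
                .userAgent("Mozilla/5.0") // assumed value; the default agent may be rejected
                .timeout(5000)            // milliseconds
                .get();
        extract(doc, RESULT_LINKS).forEach(System.out::println);
    }
}
```

Passing the selector as a parameter keeps the one part that is expected to break in a single, easy to change place.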

Note that currently Google search results are part of an h3 tag with class “r”, and “a” is used for the link. If this changes in the future, for example if the h3 class name changes, the program will stop working properly and we will have to adjust the selector after looking at the source HTML structure.

That’s all for the jsoup example tutorial. I hope it helps you parse HTML data easily when required.

Reference: Official Website

By admin
