Using Jsoup to Parse HTML in Java

Jsoup is a Java library from Jonathan Hedley for parsing HTML. You can read all about it at jsoup.org. Jsoup uses DOM-like methods to find elements and to extract and manipulate their data. Here is a sample of how I’m using it in an application to extract individual links from my NixMashup posts.

Here is a sample NixMashup post with the individual links we’re going to extract with Jsoup. Each NixMashup contains between 8 and 10 links like the one circled below.

Of course, it’s a good idea to plan ahead for parsing HTML by giving the markup as much intelligent DOM structure as possible. Here is a NixMashup HTML excerpt. Each link is highly structured and has the class “mashup,” which we will see again when using Jsoup.

Below we see how Jsoup reads the entire post contents, then collects the matching nodes into an Elements object (a List of Element nodes) called “links.”

Elements links = doc.select("div.mashup");
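To put that line in context, here is a minimal sketch of parsing a post and running the selector. The class name, the helper method, and the sample markup are assumptions made for illustration; the real NixMashup HTML is not reproduced here and will differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MashupSelectExample {

    // Hypothetical post markup: two "mashup" blocks and one unrelated div.
    static final String SAMPLE_POST =
          "<div class=\"mashup\"><h3><a href=\"https://example.com/one\">First link</a></h3>"
        + "<p>Some <em>descriptive</em> text.</p></div>"
        + "<div class=\"mashup\"><h3><a href=\"https://example.com/two\">Second link</a></h3>"
        + "<p>More text.</p></div>"
        + "<div class=\"other\">Not a mashup link.</div>";

    // Parses the HTML string and returns every div carrying the "mashup" class.
    static Elements selectMashupLinks(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.mashup");
    }

    public static void main(String[] args) {
        Elements links = selectMashupLinks(SAMPLE_POST);
        for (Element link : links) {
            // html() returns the inner HTML of each matched div.
            System.out.println(link.html());
        }
    }
}
```

Because Elements behaves like a List of Element, it works directly in a for-each loop, and size() tells you how many links were found.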

Each element in “links” holds the HTML of one “div.mashup” block, which we pass individually to a populateNixMashupLink() method to process the link HTML.

In populateNixMashupLink() we extract the link title, text (in both text and html format), and the remaining HTML elements, returning a populated NixMashupLink object.
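As a rough illustration of that step, the sketch below pulls a title, plain text, and inner HTML out of a single element. It is not the application's actual method: the NixMashupLink fields, the choice of the first anchor as the title, and the use of the first paragraph as the body are all assumptions about markup that is not shown here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class PopulateLinkExample {

    // Hypothetical value object; the real NixMashupLink class may look different.
    static class NixMashupLink {
        String title;
        String url;
        String text;  // body as plain text
        String html;  // body with its inline markup preserved
    }

    // Assumes the first anchor holds the title and the first <p> holds the body.
    static NixMashupLink populateNixMashupLink(Element mashup) {
        NixMashupLink link = new NixMashupLink();

        Element anchor = mashup.selectFirst("a[href]");
        if (anchor != null) {
            link.title = anchor.text();
            link.url = anchor.attr("href");
        }

        Element body = mashup.selectFirst("p");
        if (body != null) {
            link.text = body.text();
            link.html = body.html();
        }
        return link;
    }

    public static void main(String[] args) {
        String html =
              "<div class=\"mashup\"><a href=\"https://example.com/one\">First link</a>"
            + "<p>Some <em>descriptive</em> text.</p></div>";

        Element mashup = Jsoup.parse(html).selectFirst("div.mashup");
        NixMashupLink link = populateNixMashupLink(mashup);

        System.out.println(link.title);
        System.out.println(link.url);
        System.out.println(link.text);
        System.out.println(link.html);
    }
}
```

The null checks matter because selectFirst() returns null when nothing matches, which is easy to hit if a post's markup drifts from the expected structure.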

Way too easy, eh?