Jsoup Annotations: Introduction

The description for v0.3.2 of NixMash Spring is “Jsoup Fun.” Part of that fun is taking a different approach from the boilerplate Jsoup APIs I used a couple of years ago for my NixMashup Links site. A general rule I’ve learned in the coding business is that if you’re doing something the same way you did it a couple of years ago, you can probably come up with a better way of doing it.

So I ran across an obscure blog post from 2013 called Annotation Based HTML to Object Mapper using Jsoup Parser and took it to the next level with Jsoup Annotations. For now the logic is baked into the jsoup module of NixMash Spring, but depending on feedback we could break it out into its own project repository.

In this post we go over why annotations can be useful and describe the currently available Jsoup Annotations in NixMash Spring. In a future post we’ll look more at Jsoup Annotations in action and the logic behind the scenes.

Why Jsoup Annotations

Let’s say we want to parse the images from the <img> tags of an HTML page and place them into a List<> object. Using standard Jsoup APIs we could do something like this.
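As a rough illustration of the hand-written approach, here is a minimal sketch using the standard Jsoup calls Jsoup.parse(), select() and attr(). The PageImage holder class and the imagesInPage() method are names made up for this example; they are not the JsoupImage class or any method from NixMash Spring.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ImageListExample {

    // Hypothetical holder for this sketch; NOT the JsoupImage class from NixMash Spring.
    public static class PageImage {
        public final String src;
        public final String alt;

        public PageImage(String src, String alt) {
            this.src = src;
            this.alt = alt;
        }
    }

    // Walk every <img> element in the document and copy its attributes into a list.
    public static List<PageImage> imagesInPage(String html) {
        Document doc = Jsoup.parse(html);
        List<PageImage> images = new ArrayList<>();
        for (Element img : doc.select("img")) {
            // attr() returns an empty string when the attribute is missing
            images.add(new PageImage(img.attr("src"), img.attr("alt")));
        }
        return images;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<img src=\"/a.png\" alt=\"First\">"
                + "<img src=\"/b.png\" alt=\"Second\">"
                + "</body></html>";
        for (PageImage image : imagesInPage(html)) {
            System.out.println(image.src + " : " + image.alt);
        }
    }
}
```

Nothing here is hard, but the same select-loop-copy pattern gets repeated for every kind of element we want to pull from a page, which is the boilerplate the annotations are meant to remove.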

With Jsoup Annotations we create a standard POJO container and pass it to our parser. The annotated property below does the same job as the hand-written Jsoup code.

@ImageSelector
public List<JsoupImage> getImagesInPage;

Our Parser sees the @ImageSelector annotation and does the rest! And the beauty of this approach is that since the Parser receives a Generic Type, you can pass any POJO to the Jsoup Parser and only add property annotations.
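To give a feel for how a generic parser can "see" an annotation, here is a minimal sketch built only on the Java reflection API. Everything in it is hypothetical: the @Select annotation, the PageSummary POJO and the fill() method are stand-ins invented for this example, and a plain Map takes the place of a parsed Jsoup Document. It is not the NixMash Spring implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

public class AnnotationScanSketch {

    // Hypothetical annotation for this sketch. RUNTIME retention keeps it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Select {
        String value();
    }

    // Hypothetical POJO. Fields are public so they can be set reflectively without setAccessible().
    public static class PageSummary {
        @Select("title")
        public String title;

        @Select("#description")
        public String description;

        public String notAnnotated;
    }

    // Generic filler: accepts any POJO type that has a public no-arg constructor.
    // The Map is an assumption standing in for a parsed document (selector -> extracted text).
    public static <T> T fill(Class<T> type, Map<String, String> lookup)
            throws ReflectiveOperationException {
        T target = type.getDeclaredConstructor().newInstance();
        for (Field field : type.getFields()) {
            Select select = field.getAnnotation(Select.class);
            // Only annotated String fields are populated; everything else is left alone.
            if (select != null && field.getType() == String.class) {
                field.set(target, lookup.get(select.value()));
            }
        }
        return target;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, String> lookup =
                Map.of("title", "Example Page", "#description", "A short summary");
        PageSummary summary = fill(PageSummary.class, lookup);
        System.out.println(summary.title + " | " + summary.description);
    }
}
```

Because fill() only depends on Class<T> and the annotations it finds, a new POJO needs no parser changes, only annotated properties.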

Here is the list of currently available annotations:

  • @Selector: selects by tag, class or id. Used with @TextValue, @HtmlValue and @AttributeValue
  • @MetaName: selects a Meta Tag by its name attribute
  • @MetaProperty: selects a Meta Tag by its property attribute
  • @LinkSelector: retrieves the href and text values of a link element and stores them in a JsoupLink object
  • @ImageSelector: retrieves the src, alt, height and width of an image element and stores them in a JsoupImage object

Below is an image of a PagePreviewDTO class we’ll be using in NixMash Spring to show what a working annotated POJO looks like. (Since we’re using Java Reflection to set the property values they have to be public, thus the IntelliJ @SuppressWarnings designation. That’s not relevant, but I didn’t want you to think it was part of our Jsoup Annotations logic.)

A few helpful tips.

  1. @Selector("title") has no accompanying @TextValue, @HtmlValue or @AttributeValue, so it defaults to @TextValue.
  2. CSS IDs and class names must be prefixed with “#” or “.” respectively in the annotation value.
  3. When retrieving an Image or Link List with @ImageSelector or @LinkSelector, the Annotation("value") indicates the area of the page from which to retrieve the images or links. No value retrieves them from the entire page (the Jsoup Document).
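The selector conventions in these tips line up with plain Jsoup CSS selectors. The sketch below shows the equivalent raw Jsoup queries for a tag, a class, a meta tag, and images scoped to one area of the page. The sample HTML and the helper method names are made up for illustration, and this is only my reading of what the annotation values translate to, not the parser's actual code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorTipsExample {

    // Made-up sample page for this sketch.
    public static final String HTML = "<html><head><title>Repo Home</title>"
            + "<meta name=\"description\" content=\"A sample page\">"
            + "<meta property=\"og:title\" content=\"Sample OG Title\">"
            + "</head><body>"
            + "<div id=\"readme\"><img src=\"/in-readme.png\">"
            + "<a class=\"ext\" href=\"/docs\">Docs</a></div>"
            + "<img src=\"/elsewhere.png\">"
            + "</body></html>";

    // Tag, "#id" or ".class" selector, returning the element text.
    public static String text(Document doc, String css) {
        return doc.select(css).text();
    }

    // Meta tag lookup by either its "name" or its "property" attribute.
    public static String metaContent(Document doc, String attribute, String value) {
        return doc.select("meta[" + attribute + "=" + value + "]").attr("content");
    }

    // Images inside a page area; an empty scope means the whole document.
    public static int imageCount(Document doc, String scopeCss) {
        String css = scopeCss.isEmpty() ? "img" : scopeCss + " img";
        return doc.select(css).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(text(doc, "title"));
        System.out.println(text(doc, ".ext"));
        System.out.println(metaContent(doc, "name", "description"));
        System.out.println(metaContent(doc, "property", "og:title"));
        System.out.println(imageCount(doc, "#readme") + " of " + imageCount(doc, ""));
    }
}
```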

Wiring Up the Jsoup Annotated POJO

You saw the simple DTO above. Nothing special there. What we need to do, however, is pass our POJO as a Generic Type to our JSoupHtmlParser<T> Class. This is how we’ll do that.

JSoupHtmlParser<PagePreviewDTO> pagePreviewParser;

Now we can call the parse() method in our Jsoup annotation parser and we’re done!

You might recognize the results, as we’re passing a copy of a Repository Home Page on GitHub.

Next up we’ll look at more examples of using Jsoup Annotations, then go over how to wire up the JSoupHtmlParser with Generic Types as a Spring Bean.

Source Code Notes for this Post

All source code discussed in this post can be found in my NixMash Spring GitHub repo and viewed online here.