NixMash Spring Posts: Using Jsoup To Create Page Previews

We introduced NixMash Spring Posts earlier as a stream of links, news and bits of code currently up and running on the demo site. We'll begin our review of the behind-the-scenes tech making up the Posts framework by looking at the role of Jsoup in populating the Add Link form shown below. The populated form is our end game. We'll talk about how we get there with Jsoup.

Remote Source Page Parsing is a Box of Chocolates

Jsoup is awesome, but it's important to know going in that parsing pages from a variety of remote sources is a box of chocolates. You never know what you're going to get. For instance, we added a new @TwitterSelector to our Jsoup Annotation framework.

In the JsoupTwitter container we have the properties shown below. If a page has a Twitter Card meta tag, you would assume it also has a title, a description and the other card fields. That would be a false assumption.

Long story short, you do a lot of ugly

if (!hasTwitterTitle) getPageTitle()...

type of checks to populate a suitable PagePreview object to fill the Add Link form we saw at the top.

As an example of the hit-and-miss nature of building Page Preview containers with Jsoup, let's move to our PostsController, where we display the Add Link View. Near the top we retrieve our PagePreviewDTO from our Spring jsoupService. We add the pagePreview to our Model primarily because it may contain a List<JsoupImage> from the page source for our image carousel. Otherwise we rely on our PostDTO container for creating the link.

But populating the postDTO in postDtoFromPagePreview() is where the box of chocolates can get a little sticky and those if (!hasSomeField) { getSomeOtherField() } checks come into play.
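To make those checks concrete, here is a minimal sketch of the fallback idea in plain Java. It is not the code behind postDtoFromPagePreview(); the class name, the method names and the choice of fallback order (Twitter Card value, then the page's own value, then the URL) are all assumptions made for illustration.

```java
/** Hypothetical helper; names and fallback order are assumptions, not the NixMash source. */
final class PreviewFallbackSketch {

    /** Returns the first candidate that is non-null and not blank (trimmed), or "" if none qualifies. */
    static String firstNonBlank(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim();
            }
        }
        return "";
    }

    /** Title preference: Twitter Card title, then the page title, then the URL itself. */
    static String chooseTitle(String twitterTitle, String pageTitle, String url) {
        return firstNonBlank(twitterTitle, pageTitle, url);
    }

    /** Description preference: Twitter Card description, then the page's meta description. */
    static String chooseDescription(String twitterDescription, String metaDescription) {
        return firstNonBlank(twitterDescription, metaDescription);
    }

    public static void main(String[] args) {
        // The card exists but carries no title, so the page title wins.
        System.out.println(chooseTitle(null, "Example Domain", "http://example.com"));
        // Neither title is usable, so we fall back to the URL.
        System.out.println(chooseTitle("  ", null, "http://example.com"));
    }
}
```

Collapsing each chain of if (!hasSomeField) checks into one ordered list of candidates keeps the preference order visible in a single place, which helps when a new source such as another meta-tag family is added later.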

The Jsoup Connection

Another problem when retrieving HTML from a multitude of remote sites is that servers differ in what they require of a client before they will hand over a page. After some trial and error, the configuration for Jsoup.connect() in the JsoupService below seems to work pretty well.

SSL and certificate requirements can get complicated, so for simplicity's sake we first build a Jsoup Connection with the default validateTLSCertificates setting (true) and, if that fails, try again with “false”. The second call is rarely needed, but it gives us [thus far, knock-on-wood] 100% page retrieval success for populating our Add Link form.
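The strict-then-lenient retry described above can be sketched without touching the network. In this sketch the PageFetcher interface is a hypothetical stand-in for the real Jsoup call, so the example runs without Jsoup on the classpath; in the actual service the lambda body would be whatever builds the Jsoup Connection with the given validateTLSCertificates flag. As far as we know, later Jsoup releases dropped that setter, so check the version you are using.

```java
import java.io.IOException;
import javax.net.ssl.SSLHandshakeException;

/** Hypothetical sketch of a two-step fetch; not the NixMash JsoupService. */
final class TlsFallbackSketch {

    /** Assumed abstraction over one page fetch; this interface is not part of Jsoup. */
    @FunctionalInterface
    interface PageFetcher {
        String fetch(String url, boolean validateCertificates) throws IOException;
    }

    /**
     * Tries the fetch with certificate validation on. If that attempt throws an
     * IOException, retries once with validation off. A failure of the second
     * attempt propagates to the caller.
     */
    static String fetchWithFallback(PageFetcher fetcher, String url) throws IOException {
        try {
            return fetcher.fetch(url, true);
        } catch (IOException strictFailure) {
            return fetcher.fetch(url, false);
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake fetcher simulating a server whose certificate fails validation.
        PageFetcher fake = (url, validate) -> {
            if (validate) {
                throw new SSLHandshakeException("untrusted certificate");
            }
            return "<html><title>ok</title></html>";
        };
        System.out.println(fetchWithFallback(fake, "https://self-signed.example"));
    }
}
```

Skipping certificate validation is tolerable here only because the page is read once to build a preview; it would be a poor default for anything that sends credentials or trusts the response.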

Source Code Notes for this Post

All source code discussed in this post can be found in my NixMash Spring GitHub repo and viewed online here.

Posted June 09, 2016 03:23 PM EDT
