Parsing HTML in Java is easy today, because there are plenty of libraries to choose from. Jsoup sits at the top of the list of Java HTML parsers; the other members are HtmlUnit, TagSoup, and JTidy.
Java is one of the world’s most widely used languages, and it powers billions of devices. From the very beginning it started like a beast, and it is still dominating like a beast! Sometimes I feel like it doesn’t care about the new snakes coming out daily!
And in this age of technology, content scraping is getting more important. HTML is a markup language: a standardized system for tagging text files to achieve font, color, hyperlink, and graphic effects. All the webpages we see are built with HTML, and content scraping is how we mine the contents of a website.
Let’s discuss some of the best HTML scrapers/parsers for Java, with which we can mine data regularly and at scale.
According to Stack Overflow:
Almost all known HTML parsers implements the W3C DOM API (part of the JAXP API, Java API for XML processing) and gives you a org.w3c.dom.Document back which is ready for direct use by JAXP API.
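To make that quote a bit more concrete, here is a minimal sketch of what working with an org.w3c.dom.Document through JAXP can look like. It uses only the JDK, so the built-in DocumentBuilder stands in for an HTML parser and the input has to be well-formed markup. The class and method names (DomXPathSketch, parseWellFormed, hrefs) are made up for illustration.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class DomXPathSketch {

    // Assumption: the markup is well-formed. The JDK parser rejects sloppy HTML,
    // which is exactly the gap the libraries in this post try to fill.
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Runs an XPath query against the DOM and collects every href on an <a> element.
    static List<String> hrefs(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/one\">1</a><a href=\"/two\">2</a></body></html>";
        System.out.println(hrefs(parseWellFormed(page)));
    }
}
```

Any parser that hands back an org.w3c.dom.Document could, in principle, feed the hrefs method in the same way.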
1. Jsoup – Leading Java Html Parser
This is perhaps the most used library for parsing XML and HTML content. The API is quite simple and predictable. From the jsoup website:
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
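A few of the points above can be sketched in a handful of lines. This is only an illustration, assuming jsoup is on the classpath and is version 1.14.2 or newer (where the safe-list class is called Safelist; older releases name it Whitelist). The class JsoupSketch and its method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Parse HTML from a string and extract data with a CSS selector.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Manipulate attributes: mark every link as rel="nofollow".
    static String withNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Clean user-submitted content against jsoup's basic safe-list.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<p>Docs: <a href='https://example.com/a'>A</a> "
                + "and <a href='https://example.com/b'>B</a></p>";
        System.out.println(linkTargets(html));
        System.out.println(withNoFollow(html));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Scraping from a URL or a file follows the same pattern, with Jsoup.connect(url).get() or Jsoup.parse(file, charsetName) producing the Document instead of Jsoup.parse(html).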
2. HtmlUnit
HtmlUnit is a “GUI-Less browser for Java programs”. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc… just like you do in your “normal” browser.
3. TagSoup
I found TagSoup while searching for Python’s BeautifulSoup. The library says it is built to handle error-prone HTML. From its release page:
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
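Since the selling point is the SAX interface, here is a rough sketch of a SAX handler that collects link targets. To keep it runnable with nothing but the JDK, the demo feeds well-formed markup to the JDK's own XMLReader. The assumption is that, with TagSoup on the classpath, an instance of org.ccil.cowan.tagsoup.Parser (which implements XMLReader) could be passed in instead to cope with messy real-world HTML. The names SaxLinkSketch, collectHrefs, and jdkReader are made up for this example.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SaxLinkSketch {

    // Streams the markup through the given reader and records the href of every <a> element.
    static List<String> collectHrefs(XMLReader reader, String markup) {
        List<String> hrefs = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // localName is empty when the reader is not namespace-aware, so fall back to qName.
                String name = localName.isEmpty() ? qName : localName;
                String href = atts.getValue("href");
                if ("a".equalsIgnoreCase(name) && href != null) {
                    hrefs.add(href);
                }
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return hrefs;
    }

    // JDK stand-in for the demo; it only accepts well-formed markup.
    // Assumption: new org.ccil.cowan.tagsoup.Parser() could be returned here instead.
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a SAX reader", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/x\">x</a><a name=\"n\">n</a></body></html>";
        System.out.println(collectHrefs(jdkReader(), page));
    }
}
```

Because the handler only depends on the XMLReader interface, swapping the parser does not require touching the extraction logic.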
4. JTidy
I couldn’t find much about JTidy. On its release page, they say:
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.