JSOUP Parsing and Modifying HTML Documents

Overview
What is Jsoup
Parsing from a URL
Parsing from a File
Modify Data
Prevent cross-site scripting

JSOUP
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for
extracting and manipulating data. With it you can:
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
● clean user-submitted content against a safe white-list, to prevent XSS attacks
● output tidy HTML

Parse a document from a URL
The connect(String url) method creates a new Connection, and get() fetches and parses the HTML
document. If an error occurs while fetching the URL, it throws an IOException, which you should
handle appropriately.
Document document = Jsoup.connect("https://grails.org/").get()
String title = document.title()

Parse a document from a URL (continued)
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
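
The chaining and the IOException handling described above can be combined in one place. The following is a minimal sketch, not the deck's own demo code: the class name FetchSketch and the helpers buildRequest and fetchTitle are hypothetical, and running main needs network access.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchSketch {

    // Builds the request only; nothing is sent over the network here.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(3000); // milliseconds
    }

    // Sends a GET request and returns the page title, or null if the fetch fails.
    static String fetchTitle(String url) {
        try {
            Document doc = buildRequest(url).get();
            return doc.title();
        } catch (IOException e) {
            // Timeouts, unreachable hosts and HTTP error statuses all end up here.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("https://grails.org/"));
    }
}
```

Returning null on failure is just one choice; rethrowing or returning an Optional may suit a real application better.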

Parse a document from a string
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make
sure it's well formed, or to modify it. The String may have come from user input, a file, or from the
web.
String html = "<html><head><title>First parse</title></head>"
+ "<body>Parsed HTML into a doc.</body></html>";
Document doc = Jsoup.parse(html);
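
Once the string is parsed, the resulting Document can be queried like any other. A small sketch of reading the title and body text back out; the class name ParseStringSketch and its helper methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {

    // Returns the contents of the <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the visible text of the <body> element.
    static String bodyTextOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body>Parsed HTML into a doc.</body></html>";
        System.out.println(titleOf(html));    // First parse
        System.out.println(bodyTextOf(html)); // Parsed HTML into a doc.
    }
}
```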

Load a document from a file
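File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
// getElementById returns a single Element (null if nothing matches)
Element content = document.getElementById("content")
// the tag and class lookups return Elements collections
Elements paragraphs = document.getElementsByTag("p")
Elements green = document.getElementsByClass("green")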
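
To try the file-based lookups without depending on a particular path, the sketch below writes a small sample page to a temporary file first. The sample markup, the class name ParseFileSketch and the helper methods are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ParseFileSketch {

    // Hypothetical sample page, used only for this example.
    static final String SAMPLE = "<html><body><div id=\"content\">Main</div>"
            + "<p class=\"green\">One</p><p>Two</p></body></html>";

    // Writes the sample to a temp file and parses it with an explicit charset.
    static Document loadSample() {
        try {
            File file = File.createTempFile("jsoup-sample", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), SAMPLE.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(file, "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String contentText() {
        Element content = loadSample().getElementById("content");
        return content == null ? "" : content.text(); // guard against a missing id
    }

    static int paragraphCount() {
        return loadSample().getElementsByTag("p").size();
    }

    static int greenCount() {
        return loadSample().getElementsByClass("green").size();
    }

    public static void main(String[] args) {
        System.out.println(contentText());    // Main
        System.out.println(paragraphCount()); // 2
        System.out.println(greenCount());     // 1
    }
}
```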

Use DOM methods to navigate a document
You have an HTML document that you want to extract data from. The select method takes a CSS
selector and returns the matching elements.
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Elements elements = document.select(".nav-sections li")
elements.each { element ->
String text = element.select("a").text()
String attr = element.select("a").attr("href")
}
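
The Groovy loop above has a direct plain-Java equivalent. This is a sketch over an inline HTML string instead of a file; the markup, the class name NavSketch and the "text -> href" output format are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NavSketch {

    // Collects "text -> href" for the first link in each navigation item.
    static List<String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element item : doc.select(".nav-sections li")) {
            Element link = item.select("a").first();
            if (link != null) { // skip list items that contain no link
                links.add(link.text() + " -> " + link.attr("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"nav-sections\">"
                + "<li><a href=\"/docs\">Docs</a></li>"
                + "<li><a href=\"/blog\">Blog</a></li>"
                + "<li>No link</li></ul>";
        navLinks(html).forEach(System.out::println);
    }
}
```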

Modify Data
Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key,
String value).
If you need to modify the class attribute of an element, use the Element.addClass(String className)
and Element.removeClass(String className) methods.
The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow"
attribute to every a element inside a div:
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
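
The two calls above can be checked end to end on a small document. A sketch, assuming a made-up page with a masthead and a comments block; the class name ModifySketch and the helper decorate are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifySketch {

    // Applies the bulk attribute and class changes, then returns the body markup.
    static String decorate(String html) {
        Document doc = Jsoup.parse(html);
        // Only links inside div.comments are touched.
        doc.select("div.comments a").attr("rel", "nofollow");
        // attr(...) returns the same Elements, so calls can be chained.
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"masthead\"><a href=\"/\">Home</a></div>"
                + "<div class=\"comments\"><a href=\"http://example.com/\">spam</a></div>";
        System.out.println(decorate(html));
    }
}
```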

Setting the HTML content of an element
Element div = document.select("div").first();
div.html("paragraph");
div.prepend("First");
div.append("Last");
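
The difference between html(...), which parses its argument as markup, and text(...), which treats it as literal text, is easy to miss. A sketch using paragraph markup chosen for the example; the class name ContentSketch and its helpers are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ContentSketch {

    // Replaces the div's content, then adds markup before and after it.
    static String rebuild() {
        Element div = Jsoup.parse("<div>old</div>").select("div").first();
        div.html("<p>paragraph</p>"); // replaces existing children
        div.prepend("<p>First</p>");  // becomes the first child
        div.append("<p>Last</p>");    // becomes the last child
        List<String> texts = new ArrayList<>();
        for (Element child : div.children()) {
            texts.add(child.text());
        }
        return String.join(", ", texts);
    }

    // text(...) does not parse markup, so no child elements are created.
    static int childElementsAfterText() {
        Element div = Jsoup.parse("<div></div>").select("div").first();
        div.text("<b>not bold</b>");
        return div.children().size();
    }

    public static void main(String[] args) {
        System.out.println(rebuild());                // First, paragraph, Last
        System.out.println(childElementsAfterText()); // 0
    }
}
```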

Sanitize untrusted HTML (to prevent XSS)
A Whitelist defines which elements and attributes are kept during cleaning; everything else is
discarded. (Newer jsoup releases rename this class to Safelist.)
String unsafe = "<a href='http://example.com/' onclick='stealCookies()'>Link</a>"
String safe = Jsoup.clean(unsafe, Whitelist.basic());
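
A sketch of the same cleaning step wrapped in a helper. It assumes a recent jsoup release, where the class is called Safelist; on older versions Whitelist.basic() is the equivalent call. The class name SanitizeSketch is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeSketch {

    // Keeps only the tags and attributes permitted by the basic safelist.
    static String sanitize(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<a href='http://example.com/' onclick='stealCookies()'>Link</a>";
        // The onclick handler is dropped; the link and its href are kept.
        System.out.println(sanitize(unsafe));
        // Script elements are not on the safelist and are removed entirely.
        System.out.println(sanitize("<script>alert(1)</script><b>ok</b>"));
    }
}
```

The basic list is fairly strict; Safelist.relaxed() or a custom list may be a better fit when images or tables must survive.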

Tidy HTML
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of
whether the HTML is well-formed or not. It handles:
● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
● reliably creating the document structure (html containing a head and body, and only
appropriate elements within the head)
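
The unclosed-tag and document-structure behaviour can be observed directly. A small sketch; the class name TidySketch and its helpers are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TidySketch {

    // Counts <p> elements after the parser has repaired the markup.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Checks that the parsed document has both a head and a body.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String messy = "<p>Lorem <p>Ipsum";
        System.out.println(paragraphCount(messy));  // 2
        System.out.println(hasHeadAndBody(messy));  // true
        // Prints the full, tidied document including html, head and body.
        System.out.println(Jsoup.parse(messy).html());
    }
}
```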

Demo Reference
https://github.com/NexThoughts/JSOUP.git
