SlideShare a Scribd company logo
1 of 14
Download to read offline
JSOUP
Overview
What is Jsoup
Parsing with Url
Parsing with File
Modify Data
Prevent cross site scripting
JSOUP
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for
extracting and manipulating data,
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
● clean user-submitted content against a safe white-list, to prevent XSS attacks
● output tidy HTML
Parse a document from a url
The connect(String url) method creates a new Connection, and get()fetches and parses a HTML file. If
an error occurs whilst fetching the URL, it will throw an IOException, which you should handle
appropriately.
Document document = Jsoup.connect("https://grails.org/").get()
String title = document.title()
.
Continue..
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Parse a document from a string
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make
sure it's well formed, or to modify it. The String may have come from user input, a file, or from the
web.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Load a document from a file
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
String content = document.getElementById(“content”)
String tag = document.getElementByTag(“p”)
String class = document.getElementByClass(“green”)
Use DOM methods to navigate a document
You have a HTML document that you want to extract data from.
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Elements elements = document.select(".nav-sections li")
elements.each { element ->
String text = element.select("a").text()
String attr = element.select("a").attr("href")
}
Modify Data
Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key,
String value).
If you need to modify the class attribute of an element, use the Element.addClass(String className)
and Element.removeClass(String className) methods.
The Elements collection has bulk attribue and class methods. For example, to add a rel="nofollow"
attribute to every a element inside a div:
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
Setting the text content of an element
Element div = document.select("div").first();
div.html("<p>paragraph</p>");
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
Sanitize untrusted HTML (to prevent XSS)
Whitelist allows what are the features that are passed to cleaning and others are discarded.
String unsafe ="<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
String safe = Jsoup.clean(unsafe, Whitelist.basic());
Tidy HTML
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of
whether the HTML is well-formed or not. It handles:
● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
● reliably creating the document structure (html containing a head and body, and only
appropriate elements within the head)
Demo Reference
https://github.com/NexThoughts/JSOUP.git
JSOUP Parsing and Modifying HTML Documents

More Related Content

What's hot (20)

Session 20 - Collections - Maps
Session 20 - Collections - MapsSession 20 - Collections - Maps
Session 20 - Collections - Maps
 
09.Local Database Files and Storage on WP
09.Local Database Files and Storage on WP09.Local Database Files and Storage on WP
09.Local Database Files and Storage on WP
 
Xml processors
Xml processorsXml processors
Xml processors
 
Data handling in python
Data handling in pythonData handling in python
Data handling in python
 
Introductionto xslt
Introductionto xsltIntroductionto xslt
Introductionto xslt
 
Mongo db nosql (1)
Mongo db nosql (1)Mongo db nosql (1)
Mongo db nosql (1)
 
XML - SAX
XML - SAXXML - SAX
XML - SAX
 
DOM-XML
DOM-XMLDOM-XML
DOM-XML
 
XML
XMLXML
XML
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Session 5
Session 5Session 5
Session 5
 
Users as Data
Users as DataUsers as Data
Users as Data
 
Cis166 Final Review C#
Cis166 Final Review C#Cis166 Final Review C#
Cis166 Final Review C#
 
Xml processing in scala
Xml processing in scalaXml processing in scala
Xml processing in scala
 
XSL - XML STYLE SHEET
XSL - XML STYLE SHEETXSL - XML STYLE SHEET
XSL - XML STYLE SHEET
 
OData and SharePoint
OData and SharePointOData and SharePoint
OData and SharePoint
 
Chapter iii(working with data)
Chapter iii(working with data)Chapter iii(working with data)
Chapter iii(working with data)
 
Json – java script object notation
Json – java script object notationJson – java script object notation
Json – java script object notation
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 

Viewers also liked (17)

Jmh
JmhJmh
Jmh
 
RESTEasy
RESTEasyRESTEasy
RESTEasy
 
JFree chart
JFree chartJFree chart
JFree chart
 
Spring Web Flow
Spring Web FlowSpring Web Flow
Spring Web Flow
 
Introduction to es6
Introduction to es6Introduction to es6
Introduction to es6
 
Hamcrest
HamcrestHamcrest
Hamcrest
 
Introduction to gradle
Introduction to gradleIntroduction to gradle
Introduction to gradle
 
Grails with swagger
Grails with swaggerGrails with swagger
Grails with swagger
 
Actors model in gpars
Actors model in gparsActors model in gpars
Actors model in gpars
 
Unit test-using-spock in Grails
Unit test-using-spock in GrailsUnit test-using-spock in Grails
Unit test-using-spock in Grails
 
Reactive java - Reactive Programming + RxJava
Reactive java - Reactive Programming + RxJavaReactive java - Reactive Programming + RxJava
Reactive java - Reactive Programming + RxJava
 
Cosmos DB Service
Cosmos DB ServiceCosmos DB Service
Cosmos DB Service
 
Apache tika
Apache tikaApache tika
Apache tika
 
Progressive Web-App (PWA)
Progressive Web-App (PWA)Progressive Web-App (PWA)
Progressive Web-App (PWA)
 
Java 8 features
Java 8 featuresJava 8 features
Java 8 features
 
Introduction to thymeleaf
Introduction to thymeleafIntroduction to thymeleaf
Introduction to thymeleaf
 
Vertx
VertxVertx
Vertx
 

Similar to JSOUP Parsing and Modifying HTML Documents

Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyInternet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyAyes Chinmay
 
HTML Foundations, pt 2
HTML Foundations, pt 2HTML Foundations, pt 2
HTML Foundations, pt 2Shawn Calvert
 
Xsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlXsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlAMIT VIRAMGAMI
 
Apache Utilities At Work V5
Apache Utilities At Work   V5Apache Utilities At Work   V5
Apache Utilities At Work V5Tom Marrs
 
Document Object Model (DOM)
Document Object Model (DOM)Document Object Model (DOM)
Document Object Model (DOM)GOPAL BASAK
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc librarySudip Sengupta
 
jQuery Rescue Adventure
jQuery Rescue AdventurejQuery Rescue Adventure
jQuery Rescue AdventureAllegient
 
PostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersPostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersAmanda Gilmore
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application FrameworkSimon Willison
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQueryGunjan Kumar
 

Similar to JSOUP Parsing and Modifying HTML Documents (20)

Jsoup tutorial
Jsoup tutorialJsoup tutorial
Jsoup tutorial
 
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyInternet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
 
HTML Foundations, pt 2
HTML Foundations, pt 2HTML Foundations, pt 2
HTML Foundations, pt 2
 
Chapter 18
Chapter 18Chapter 18
Chapter 18
 
Xsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlXsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtml
 
Selenium-Locators
Selenium-LocatorsSelenium-Locators
Selenium-Locators
 
Xml 2
Xml  2 Xml  2
Xml 2
 
2310 b 12
2310 b 122310 b 12
2310 b 12
 
Understanding XML DOM
Understanding XML DOMUnderstanding XML DOM
Understanding XML DOM
 
Apache Utilities At Work V5
Apache Utilities At Work   V5Apache Utilities At Work   V5
Apache Utilities At Work V5
 
Document Object Model (DOM)
Document Object Model (DOM)Document Object Model (DOM)
Document Object Model (DOM)
 
Ajax workshop
Ajax workshopAjax workshop
Ajax workshop
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc library
 
jQuery Rescue Adventure
jQuery Rescue AdventurejQuery Rescue Adventure
jQuery Rescue Adventure
 
Ajax
AjaxAjax
Ajax
 
Dom
Dom Dom
Dom
 
Xml & Java
Xml & JavaXml & Java
Xml & Java
 
PostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersPostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL Superpowers
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application Framework
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 

More from NexThoughts Technologies (20)

Alexa skill
Alexa skillAlexa skill
Alexa skill
 
GraalVM
GraalVMGraalVM
GraalVM
 
Docker & kubernetes
Docker & kubernetesDocker & kubernetes
Docker & kubernetes
 
Apache commons
Apache commonsApache commons
Apache commons
 
HazelCast
HazelCastHazelCast
HazelCast
 
MySQL Pro
MySQL ProMySQL Pro
MySQL Pro
 
Microservice Architecture using Spring Boot with React & Redux
Microservice Architecture using Spring Boot with React & ReduxMicroservice Architecture using Spring Boot with React & Redux
Microservice Architecture using Spring Boot with React & Redux
 
Swagger
SwaggerSwagger
Swagger
 
Solid Principles
Solid PrinciplesSolid Principles
Solid Principles
 
Arango DB
Arango DBArango DB
Arango DB
 
Jython
JythonJython
Jython
 
Introduction to TypeScript
Introduction to TypeScriptIntroduction to TypeScript
Introduction to TypeScript
 
Smart Contract samples
Smart Contract samplesSmart Contract samples
Smart Contract samples
 
My Doc of geth
My Doc of gethMy Doc of geth
My Doc of geth
 
Geth important commands
Geth important commandsGeth important commands
Geth important commands
 
Ethereum genesis
Ethereum genesisEthereum genesis
Ethereum genesis
 
Ethereum
EthereumEthereum
Ethereum
 
Springboot Microservices
Springboot MicroservicesSpringboot Microservices
Springboot Microservices
 
An Introduction to Redux
An Introduction to ReduxAn Introduction to Redux
An Introduction to Redux
 
Google authentication
Google authenticationGoogle authentication
Google authentication
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

JSOUP Parsing and Modifying HTML Documents

  • 2. Overview What is Jsoup Parsing with Url Parsing with File Modify Data Prevent cross site scripting
  • 3. JSOUP jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, ● scrape and parse HTML from a URL, file, or string ● find and extract data, using DOM traversal or CSS selectors ● manipulate the HTML elements, attributes, and text ● clean user-submitted content against a safe white-list, to prevent XSS attacks ● output tidy HTML
  • 4. Parse a document from a url The connect(String url) method creates a new Connection, and get()fetches and parses a HTML file. If an error occurs whilst fetching the URL, it will throw an IOException, which you should handle appropriately. Document document = Jsoup.connect("https://grails.org/").get() String title = document.title() .
  • 5. Continue.. The Connection interface is designed for method chaining to build specific requests: Document doc = Jsoup.connect("http://example.com") .userAgent("Mozilla") .cookie("auth", "token") .timeout(3000) .post();
  • 6. Parse a document from a string You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make sure it's well formed, or to modify it. The String may have come from user input, a file, or from the web. String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc.</p></body></html>"; Document doc = Jsoup.parse(html);
  • 7. Load a document from a file File file = new File("/home/shipra/Downloads/Jsoup.html") Document document = Jsoup.parse(file, "UTF-8") String content = document.getElementById(“content”) String tag = document.getElementByTag(“p”) String class = document.getElementByClass(“green”)
  • 8. Use DOM methods to navigate a document You have a HTML document that you want to extract data from. File file = new File("/home/shipra/Downloads/Jsoup.html") Document document = Jsoup.parse(file, "UTF-8") Elements elements = document.select(".nav-sections li") elements.each { element -> String text = element.select("a").text() String attr = element.select("a").attr("href") }
  • 9. Modify Data Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key, String value). If you need to modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods. The Elements collection has bulk attribue and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div: doc.select("div.comments a").attr("rel", "nofollow"); doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
  • 10. Setting the text content of an element Element div = document.select("div").first(); div.html("<p>paragraph</p>"); div.prepend("<p>First</p>"); div.append("<p>Last</p>");
  • 11. Sanitize untrusted HTML (to prevent XSS) Whitelist allows what are the features that are passed to cleaning and others are discarded. String unsafe ="<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>" String safe = Jsoup.clean(unsafe, Whitelist.basic());
  • 12. Tidy HTML The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles: ● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>) ● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...) ● reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)