Magic Cap | OK, but this surely can be done more elegantly? Yes! | Next steps
Big Data and Automated Content Analysis
Week 6 – Monday
»Web scraping«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
13 March 2017
Big Data and Automated Content Analysis Damian Trilling
Today
1 We put on our magic cap, pretend we are Firefox, scrape all
comments from GeenStijl, clean up the mess, and put the
comments in a neat CSV table
2 OK, but this surely can be done more elegantly? Yes!
3 Next steps
We put on our magic cap, pretend we are Firefox, scrape all
comments from GeenStijl, clean up the mess, and put the
comments in a neat CSV table
Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
Operation Magic Cap
1 Download the page
• They might block us, so let’s do as if we were a web browser!
2 Remove all line breaks (\n, but maybe also \r\n or \r) and
TABs (\t): We want one long string
3 Isolate the comment section (it started with <div
class="commentlist"> and ended with </div>)
4 Within the comment section, identify each comment
(<article>)
5 Within each comment, separate the text (<p>) from the
metadata (<footer>)
6 Put text and metadata in lists and save them to a csv file
And how can we achieve this?
from urllib import request
import re
import csv

onlycommentslist = []
metalist = []

# Step 1: download the page, pretending to be a web browser via the User-Agent header
req = request.Request("http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html", headers={"User-Agent": "Mozilla/5.0"})
tekst = request.urlopen(req).read()
# Step 2: decode and remove line breaks and TABs, so we get one long string
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")

# Step 3: isolate the comment section
commentsection = re.findall(r'<div class="commentlist">.*?</div>', tekst)
print(commentsection)
# Step 4: identify each comment
comments = re.findall(r'<article.*?>(.*?)</article>', commentsection[0])
print(comments)
print("There are", len(comments), "comments")
# Step 5: separate the text from the metadata
for co in comments:
    metalist.append(re.findall(r'<footer>(.*?)</footer>', co))
    onlycommentslist.append(re.findall(r'<p>(.*?)</p>', co))
# Step 6: save text and metadata to a csv file
writer = csv.writer(open("geenstijlcomments.csv", mode="w", encoding="utf-8"))
output = zip(onlycommentslist, metalist)
writer.writerows(output)
Some remarks
The regexp
• .*? instead of .* means lazy matching. Because .* is greedy
and matches as much as possible, the point where the regexp
should stop would be overrun – we would get the whole rest
of the document (or of the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only
returns what’s between them and not the surrounding stuff
(like <footer> and </footer>)
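The lazy-vs-greedy difference is easy to see on a tiny made-up string (the variable s below is invented for illustration, not taken from the GeenStijl page):

```python
import re

# Hypothetical string containing two footer tags
s = '<footer>one</footer> text <footer>two</footer>'

# Greedy: .* runs on to the LAST </footer>, swallowing everything in between
print(re.findall(r'<footer>(.*)</footer>', s))   # ['one</footer> text <footer>two']

# Lazy: .*? stops at the FIRST possible </footer>, so we get each footer separately
print(re.findall(r'<footer>(.*?)</footer>', s))  # ['one', 'two']
```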
Optimization
• Only save the 0th (and only) element of the list
• Separate the username and interpret date and time
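A minimal sketch of these two optimization steps, with made-up list contents (the real footer format on GeenStijl may differ; the "user | dd-mm-yy | hh:mm" layout is an assumption):

```python
from datetime import datetime

# Hypothetical results of re.findall(): each element is itself a list
onlycommentslist = [['Nice post!'], ['Tsja.']]
metalist = [['Piet | 13-05-14 | 10:11'], ['Klaas | 14-05-14 | 09:30']]

# Only save the 0th (and only) element of each list
texts = [c[0] for c in onlycommentslist]
metas = [m[0] for m in metalist]

# Separate the username and interpret date and time
# (assuming a "user | dd-mm-yy | hh:mm" format – an assumption!)
for meta in metas:
    user, date, time = [part.strip() for part in meta.split('|')]
    timestamp = datetime.strptime(date + ' ' + time, '%d-%m-%y %H:%M')
    print(user, timestamp)
```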
Further reading
Doing this with other sites?
• It’s basically puzzling with regular expressions.
• Look at the source code of the website to see how
well-structured it is.
OK, but this surely can be done more elegantly? Yes!
Scraping
GeenStijl example
• Worked well (and we could do it with the knowledge we
already had)
• But we can also use existing parsers (that can interpret the
structure of the html page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/
chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It
uses the module lxml
What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a
minute what this is)
Let’s install a useful add-on: the Firefox XPath Checker
. . . and after that, inspect the website we’re interested in:
Playing around with the Firefox XPath Checker
Some things to play around with:
• // means ‘arbitrary depth’ (=may be nested in many higher
levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are
all paragraphs)
• If you want to refer to a specific attribute of an HTML tag, you
can use @. For example,
//*[@id="reviews-container"] would grab a tag like <div
id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code of the web page to think of
other possible XPATHs!
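These building blocks can be tried out offline with lxml on a small HTML fragment (the fragment below is invented for illustration):

```python
from lxml import html

# Invented fragment, just to exercise //, [n], @ and text()
doc = html.fromstring(
    '<div id="reviews-container" class="user-content">'
    '<p>first</p><p>second</p>'
    '</div>')

print(doc.xpath('//p/text()'))     # all paragraphs: ['first', 'second']
print(doc.xpath('//p[2]/text()'))  # the second paragraph: ['second']
print(doc.xpath('//*[@id="reviews-container"]//text()'))  # ['first', 'second']
```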
The XPATH
You get something like
//*[@id="tabbedReviewsDiv"]/dl[1]/dd
//*[@id="tabbedReviewsDiv"]/dl[2]/dd
The * means “every”.
Also, to get the text of the element, the XPATH should end on
/text().
We can infer that we (probably) get all comments with
//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()
Let’s scrape them!
from lxml import html
from urllib import request

# Download and decode the page, then let lxml parse it into a tree
req = request.Request("http://www.kieskeurig.nl/smartphone/product/2334455-apple-iphone-6/reviews")
tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8", errors="ignore"))
# Extract the review texts with an XPATH expression
reviews = tree.xpath('//*[@class="reviews-single__text"]/text()')

# remove empty reviews and remove leading/trailing whitespace
reviews = [r.strip() for r in reviews if r.strip() != ""]

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
for i, review in enumerate(reviews):
    print("Review", i, ":", review[:60])
The output – perfect!
63 reviews scraped. Showing the first 60 characters of each:
Review 0 : Apple maakt mooie toestellen met hard- en software uitsteken
Review 1 : Vanaf de iPhone 4 ben ik erg te spreken over de intuitieve i
Review 2 : Helaas ontbreekt het Apple toestellen aan een noodzakelijk i
Review 3 : Met een enorme mate van pech hebben wij als (beschaafd!!) ge
Big Data and Automated Content Analysis Damian Trilling
Recap
General idea
1 Identify each element by its XPATH (look it up in your
browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a
module like lxml)
4 Do something with the list (preprocess, analyze, save)
Alternatives: scrapy, beautifulsoup, regular expressions, . . .
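To illustrate the beautifulsoup alternative, a sketch of the same general idea on an invented inline mini-page (the page string below is made up, modeled on the GeenStijl structure):

```python
from bs4 import BeautifulSoup

# Invented mini-page with the same structure as the GeenStijl example
page = ('<div class="commentlist">'
        '<article><p>Hello</p><footer>userA</footer></article>'
        '<article><p>World</p><footer>userB</footer></article>'
        '</div>')

soup = BeautifulSoup(page, 'html.parser')  # the parser builds a tree for us
texts = [a.find('p').get_text() for a in soup.find_all('article')]
metas = [a.find('footer').get_text() for a in soup.find_all('article')]
print(texts)  # ['Hello', 'World']
print(metas)  # ['userA', 'userB']
```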
Last remarks
There is often more than one way to specify an XPATH
1 You can usually leave out the namespace (the x:)
2 Sometimes you might want to use a different suggestion to
be able to generalize better (e.g., using the attributes rather
than the tags)
3 In that case, it makes sense to look deeper into the structure
of the HTML code, for example with “Inspect Element”, and
use that information to play around in the XPath Checker
Next steps
From now on. . .
. . . focus on individual projects!
Wednesday
• Write a scraper for a website of your choice!
• Prepare a bit, so that you know where to start
• Suggestions: Review texts and grades from http://iens.nl,
articles from a news site of your choice, . . .
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 

Recently uploaded (20)

Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 

BDACA1617s2 - Lecture6

  • 1. Magic Cap | OK, but this surely can be done more elegantly? Yes! | Next steps
Big Data and Automated Content Analysis, Week 6 – Monday: »Web scraping«
Damian Trilling (d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net)
Afdeling Communicatiewetenschap, Universiteit van Amsterdam
13 March 2017
  • 2. Today
1 We put on our magic cap, pretend we are Firefox, scrape all comments from GeenStijl, clean up the mess, and put the comments in a neat CSV table
2 OK, but this surely can be done more elegantly? Yes!
3 Next steps
  • 3. We put on our magic cap, pretend we are Firefox, scrape all comments from GeenStijl, clean up the mess, and put the comments in a neat CSV table
  • 4.
  • 5.
  • 6.
  • 7. Let’s make a plan!
Which elements from the page do we need? • What do they mean? • How are they represented in the source code?
What should our output look like? • What lists do we want? • . . .
And how can we achieve this?
  • 8.–16. Operation Magic Cap
1 Download the page • They might block us, so let’s do as if we were a web browser!
2 Remove all line breaks (\n, but maybe also \r\n or \r) and TABs (\t): we want one long string
3 Isolate the comment section (it starts with <div class="commentlist"> and ends with </div>)
4 Within the comment section, identify each comment (<article>)
5 Within each comment, separate the text (<p>) from the metadata (<footer>)
6 Put text and metadata in lists and save them to a CSV file
And how can we achieve this?
  • 17. from urllib import request
import re
import csv

onlycommentslist = []
metalist = []

# pretend to be a browser, so we don't get blocked
req = request.Request("http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html",
                      headers={"User-Agent": "Mozilla/5.0"})
tekst = request.urlopen(req).read()
# one long string: remove line breaks and tabs
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")

commentsection = re.findall(r'<div class="commentlist">.*?</div>', tekst)
print(commentsection)
comments = re.findall(r'<article.*?>(.*?)</article>', commentsection[0])
print(comments)
print("There are", len(comments), "comments")
for co in comments:
    metalist.append(re.findall(r'<footer>(.*?)</footer>', co))
    onlycommentslist.append(re.findall(r'<p>(.*?)</p>', co))
with open("geenstijlcomments.csv", mode="w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(zip(onlycommentslist, metalist))
  • 18.
  • 19.–21. Some remarks
The regexp
• .*? instead of .* means lazy matching. As .* matches everything, the part where the regexp should stop would not be respected (greedy matching): we would get the whole rest of the document (or the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only returns what’s between them and not the surrounding stuff (like <footer> and </footer>)
Optimization
• Only save the 0th (and only) element of the list
• Separate the username and interpret date and time
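A minimal illustration of the greedy/lazy difference (the HTML snippet is invented for demonstration):

```python
import re

snippet = "<p>first</p> filler <p>second</p>"

# greedy: .* runs on to the LAST </p>, swallowing everything in between
greedy = re.findall(r"<p>(.*)</p>", snippet)
# lazy: .*? stops at the FIRST </p>, so each paragraph is matched separately
lazy = re.findall(r"<p>(.*?)</p>", snippet)

print(greedy)  # ['first</p> filler <p>second']
print(lazy)    # ['first', 'second']
```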
  • 22. Further reading
Doing this with other sites? • It’s basically puzzling with regular expressions. • Look at the source code of the website to see how well-structured it is.
  • 23. OK, but this surely can be done more elegantly? Yes!
  • 24.–27. Scraping: the GeenStijl example
• Worked well (and we could do it with the knowledge we already had)
• But we can also use existing parsers (that can interpret the structure of the HTML page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It uses the module lxml
  • 28. What do we need? • the URL (of course) • the XPATH of the element we want to scrape (you’ll see in a minute what this is)
  • 29. Let’s install a useful add-on: the Firefox XPath Checker . . . and after that, inspect the website we’re interested in:
  • 30.
  • 31.
  • 32.–38. Playing around with the Firefox XPath Checker
Some things to play around with:
• // means ‘arbitrary depth’ (= may be nested in many higher levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are all paragraphs)
• If you want to refer to a specific attribute of an HTML tag, you can use @. For example, *[@id="reviews-container"] would grab a tag like <div id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code of the web page to think of other possible XPATHs!
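These building blocks can also be tried out directly in lxml, without a browser add-on (the HTML snippet here is invented for demonstration):

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="reviews-container" class="user-content">
    <p>first paragraph</p>
    <p>second paragraph</p>
  </div>
</body></html>
""")

# // = arbitrary depth, * = any tag, @ = attribute, /text() = text content
second = page.xpath('//*[@id="reviews-container"]/p[2]/text()')
all_text = page.xpath('//div[@id="reviews-container"]/p/text()')

print(second)    # ['second paragraph']
print(all_text)  # ['first paragraph', 'second paragraph']
```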
  • 39.–41. The XPATH
You get something like:
//*[@id="tabbedReviewsDiv"]/dl[1]/dd
//*[@id="tabbedReviewsDiv"]/dl[2]/dd
The * means “every”. Also, to get the text of the element, the XPATH should end on /text(). We can infer that we (probably) get all comments with
//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()
  • 42. Let’s scrape them!
from lxml import html
from urllib import request

req = request.Request("http://www.kieskeurig.nl/smartphone/product/2334455-apple-iphone-6/reviews")
tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8", errors="ignore"))
reviews = tree.xpath('//*[@class="reviews-single__text"]/text()')

# remove empty reviews and remove leading/trailing whitespace
reviews = [r.strip() for r in reviews if r.strip() != ""]

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
for i, review in enumerate(reviews):
    print("Review", i, ":", review[:60])
  • 43. The output – perfect!
63 reviews scraped. Showing the first 60 characters of each:
Review 0 : Apple maakt mooie toestellen met hard- en software uitsteken
Review 1 : Vanaf de iPhone 4 ben ik erg te spreken over de intuitieve i
Review 2 : Helaas ontbreekt het Apple toestellen aan een noodzakelijk i
Review 3 : Met een enorme mate van pech hebben wij als (beschaafd!!) ge
  • 44. Recap
General idea
1 Identify each element by its XPATH (look it up in your browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a module like lxml)
4 Do something with the list (preprocess, analyze, save)
Alternatives: scrapy, beautifulsoup, regular expressions, . . .
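The same general idea in beautifulsoup, one of the alternatives just mentioned. The snippet below parses a hard-coded, invented comment section so it runs without network access; on a real site you would feed it the downloaded page instead:

```python
from bs4 import BeautifulSoup

# invented stand-in for a downloaded comment section
page = """
<div class="commentlist">
  <article><p>Great phone!</p><footer>alice | 1 May 2017</footer></article>
  <article><p>Battery could be better.</p><footer>bob | 2 May 2017</footer></article>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
# the parser handles the HTML structure, so no regular expressions are needed
comments = [a.p.get_text() for a in soup.find_all("article")]
meta = [a.footer.get_text() for a in soup.find_all("article")]

print(comments)  # ['Great phone!', 'Battery could be better.']
print(meta)      # ['alice | 1 May 2017', 'bob | 2 May 2017']
```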
  • 45. Last remarks
There is often more than one way to specify an XPATH
1 You can usually leave away the namespace (the x:)
2 Sometimes, you might want to use a different suggestion to be able to generalize better (e.g., using the attributes rather than the tags)
3 In that case, it makes sense to look deeper into the structure of the HTML code, for example with “Inspect Element”, and use that information to play around with in the XPath Checker
  • 46.
  • 47. Next steps
  • 48. From now on . . . focus on individual projects!
  • 49. Wednesday
• Write a scraper for a website of your choice!
• Prepare a bit, so that you know where to start
• Suggestions: review texts and grades from http://iens.nl, articles from a news site of your choice, . . .