SlideShare a Scribd company logo
1 of 12
Download to read offline
Scraping Data from the Web using
Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017
About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
What is Scraping?
● Extract data from the web pages
● Store the data into structured formats
● Data not available directly or via APIs
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File present on the server specifying access limits to bots
Pitfalls in Crawling
● Javascript heavy websites
○ Splash plugin
○ Selenium
● Default settings not too friendly to website
owners
○ Inbuilt Auto throttle extension
● Captchas
Why Yellow Pages?
Email Marketing for Customer Acquisition
Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
nithishr1
@nithishr
nithishr@gmail.com
Connect
Nithish Raghunandanan
www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping

More Related Content

What's hot

Introduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupIntroduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupTushar Mittal
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
Web scraping
Web scrapingWeb scraping
Web scrapingSelecto
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction ServicePromptCloud
 
Skillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSkillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSchool of Data
 
Web scraping & browser automation
Web scraping & browser automationWeb scraping & browser automation
Web scraping & browser automationBHAWESH RAJPAL
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Creating data apps using Streamlit in Python
Creating data apps using Streamlit in PythonCreating data apps using Streamlit in Python
Creating data apps using Streamlit in PythonNithish Raghunandanan
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data ScienceArc & Codementor
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonNowell Strite
 
Full-Stack Development
Full-Stack DevelopmentFull-Stack Development
Full-Stack DevelopmentDhilipsiva DS
 

What's hot (20)

Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Introduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful SoupIntroduction to Web Scraping using Python and Beautiful Soup
Introduction to Web Scraping using Python and Beautiful Soup
 
WEB Scraping.pptx
WEB Scraping.pptxWEB Scraping.pptx
WEB Scraping.pptx
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
Web scraping
Web scrapingWeb scraping
Web scraping
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
 
Skillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSkillshare - Introduction to Data Scraping
Skillshare - Introduction to Data Scraping
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
Web scraping & browser automation
Web scraping & browser automationWeb scraping & browser automation
Web scraping & browser automation
 
Scrapy
ScrapyScrapy
Scrapy
 
Intro to beautiful soup
Intro to beautiful soupIntro to beautiful soup
Intro to beautiful soup
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Web mining
Web miningWeb mining
Web mining
 
Creating data apps using Streamlit in Python
Creating data apps using Streamlit in PythonCreating data apps using Streamlit in Python
Creating data apps using Streamlit in Python
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Full-Stack Development
Full-Stack DevelopmentFull-Stack Development
Full-Stack Development
 

Viewers also liked

Linux Introduction (Commands)
Linux Introduction (Commands)Linux Introduction (Commands)
Linux Introduction (Commands)anandvaidya
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Linux.ppt
Linux.ppt Linux.ppt
Linux.ppt onu9
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (9)

Mr
MrMr
Mr
 
Linux Introduction (Commands)
Linux Introduction (Commands)Linux Introduction (Commands)
Linux Introduction (Commands)
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Linux File System
Linux File SystemLinux File System
Linux File System
 
Linux.ppt
Linux.ppt Linux.ppt
Linux.ppt
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to Tutorial on Web Scraping in Python

Using Web Data for Finance
Using Web Data for FinanceUsing Web Data for Finance
Using Web Data for FinanceScrapinghub
 
Python in Industry
Python in IndustryPython in Industry
Python in IndustryDharmit Shah
 
Django on app engine
Django on app engineDjango on app engine
Django on app enginebenpotato
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with PythonBenjamin Bengfort
 
Getting started with Scrapy in Python
Getting started with Scrapy in PythonGetting started with Scrapy in Python
Getting started with Scrapy in PythonViren Rajput
 
Computer Science Career Guidance
Computer Science Career GuidanceComputer Science Career Guidance
Computer Science Career GuidanceDeepak Sood
 
Glowing bear
Glowing bear Glowing bear
Glowing bear thehyve
 
Recommender Hackathon @plista 2013/04
Recommender Hackathon @plista 2013/04Recommender Hackathon @plista 2013/04
Recommender Hackathon @plista 2013/04Torben Brodt
 
Dynatech presentation for TSI Career Day
Dynatech presentation for TSI Career DayDynatech presentation for TSI Career Day
Dynatech presentation for TSI Career DayArtur Babyuk
 
Curtain call of zooey - what i've learned in yahoo
Curtain call of zooey - what i've learned in yahooCurtain call of zooey - what i've learned in yahoo
Curtain call of zooey - what i've learned in yahoo羽祈 張
 
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ..."Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...Dataconomy Media
 
Security .NET.pdf
Security .NET.pdfSecurity .NET.pdf
Security .NET.pdfAbhi Jain
 
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...Prashant G Bhoyar (Microsoft MVP)
 
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryMarcus Hanwell
 
Introduction to Annotation, Content Search, and IIIF Authentication from the ...
Introduction to Annotation, Content Search, and IIIF Authentication from the ...Introduction to Annotation, Content Search, and IIIF Authentication from the ...
Introduction to Annotation, Content Search, and IIIF Authentication from the ...Glen Robson
 

Similar to Tutorial on Web Scraping in Python (20)

Life of a data engineer
Life of a data engineerLife of a data engineer
Life of a data engineer
 
Using Web Data for Finance
Using Web Data for FinanceUsing Web Data for Finance
Using Web Data for Finance
 
Python in Industry
Python in IndustryPython in Industry
Python in Industry
 
Data science at OLX
Data science at OLXData science at OLX
Data science at OLX
 
Django on app engine
Django on app engineDjango on app engine
Django on app engine
 
R vs Python vs SAS
R vs Python vs SASR vs Python vs SAS
R vs Python vs SAS
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 
Getting started with Scrapy in Python
Getting started with Scrapy in PythonGetting started with Scrapy in Python
Getting started with Scrapy in Python
 
Computer Science Career Guidance
Computer Science Career GuidanceComputer Science Career Guidance
Computer Science Career Guidance
 
Web mining
Web miningWeb mining
Web mining
 
Glowing bear
Glowing bear Glowing bear
Glowing bear
 
Recommender Hackathon @plista 2013/04
Recommender Hackathon @plista 2013/04Recommender Hackathon @plista 2013/04
Recommender Hackathon @plista 2013/04
 
Dynatech presentation for TSI Career Day
Dynatech presentation for TSI Career DayDynatech presentation for TSI Career Day
Dynatech presentation for TSI Career Day
 
Curtain call of zooey - what i've learned in yahoo
Curtain call of zooey - what i've learned in yahooCurtain call of zooey - what i've learned in yahoo
Curtain call of zooey - what i've learned in yahoo
 
Application Presentation
Application PresentationApplication Presentation
Application Presentation
 
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ..."Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
 
Security .NET.pdf
Security .NET.pdfSecurity .NET.pdf
Security .NET.pdf
 
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...
Introduction to AI and Cognitive Services For Microsoft 365 Developers and In...
 
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
 
Introduction to Annotation, Content Search, and IIIF Authentication from the ...
Introduction to Annotation, Content Search, and IIIF Authentication from the ...Introduction to Annotation, Content Search, and IIIF Authentication from the ...
Introduction to Annotation, Content Search, and IIIF Authentication from the ...
 

More from Nithish Raghunandanan

More from Nithish Raghunandanan (7)

Select ML from Databases.pdf
Select ML from Databases.pdfSelect ML from Databases.pdf
Select ML from Databases.pdf
 
Select ML from Databases
Select ML from DatabasesSelect ML from Databases
Select ML from Databases
 
Virtual tourism in covid times
Virtual tourism in covid timesVirtual tourism in covid times
Virtual tourism in covid times
 
Learnings from Organizing Internal Hackathons
Learnings from Organizing Internal HackathonsLearnings from Organizing Internal Hackathons
Learnings from Organizing Internal Hackathons
 
Learnings from Organizing an Internal Hackathon
Learnings from Organizing an Internal HackathonLearnings from Organizing an Internal Hackathon
Learnings from Organizing an Internal Hackathon
 
Pecha kucha Talk on web scraping
Pecha kucha Talk on web scrapingPecha kucha Talk on web scraping
Pecha kucha Talk on web scraping
 
Hodor: Solving Everyday Problems with Tech
Hodor: Solving Everyday Problems with TechHodor: Solving Everyday Problems with Tech
Hodor: Solving Everyday Problems with Tech
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Tutorial on Web Scraping in Python

  • 1. Scraping Data from the Web using Scrapy & Beautiful Soup Nithish Raghunandanan nithishr@gmail.com PyData Munich | 8th November 2017
  • 2. About Me ● MSc. Informatics Student at the Technical University of Munich ○ Focus on Data Science & Software Engineering ● Student Employee at KI labs, part of KI Group ● Love to play with different technologies ● Connect ■ nithishr1 @nithishr
  • 3. What is Scraping? ● Extract data from the web pages ● Store the data into structured formats ● Data not available directly or via APIs
  • 5. Tools for Scraping ● Scrapy ○ Python framework to extract data from web pages ● Beautiful Soup ○ Python library to parse HTML/XML documents ● Alternatives ○ Selenium ○ Requests ○ Octoparse
  • 6.
  • 7. Scraping 101 ● Spider ○ A bot that downloads web pages ● robots.txt ○ File present on the server specifying access limits to bots
  • 8. Pitfalls in Crawling ● Javascript heavy websites ○ Splash plugin ○ Selenium ● Default settings not too friendly to website owners ○ Inbuilt Auto throttle extension ● Captchas
  • 9. Why Yellow Pages? Email Marketing for Customer Acquisition
  • 10. Email Marketing for Customer Acquisition Initial Approach ● Buy Email Lists ● Send via 3rd Parties ● Poor Quality ○ Non transparent ○ Generic emails ● Expensive Crawling ● Scrapy + Beautiful Soup ● Over 500k Emails ● Quality Improvement ○ Categorized into segments ○ Targeted emails ● Cheap
  • 12. Resources ● Scrapy Guide ○ https://doc.scrapy.org/en/latest/intro/tutorial.html ● Beautiful Soup Guide ○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/ ● Crawling Etiquette ○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/ ● Code ○ https://github.com/nithishr/meetup_scraping