Tutorial on Web Scraping in Python

Tutorial on scraping data from the web with Python using Scrapy and Beautiful Soup, presented at PyData Munich, held at Burda Bootcamp.

Scraping Data from the Web using
Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017

About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr

What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
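
The two steps above (extract, then store in a structured format) can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the talk: the HTML fragment, the `business` class name, and the helper names `extract_names` and `to_csv` are all assumptions made up for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; real pages will have a different structure.
SAMPLE_HTML = """
<ul>
  <li class="business">Alpha Bakery</li>
  <li class="business">Beta Plumbing</li>
  <li class="ad">Sponsored link</li>
</ul>
"""


class BusinessNameParser(HTMLParser):
    """Collects the text of every <li class="business"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "business":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.names.append(data.strip())


def extract_names(html):
    """Step 1: extract the data from the page."""
    parser = BusinessNameParser()
    parser.feed(html)
    return parser.names


def to_csv(names):
    """Step 2: store the data in a structured format (CSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name"])
    for name in names:
        writer.writerow([name])
    return buffer.getvalue()


names = extract_names(SAMPLE_HTML)
csv_text = to_csv(names)
```

In practice a dedicated parsing library is usually more convenient than a hand-written `HTMLParser` subclass.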

Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
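
As a rough illustration of what parsing an HTML document with Beautiful Soup looks like, here is a small sketch. The markup, the class names `listing`, `name`, and `website`, and the function `parse_listings` are hypothetical and only serve the example. It uses the built-in `html.parser` backend, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, invented for illustration.
SAMPLE_HTML = """
<div class="listing">
  <h2 class="name">Alpha Bakery</h2>
  <a class="website" href="https://alpha.example">Website</a>
</div>
<div class="listing">
  <h2 class="name">Beta Plumbing</h2>
  <a class="website" href="https://beta.example">Website</a>
</div>
"""


def parse_listings(html):
    """Return one dict per listing block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for listing in soup.find_all("div", class_="listing"):
        name = listing.find("h2", class_="name")
        link = listing.find("a", class_="website")
        records.append({
            # Guard against missing elements instead of raising.
            "name": name.get_text(strip=True) if name else None,
            "website": link["href"] if link else None,
        })
    return records


listings = parse_listings(SAMPLE_HTML)
```

In a Scrapy project, a function like this could be called from a spider's callback with the response body, which is one way the two tools can be combined.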

Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File present on the server specifying access limits to bots
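
To make the role of robots.txt concrete, the sketch below checks URLs against a set of rules using `urllib.robotparser` from the standard library. The robots.txt content, the bot name `example-bot`, and the URLs are assumptions for illustration. A real crawler would fetch the file from the target site, for example with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given bot may download a given URL.
can_fetch_public = rules.can_fetch("example-bot", "https://example.com/listings/1")
can_fetch_private = rules.can_fetch("example-bot", "https://example.com/private/data")

# Some sites also request a minimum delay between requests.
delay = rules.crawl_delay("example-bot")
```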

Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings are not too friendly to website owners
○ Built-in AutoThrottle extension
● CAPTCHAs
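
One possible way to make a Scrapy crawl friendlier to website owners is sketched below. The setting names are real Scrapy settings, but the values, the contact URL in the user agent string, and the name `POLITE_SETTINGS` are assumptions chosen for illustration, not recommendations from the talk. The dict could be placed in a project's `settings.py` as individual assignments or used as a spider's `custom_settings`.

```python
# Illustrative values only; tune them for the site being crawled.
POLITE_SETTINGS = {
    # Respect the access limits published in robots.txt.
    "ROBOTSTXT_OBEY": True,
    # Identify the bot and give site owners a way to reach you (hypothetical URL).
    "USER_AGENT": "example-bot (+https://example.com/contact)",
    # Base delay in seconds between requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    # Limit parallel requests to any single domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    # AutoThrottle adapts the delay to the server's response times;
    # it is disabled unless switched on explicitly.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
```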

Why Yellow Pages?
Email Marketing for Customer Acquisition

Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non-transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap

Connect
Nithish Raghunandanan
nithishr@gmail.com
nithishr1
@nithishr
www.ki-labs.com

Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping
