1. Scraping Data from the Web using Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017
2. About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
3. What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
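As a rough illustration of the extract-then-store idea above, here is a minimal sketch that uses only the Python standard library. The HTML snippet, the `talk` class name and the `TalkExtractor` helper are hypothetical and stand in for a real downloaded page.

```python
import json
from html.parser import HTMLParser

# Hypothetical page snippet, standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="talk">Scraping with Scrapy</li>
  <li class="talk">Parsing with Beautiful Soup</li>
  <li class="note">Lunch break</li>
</ul>
"""


class TalkExtractor(HTMLParser):
    """Collect the text of every <li class="talk"> element."""

    def __init__(self):
        super().__init__()
        self.in_talk = False
        self.talks = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text only inside <li class="talk">.
        if tag == "li" and dict(attrs).get("class") == "talk":
            self.in_talk = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_talk = False

    def handle_data(self, data):
        if self.in_talk and data.strip():
            self.talks.append(data.strip())


# Extract: pull the interesting text out of the markup.
parser = TalkExtractor()
parser.feed(SAMPLE_HTML)

# Store: turn the extracted values into a structured format (JSON here).
records = [{"title": title} for title in parser.talks]
structured = json.dumps(records, indent=2)
print(structured)
```

In practice a library such as Beautiful Soup or Scrapy would replace the hand-written parser; the standard library is used here only to keep the sketch dependency-free.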
5. Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
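To give a feel for what parsing with Beautiful Soup looks like, here is a small sketch. It assumes the `beautifulsoup4` package is installed; the HTML snippet and the link paths are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a page fetched with Requests or Scrapy.
SAMPLE_HTML = """
<html><body>
  <h1>PyData Munich</h1>
  <a href="/talks/1">Talk one</a>
  <a href="/talks/2">Talk two</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pull out the heading text and every (link text, href) pair.
heading = soup.find("h1").get_text(strip=True)
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```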
7. Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File on the server that specifies access limits for bots
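A spider can check these access limits before downloading a page. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `example-bot` user agent and the example.com URLs are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it
# from the site root (for example with set_url() and read()).
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Ask whether a given bot may fetch a given URL.
can_fetch_public = rules.can_fetch("example-bot", "https://example.com/talks/")
can_fetch_private = rules.can_fetch("example-bot", "https://example.com/private/data")

# Crawl-delay is the pause (in seconds) the site asks for between requests.
delay = rules.crawl_delay("example-bot")

print(can_fetch_public, can_fetch_private, delay)
```

Scrapy can perform this check itself when its `ROBOTSTXT_OBEY` setting is enabled.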
8. Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings are not very friendly to website owners
○ Built-in AutoThrottle extension
● Captchas
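One way to make a Scrapy crawl gentler on a site is to adjust the project's `settings.py`. The fragment below is a sketch: the setting names are standard Scrapy settings, but the values are illustrative assumptions, not recommendations from the talk.

```python
# settings.py fragment for a Scrapy project (values are illustrative).

# Respect the access limits published in the site's robots.txt.
ROBOTSTXT_OBEY = True

# Baseline politeness: wait between requests and limit parallelism per domain.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# AutoThrottle adapts the delay to the server's observed response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the minimum delay that the extension will not go below.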