Slides from a workshop presentation on Data Science and Big Data, given to academic social scientists. They contain many links to sources and should also interest readers outside the original target field.
1. Data Science for Social Scientists
With Dr Ian Hopkinson
LJMU 2014-09-12
ian@scraperwiki.com
2. Aims
●Explain “Data Science” and “Big Data”
●Show some tools
●Show some examples
●Take home
●New methodology
●Plan a project
Data Science for Social Scientists
With Dr Ian Hopkinson from ScraperWiki
3. Overview
●Introductions ~15 minutes
●What is Data Science? ~40 minutes
●What is Big Data? ~20 minutes
●Coffee Break ~15 minutes
●Case Studies~90 minutes
●Discussion~30 minutes
4. Background
●Background in physics
●Computer fiddler for many years
●Academic at Cambridge and UMIST 8 years
●Unilever PLC (large FMCG) for 8 years
●Lots of experience with all sorts of data
5. Mindset
●What can I do with this data?
●Is there a plot I can do with other data?
●Can I make a map?
●Just how many bottles are there in the Science Museum collection?
●How do I automate this?
6. How do you work?
7. What is Data Science?
8. Data science
9. Classification of data analysts
Source: Enterprise Data Analysis and Visualization: An Interview Study
10. Tools
●Excel – analysis, processing, visualisation
●Tableau (Public) – better visualisation
●Python – a programming language
●Gephi – network visualisation
●R – statistics
●Databases (SQLite, MySQL, PostgreSQL)
●…
11. Data Science – what does it look like?
Workflow: Discover → Wrangle → Profile → Model → Report
12. Wrangling – joining and cleaning data
●Manual
●Dictionary
●Substitutions
●Geocoding
●NewsReader
●Natural language processing
●Face recognition
●…
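As an illustration of the "Dictionary" and "Substitutions" approaches to wrangling, here is a minimal Python sketch. The lookup table, the example names and the helper functions are all invented for this illustration and are not taken from the workshop; a real project would build the dictionary from the variants actually found in its data.

```python
import re

# Hypothetical lookup table: known spelling variants -> one canonical name.
CANONICAL = {
    "ljmu": "Liverpool John Moores University",
    "liverpool john moores": "Liverpool John Moores University",
    "umist": "University of Manchester Institute of Science and Technology",
}


def normalise(raw):
    """Substitution step: lower-case, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", raw.lower())
    return re.sub(r"\s+", " ", text).strip()


def clean_name(raw):
    """Dictionary step: return the canonical name if the variant is known."""
    return CANONICAL.get(normalise(raw), raw.strip())


messy = ["  LJMU ", "Liverpool John Moores.", "U.M.I.S.T.", "Unknown College"]
cleaned = [clean_name(name) for name in messy]
for before, after in zip(messy, cleaned):
    print(repr(before), "->", after)
```

Names that are not in the dictionary pass through unchanged, which makes it easy to list them afterwards and decide whether to extend the dictionary or fix them manually.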
13. Data Science in the wild…
●Google Flu Trends, credit scoring, recommendation systems, fleet management, search engines, price comparisons, Google Translate…
●Commercially: all about prediction
14. Data Science
●Statistics
●Machine Learning
●Natural Language Processing
●Computer Vision
●Data visualisation
15. Machine Learning
●Classification
●“Supervised” – training set
●“Unsupervised” – no training set
●Algorithms
●Naïve Bayesian, Logistic regression, Support vector machines, decision trees…
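To make "supervised" classification concrete, the following is a toy naïve Bayes text classifier in plain Python. It is a sketch under stated assumptions, not the method used in the workshop: the training texts, the labels "business" and "personal", and the function names are all invented, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from a training set of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the prior: how common the label is in the training set.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not give log(0).
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training set: each text has been labelled by hand.
training = [
    ("buy our products online shop", "business"),
    ("contact sales for pricing and products", "business"),
    ("my personal blog about cycling", "personal"),
    ("photos from my holiday blog", "personal"),
]
word_counts, label_counts = train(training)
print(predict("online shop pricing", word_counts, label_counts))
print(predict("my cycling blog", word_counts, label_counts))
```

The hand-labelled `training` list is what makes this supervised: without those labels the algorithm has nothing to learn from.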
16. Machine Learning
Source: scikit-learn, Supervised learning
17. Classifier evaluation
Source: Understanding uncertainty, visualising probabilities
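Confusion matrix cells: TP (true positive), FN (false negative), FP (false positive), TN (true negative)
Precision = TP/(TP + FP) = 9/(9 + 99) = 8.3%
Recall = TP/(TP + FN) = 9/(9 + 1) = 90%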
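The two percentages on this slide can be reproduced with a few lines of Python. The counts (9 true positives, 1 false negative, 99 false positives) are read off the slide's arithmetic; the names "precision" and "recall" are the standard terms for these two ratios and are supplied here as an assumption, since the slide text shows only the numbers.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the classifier finds."""
    return tp / (tp + fn)


tp, fn, fp = 9, 1, 99  # counts taken from the slide's arithmetic

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.1%}")
print(f"recall    = {r:.1%}")
```

The gap between the two numbers is the point of the slide: a classifier can find 90% of the real cases and still be wrong for most of the cases it flags.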
18. Natural Language Processing
●Parts of speech
●Named entity recognition
●Sentiment analysis
●Summarisation
●Document search
19. Computer vision
●Face / object recognition
●Image forensics
●Camera pose
20. What is Big Data?
21. Big Data
●Velocity, value, variability…
Source: Doug Laney at Gartner
●N = all, messy, correlation not causation
Source: Big Data by Mayer-Schönberger and Cukier
●Thirty other definitions…
Source: What is Big Data?
22. Source: Analyzing the Analyzers
23. Big Data – my view
●Lots of data, collected almost casually
●Cloud storage and processing
●Computing frameworks
●Data mining algorithms, visualisation
24. Hadoop Ecosystem
Source: ADASTRA
25. Big Data is like teenage sex…
●Everyone talks about it,
●nobody really knows how to do it,
●everyone thinks everyone else is doing it,
●so everyone claims they are doing it too.
26. What is your data?
27. Case Studies
28. Case Studies
●BIG Lottery
●MOT data
●NewsReader
●Inspiring Women
●London Underground
●Machine learning
●Face Recognition
●UN democracy
●Google Ngram
29. Google Ngram
●How does the popularity of scientists vary over time?
●Ngrams (frequency of word combinations)
●6 million digitized books
●Big Data in the wild
Source: Google Ngram, the data
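The idea of an n-gram as the "frequency of word combinations" can be sketched in a few lines of Python. This is only a toy: the sample sentence and function names are invented, and the real Google Ngram data is computed over millions of books with its own tokenisation and normalisation, which this sketch does not attempt to reproduce.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(text, n):
    """Relative frequency of each n-gram among all n-grams in the text."""
    grams = ngrams(text, n)
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


# Invented sample text, standing in for a corpus of digitised books.
sample = "lord kelvin met isaac newton and lord kelvin left"
bigram_freq = ngram_frequencies(sample, 2)
print(bigram_freq["lord kelvin"])
```

Tracking how such a frequency changes from year to year is what produces the popularity curves shown on the slides that follow.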
30. William Thomson aka Lord Kelvin
31. An assortment
32. Isaac Newton and Albert Einstein
33. Google Ngram – lessons
●Huge dataset drives the site (1TB for just two corpuses)
●No book metadata
34. BIG Lottery
●Where is lottery money allocated?
●BIG Lottery data
●ONS population data
●Sport data
●Fun with natural language processing
Visualisation and blog: BIG Lottery
35. BIG Lottery
Visualisation and blog: BIG Lottery
36. BIG Lottery
Visualisation and blog: BIG Lottery
37. BIG Lottery
Visualisation and blog: BIG Lottery
38. BIG Lottery
Visualisation and blog: BIG Lottery
39. NLP demo
40. BIG Lottery
Visualisation and blog: BIG Lottery
41. BIG Lottery
Visualisation and blog: BIG Lottery
42. BIG Lottery – lessons
●Initial (geographic) analysis didn’t work as expected
●Revealed some organisational history
Visualisation and blog: BIG Lottery
43. NewsReader World Cup Hack Day
●How can we explore and understand huge volumes of news articles?
●300,000 news articles on the World Cup
●Cutting edge NLP/semantic web
●Making an API
Visualisation and blog: NewsReader
44. NewsReader World Cup Hack Day
Visualisation and blog: NewsReader
45. NewsReader World Cup Hack Day
Visualisation and blog: NewsReader
46. NewsReader World Cup Hack Day
Visualisation: NewsReader
47. NewsReader Demo
https://newsreader.scraperwiki.com/
49. NewsReader – lessons
●Proper research project – this is actually hard stuff!
50. #InspiringWomen on Twitter
●Who are the #InspiringWomen?
●40,000 Tweets from a hashtag
●Tableau for quick and easy visualisation
Visualisation and blog: InspiringWomen
51. #InspiringWomen on Twitter
Visualisation and blog: InspiringWomen
52. #InspiringWomen on Twitter
Top 5 retweets
1. Emma Watson
2. Ada Lovelace
3. Delia Derbyshire
4. JK Rowling
5. Hedy Lamarr
Visualisation and blog: InspiringWomen
53. Twitter Demo
54. #InspiringWomen – lessons
●Spamming on Twitter
●Dynamics of retweeting
Visualisation and blog: InspiringWomen
55. MOT Data
●Are some makes of car prone to particular faults at MOT?
●Data on every single MOT test
●Handling a large dataset
●180,000,000 data points, 20GB
56. SQL Demo
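The demo itself is not reproduced in these slides. As a stand-in, the sketch below shows the kind of SQL query that could answer "are some makes of car prone to failure?", using Python's built-in sqlite3 module. The table name, column names and all eight rows are invented for illustration; the real MOT dataset has its own schema and is far too large to hold in an in-memory example like this.

```python
import sqlite3

# In-memory database with a hypothetical, much simplified table of MOT tests.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests (make TEXT, result TEXT)")
rows = [
    ("FORD", "PASS"), ("FORD", "FAIL"), ("FORD", "PASS"), ("FORD", "PASS"),
    ("VAUXHALL", "FAIL"), ("VAUXHALL", "PASS"),
    ("TOYOTA", "PASS"), ("TOYOTA", "PASS"),
]
conn.executemany("INSERT INTO tests VALUES (?, ?)", rows)

# In SQLite the comparison (result = 'FAIL') evaluates to 1 or 0,
# so its average per group is the failure rate for that make.
query = """
    SELECT make, COUNT(*) AS n_tests, AVG(result = 'FAIL') AS fail_rate
    FROM tests
    GROUP BY make
    ORDER BY fail_rate DESC
"""
failure_rates = conn.execute(query).fetchall()
for make, n_tests, fail_rate in failure_rates:
    print(f"{make:10s} {n_tests:3d} tests  {fail_rate:.0%} failed")
conn.close()
```

Letting the database do the grouping is what makes a dataset of this size manageable: only the small summary table ever needs to be loaded into Python.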
57. MOT Data
58. MOT Data
59. MOT Data
60. MOT Data
●Easy to get lost in a huge dataset
●Even sharing the derived data can be difficult
61. London Underground Visualisation
●How do passenger numbers vary across the London Underground?
●Wikipedia
●OpenStreetMap
●London Transport
●Melding/tidying data from disparate sources
●Fancy visualisation
Visualisation and blog: London Underground
62. London Underground Visualisation
Visualisation and blog: London Underground
63. London Underground – can I walk it?
Visualisation and blog: London Underground – can I walk it?
64. London Underground – lessons
●Fixed a problem with Table Xtract
●Whizzy visualisations can have a big impact
Visualisation and blog: London Underground
65. Twitter Machine Learning
●ScraperWiki's Twitter followers
●Which websites are businesses?
●1,000 website URLs
●Machine learning to classify websites
Blog: Twitter Machine Learning
66. Twitter Machine Learning
Blog: Twitter Machine Learning
67. Twitter Machine Learning – lift curve
Blog: Twitter Machine Learning
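The plot on this slide is not reproduced here. For readers unfamiliar with the term, the sketch below shows one common way to compute a lift curve in Python: rank the items by classifier score, then compare the rate of true positives in the top fraction with the rate in the whole set. The scores and labels are invented, and the blog post may define or plot lift differently.

```python
def lift_curve(scores, labels, steps=4):
    """Return (fraction targeted, lift) pairs for a binary classifier.

    Lift is the positive rate among the top-scored fraction of items
    divided by the positive rate across all items.
    """
    ranked = [
        label
        for _, label in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ]
    base_rate = sum(ranked) / len(ranked)
    curve = []
    for step in range(1, steps + 1):
        k = max(1, round(len(ranked) * step / steps))
        top_rate = sum(ranked[:k]) / k
        curve.append((step / steps, top_rate / base_rate))
    return curve


# Invented classifier scores and true labels (1 = business website).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
curve = lift_curve(scores, labels)
for fraction, lift in curve:
    print(f"top {fraction:.0%}: lift {lift:.2f}")
```

A lift above 1 for the top-ranked items means the classifier is concentrating the positives at the top of the list; by construction the lift falls to 1 once the whole set is included.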
68. Twitter Machine Learning – lessons
●Scikit-learn libraries make this easy
●Try out different algorithms – it’s cheap
●Getting the training set is the key problem
Blog: Twitter Machine Learning
69. Face recognition
●What are the demographics of Twitter users?
●Followers of @ScraperWiki
●Face recognition, online services
Blog: Face ReKognition
70. Face recognition
Blog: Face ReKognition
71. Face recognition
72. Face recognition
Blog: Face ReKognition
73. Face recognition
Blog: Face ReKognition
74. Face Recognition – lessons
●Complex analysis as a service
●Works reasonably well
Blog: Face ReKognition
75. UN Democracy
●Improving access to the UN verbatim proceedings
●Processing PDF
76. UN Democracy
77. UN Democracy
78. UN Democracy – lessons
●Extracting data from websites can be hard
●PDF is an important resource – general extraction mechanisms are hard
79. Digital Humanities on the web
Source: @Sharon_Howard
80. Digital Humanities on the web (1)
●Digging into Data Challenge
●Nineteenth Century Scholarship Online
●Eighteenth Century Scholarship Online
●Medieval Electronic Scholarly Alliance
●The Great Parchment Book
Source: @Sharon_Howard
81. Digital Humanities on the web (2)
●Quantifying Kissinger
●Swansea –City Witness
●Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic
Source: @Sharon_Howard
82. Digital Humanities on the web (3)
●City and Region – Urban and Agricultural Rent in England, 1400-1914
●The Proceedings of the Old Bailey 1674-1913
●Mapping the republic of letters
Source: @Sharon_Howard
83. Bibliography
●Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper
●Mining the Social Web by Matthew A. Russell
●Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall
●Data Science for Business by Foster Provost and Tom Fawcett
●Big Data by Viktor Mayer-Schönberger and Kenneth Cukier
●…more book reviews at ScraperWiki
84. Aims
●Explain “Data Science” and “Big Data”
●Show some tools
●Show some examples
●Take home
●New methodology
●Plan a project
85. Thank You! Please fill in survey
ian@scraperwiki.com