THE FORTUNE TELLER API
Bas Geerdink
Doing data science with Apache Spark
ABOUT ME
bas.geerdink@ing.nl
@bgeerdink
TODAY’S MISSION:
TO PREDICT THE FUTURE…
• Data Science
• Spark and MLlib
• API
DATA SCIENCE
Process:
1. Formulate a question
2. Gather data
3. Model data
4. Create data product
Source: Drew Conway, The Data Science Venn Diagram, 2013
DATA SCIENCE METHOD
Source: Foundational Methodology for Data Science, IBM, 2015
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
DATA SCIENCE METHOD
1. Formulate a question
BUSINESS PROBLEM
Fortune Teller at the circus
Input:
• Glass ball
• Lines on hand
• Star sign
• Astrology
• Tarot cards
Output:
• Vague prediction about future
Product Owner:
“We should be able to do better than this!”
HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty (yes / no)
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness
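As a rough illustration of the input and output listed on this slide, the two sides of the prediction could be modelled as small record types. This is only a sketch: the field names, the numeric encodings, and the unweighted average in `overall()` are assumptions made for the example and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PersonalDetails:
    """Input side of the slide; names and encodings are assumptions."""
    country: str
    age: int
    gender: str            # "male" or "female", as on the slide
    has_partner: bool
    children: int
    education_level: int   # hypothetical ordinal scale, e.g. 0 (none) to 5

    def to_features(self) -> List[float]:
        """Numeric features in a fixed order (country is left out here)."""
        return [
            float(self.age),
            1.0 if self.gender == "male" else 0.0,
            1.0 if self.has_partner else 0.0,
            float(self.children),
            float(self.education_level),
        ]


@dataclass(frozen=True)
class HappinessScore:
    """Output side of the slide: the three top-level categories."""
    health: float
    wealth: float
    well_being: float

    def overall(self) -> float:
        # Unweighted mean: an illustrative choice, not taken from the talk.
        return (self.health + self.wealth + self.well_being) / 3


# Made-up example values.
person = PersonalDetails("NL", 35, "female", True, 2, 4)
score = HappinessScore(health=0.8, wealth=0.6, well_being=0.7)
```

A model would then map `person.to_features()` to one prediction per output category.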
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
DATA SOURCES
• Gallup-Healthways Well-Being Index
• The World Bank
• Google Scholar
• www.data.gov
• Global Health Data Exchange
• World Health Organization
• Simple Online Data Archive for Population Studies (Sodapop)
• The World Factbook
• UCI Machine Learning Repository
WINNING DATASET
National Health Interview Survey 2012
• 43345 surveys
• 133 questions
• Well documented
• Free to download and use
HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
  • Life expectancy
  • Disease
• Wealth
  • Poverty (yes / no)
  • Income
• “Psychological well-being”
  • Enjoyment
  • Stress
  • Anger
  • Worry
  • Sadness
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
3. Analyze data
APACHE SPARK
• General purpose computing engine
• In-memory processing
• Support of streaming data, machine learning, graphs
• (much) faster than Hadoop MapReduce
• Small player in the (OS) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, …
• It’s just a tool… no solution or holy grail
• “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”
SPARK OVERVIEW
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Engine: Spark Core
• Cluster managers: Standalone, YARN, Mesos
• Languages: Scala, Python, R, Java
• Storage: File system, HDFS, HBase, Cassandra, …
SPARK CLUSTER MODE
• Standalone
• Mesos
• YARN
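In practice, the cluster mode is selected through the master URL handed to Spark. The helper below is a small sketch of that mapping. The function name and the host names are hypothetical; the URL schemes and default ports (7077 for standalone, 5050 for Mesos) follow the Spark documentation. On the Spark 1.x releases current at the time of the talk, YARN was addressed as `yarn-client` or `yarn-cluster` rather than plain `yarn`.

```python
def master_url(mode, host="localhost", port=None):
    """Build a Spark master URL for a cluster mode (hypothetical helper)."""
    mode = mode.lower()
    if mode == "local":
        # Run Spark in-process, using all available cores.
        return "local[*]"
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    if mode == "standalone":
        return "spark://{}:{}".format(host, port or 7077)
    if mode == "mesos":
        return "mesos://{}:{}".format(host, port or 5050)
    raise ValueError("unknown cluster mode: " + mode)


# Example host names are made up.
standalone = master_url("standalone", "spark-master")
mesos = master_url("mesos", "mesos-master", 5051)
```

The resulting string is what one would pass as `--master` to `spark-submit` or set as `spark.master` in the configuration.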
DEMO
CORRELATION <> CAUSATION
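A tiny worked example may make the point of this slide concrete. The sketch below computes the Pearson correlation coefficient in plain Python on made-up numbers: ice cream sales and sunburn cases are both driven by temperature, so they correlate perfectly although neither causes the other. The data and variable names are invented for illustration and are not from the survey used in the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: temperature is the hidden common cause.
temperature = [10, 15, 20, 25, 30]
ice_creams = [2 * t + 5 for t in temperature]
sunburns = [t // 5 for t in temperature]

r = pearson(ice_creams, sunburns)  # close to 1.0
```

A high `r` only says that the two series move together; finding the common cause takes domain knowledge.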
BIG DATA IS OUT, ML IS IN
Source: Gartner, Hype Cycle for Emerging Technologies, 2015
MACHINE LEARNING
“The field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
• Actually, this is… algorithms maximizing scores using a statistical approach to problem solving
• Producing… systems that can learn from and make decisions and predictions based on data
MACHINE LEARNING TASKS
Recommendation Using Association Rules (Similarity Matching)
• Predict items that have a high similarity to others within a given set of items.
• Example: Predicting movies or books based on someone’s historic purchase behavior.
Classification
• Predict to which class/category a certain item belongs. These categories are predefined. A classification task can be binary or multi-class.
• Example: Determining whether a message is spam or non-spam (binary); determining characters from a handwriting sample (multi-class).
Regression
• Focus on predicting numeric values.
• Example: Predicting the number of ice cream cones to be sold on a certain day based on weather data.
Clustering
• Divide items into groups, but unlike in classification tasks, these groups are not previously defined.
• Example: Grouping customers based on certain properties to discover customer segments.
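The regression task above can be sketched in a few lines of plain Python. This is an ordinary least-squares fit of a straight line, applied to the ice cream example with made-up numbers. It only illustrates the idea and is not the MLlib code from the demo.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: temperature (°C) and ice cream cones sold.
temps = [15, 20, 25, 30]
cones = [30, 40, 50, 60]

slope, intercept = fit_line(temps, cones)
forecast = slope * 35 + intercept  # predicted cones on a 35 °C day
```

With these numbers the fitted line is two cones per degree, so the forecast for 35 °C is about 70 cones.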
PICK AN ALGORITHM…
DEMO
DATA SCIENCE METHOD
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
API DESIGN
• Start Spark server: GET http://fortuneteller/start
• Stop Spark server: GET http://fortuneteller/stop
• Add survey records: POST http://fortuneteller/survey
• Train model: GET http://fortuneteller/train
• Correlations: GET http://fortuneteller/correlations
• Predict Health: GET http://fortuneteller/prediction/health
• Predict Wealth: GET http://fortuneteller/prediction/wealth
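The endpoint list above can be read as a simple routing table. The sketch below shows one way to express it: the paths and HTTP verbs are taken from the slide, while the handler names and the `resolve` function are hypothetical. The actual project uses the Play framework, which has its own routes file.

```python
# (method, path) -> handler name; the handler names are made up for illustration.
ROUTES = {
    ("GET", "/start"): "start_spark",
    ("GET", "/stop"): "stop_spark",
    ("POST", "/survey"): "add_survey_records",
    ("GET", "/train"): "train_model",
    ("GET", "/correlations"): "correlations",
    ("GET", "/prediction/health"): "predict_health",
    ("GET", "/prediction/wealth"): "predict_wealth",
}


def resolve(method, path):
    """Return the handler name for a request, or None if no route matches."""
    normalized = path.rstrip("/") or "/"
    return ROUTES.get((method.upper(), normalized))


handler = resolve("get", "/prediction/health/")
```

The sketch keeps the verbs exactly as listed on the slide. In a stricter REST design, state-changing calls such as start, stop and train would more commonly be POST requests.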
DEMO
NEXT STEPS…
• Web app?
• Deploy to cloud?
• Streaming linear regression?
Questions?
https://github.com/geerdink/FortuneTellerApi

Fortune Teller API - Doing Data Science with Apache Spark

  • 1. THE FORTUNE TELLER API Bas Geerdink Doing data science with Apache Spark
  • 3. TODAY’S MISSION: TO PREDICT THE FUTURE… • Data Science • Spark and MLlib • API
  • 4. DATA SCIENCE Process: 1. Formulate a question 2. Gather data 3. Model data 4. Create data product Source: Drew Conway, The Data Science Venn Diagram, 2013
  • 5. DATA SCIENCE METHOD Source: Foundational Methodology for Data Science, IBM, 2015 1. Formulate a question 3. Analyze data 4. Product 2. Gather data
  • 6. DATA SCIENCE METHOD 1. Formulate a question
  • 7. BUSINESS PROBLEM Fortune Teller at the circus Input: • Glass ball • Lines on hand • Star sign • Astrology • Tarot cards Output: • Vague prediction about future Product Owner: “We should be able to do better than this!”
  • 8.
  • 9. HOW TO CALCULATE HAPPINESS Input: (personal details) • Country of residence • Age • Male / female • Partner (yes / no) • Number of children • Level of education (yes / no) Output: (the happiness score) • Health • Life expectancy • Disease • Wealth • Poverty yes or no • Income • “Psychological well-being” • Enjoyment • Stress • Anger • Worry • Sadness
  • 10. DATA SCIENCE METHOD 1. Formulate a question 2. Gather data
  • 11. DATA SOURCES • Gallup-Healthways Well-Being Index • The World Bank • Google Scholar • www.data.gov • Global Health Data Exchange • World Health Organization • Simple Online Data Archive for Population Studies (Sodapop) • The World Factbook • UCI Machine Learning Repository
  • 12. WINNING DATASET National Health Interview Survey 2012 • 43345 surveys • 133 questions • Well documented • Free to download and use
  • 13. HOW TO CALCULATE HAPPINESS Input: (personal details) • Country of residence • Age • Male / female • Partner (yes / no) • Number of children • Level of education Output: (the happiness score) • Health • Life expectancy • Disease • Wealth • Poverty yes or no • Income • “Psychological well-being” • Enjoyment • Stress • Anger • Worry • Sadness
  • 14. DATA SCIENCE METHOD 1. Formulate a question 3. Analyze data 2. Gather data
  • 15. • General purpose computing engine • In-memory processing • Support of streaming data, machine learning, graphs • (much) faster than Hadoop MapReduce
  • 16. • Small player in the (OS) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, … • It’s just a tool… no solution or holy grail • “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”
  • 17. SPARK OVERVIEW Spark Core Spark SQL Spark Streaming GraphXMLlib Standalone YARN Mesos Scala Python R Java File system HDFS HBase Cassandra …
  • 18. SPARK CLUSTER MODE • Standalone • Mesos • YARN
  • 19. DEMO
  • 21. BIG DATA IS OUT, ML IS IN Source: Gartner, Hype Cycle for Emerging Technologies, 2015
  • 22. MACHINE LEARNING • Actually, this is… algorithms maximizing scores using a statistical approach to problem solving • Producing… systems that can learn from and make decisions and predictions based on data The field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
  • 23. MACHINE LEARNING TASKS Recommendation Using Association Rules (Similarity Matching) • Predict items that have a high similarity to others within a given set of items. • Example: Predicting movies or books based on someone’s historic purchase behavior. Classification • Predict to which class/category a certain item belongs. These categories are predefined. A classification task can be binary or multi-class. • Example: Determining whether a message is spam or non-spam (binary); determining characters from a handwriting sample (multi-class). Regression • Focus on predicting numeric values. • Example: Predicting the number of ice cream cones to be sold on a certain day based on weather data. Clustering • Divide items into groups, but unlike in classification tasks, these groups are not previously defined. • Example: Grouping customers based on certain properties to discover customer segments.
  • 25. DEMO
  • 26. DATA SCIENCE METHOD 1. Formulate a question 3. Analyze data 4. Product 2. Gather data
  • 27. API DESIGN • Start Spark server: GET http://fortuneteller/start • Stop Spark server: GET http://fortuneteller/stop • Add survey records: POST http://fortuneteller/survey • Train model: GET http://fortuneteller/train • Correlations: GET http://fortuneteller/correlations • Predict Health: GET http://fortuneteller/prediction/health • Predict Wealth: GET http://fortuneteller/prediction/wealth
  • 28. DEMO
  • 29. NEXT STEPS… • Web app? • Deploy to cloud? • Streaming linear regression?
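
The endpoint list on the API DESIGN slide (slide 27) can be sketched as a small route table. This is a minimal, framework-free illustration in Python, not the talk's actual Play/Scala implementation: the class name `FortuneTellerApi`, the handler names, the status codes, and the JSON payloads are all assumptions, and the Spark-backed parts (correlations and predictions) are deliberately stubbed.

```python
from typing import Callable, Dict, List, Optional, Tuple

Response = Tuple[int, dict]  # (HTTP status code, JSON-like payload)
Handler = Callable[[Optional[dict]], Response]


class FortuneTellerApi:
    """Hypothetical in-memory stand-in for the seven endpoints on the slide."""

    def __init__(self) -> None:
        self.running = False
        self.surveys: List[dict] = []
        self.trained = False
        # (method, path) pairs copied from the slide; handlers are assumptions.
        self.routes: Dict[Tuple[str, str], Handler] = {
            ("GET", "/start"): self.start,
            ("GET", "/stop"): self.stop,
            ("POST", "/survey"): self.add_survey,
            ("GET", "/train"): self.train,
            ("GET", "/correlations"): self.correlations,
            ("GET", "/prediction/health"): self.predict("health"),
            ("GET", "/prediction/wealth"): self.predict("wealth"),
        }

    def start(self, _body: Optional[dict] = None) -> Response:
        self.running = True
        return 200, {"status": "started"}

    def stop(self, _body: Optional[dict] = None) -> Response:
        self.running = False
        return 200, {"status": "stopped"}

    def add_survey(self, body: Optional[dict] = None) -> Response:
        if not self.running:
            return 409, {"error": "server not started"}
        if not body:
            return 400, {"error": "empty survey record"}
        self.surveys.append(body)
        return 201, {"surveys": len(self.surveys)}

    def train(self, _body: Optional[dict] = None) -> Response:
        if not self.surveys:
            return 409, {"error": "no survey records"}
        self.trained = True  # a real service would fit a Spark MLlib model here
        return 200, {"status": "trained"}

    def correlations(self, _body: Optional[dict] = None) -> Response:
        return 501, {"error": "not implemented in this sketch"}

    def predict(self, target: str) -> Handler:
        def handler(_body: Optional[dict] = None) -> Response:
            if not self.trained:
                return 409, {"error": "model not trained"}
            return 501, {"error": f"{target} prediction not implemented in this sketch"}
        return handler

    def handle(self, method: str, path: str, body: Optional[dict] = None) -> Response:
        """Look up the route and call its handler, or answer 404."""
        handler = self.routes.get((method.upper(), path))
        if handler is None:
            return 404, {"error": "unknown route"}
        return handler(body)
```

A call such as `FortuneTellerApi().handle("GET", "/start")` returns the status code and payload as a tuple. One design observation: the slide uses GET for start, stop, and train, all of which change server state; a stricter REST design would probably use POST for those, since GET is meant to be a safe method.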

Editor's Notes

  1. Goal of this talk: give a short intro to data science and its tools, processes, and way of working; explain and show Apache Spark in combination with the Play API framework; NOT to produce ground-breaking results… About this talk: some theoretical background; design of a solution based on patterns, architecture principles, etc.; demo and source code show-off. Who believes in astrology? In this session, we'll create an app to predict the future based on your horoscope! Let's see if we can do better than your average circus artist... Join this session for a good portion of machine learning with MLlib, live coding, and most of all: fun! Horoscopes are nonsense… right? Let’s have a closer look at that statement from a data-driven perspective! In this session, we’ll create an API for predicting your happiness and well-being based on your horoscope, with machine learning technology. The end goal is to predict the future based on the alignment of the stars!
  2. http://www.datasciencecentral.com/profiles/blogs/a-simple-introduction-to-data-science http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html Substantive Expertise = domain knowledge. Usually: hypothesis-driven. Gather also includes cleaning, filtering (slicing & dicing), storing, … (biggest step!). Model also includes analysis and interpreting results. Do some quick hacking with the Spark shell; structure it with a Zeppelin notebook; build an application with an IDE (e.g. IntelliJ IDEA) and a build tool (e.g. sbt); publish the app as a jar file to a Spark cluster; run the app. Point 5: is new since my education (1998-2004). Previously, data science / ML / AI was only academic: it took lots of time, lots of processing power, and lots of preparation and programming of the algorithms. Now the work has shifted to using readily available platforms (R, Python, Weka, SAS) and gathering data. The end product has also come into focus and is mainstream: e.g. Google self-driving car, speech recognition, recommendation of books on Amazon, (at ING) balance forecasting.
  3. 1. Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems. 2. Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake. 3. Analysis – This is the part of the process where insight is to be extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data. 4. Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation. http://www.ibmbigdatahub.com/blog/why-we-need-methodology-data-science
  4. Same as note 3 (the four-step data science method).
  5. Input = random. Output = vague. And by the way: it costs money!
  6. Fortune → Happiness? Movie: happiness = money, financial success… (the movie ends when Will Smith has a job). Scientific: health-wealth nexus.
  7. Model based on scientific papers
  8. Same as note 3 (the four-step data science method).
  9. First problem: select a suitable dataset. Tons of information, just not the right kind… Google, dive deeper, send requests to researchers, …
  10. Also: how to calculate FUTURE happiness? Regression to the mean: more happy than benchmark → become less healthy / wealthy / happy (Good → Average, Bad → Average). OR continuous improvement / self-fulfilling prophecy: more happy than benchmark → become even healthier / wealthier / happier (Good → Great, Bad → Worse).
  11. Data quality is a large problem. Also: data governance, metadata, …
  12. Same as note 3 (the four-step data science method).
  13. Why Spark? Combination of Scala, Hadoop, and AI / ML. Growing support and community; mature but still under heavy development. Spark is a fast, general-purpose engine for large-scale, in-memory data processing. Spark supports streaming analytics, machine learning, and graph computation. Spark's in-memory primitives provide performance up to 100 times faster than that of traditional open source big data frameworks (Hadoop MR). Spark can be seen as a newer version of MapReduce. It uses memory (RAM) rather than the hard drive; as memory access is orders of magnitude faster than disk access, Spark can yield significant gains. Spark does not persist intermediate results by default. Instead, it records the path of how it got to the results (their lineage), so how the results were computed. If anything goes wrong during the process, the system can simply redo the computation by following this path. http://spark.apache.org/ https://github.com/apache/spark
  14. Critical note
  15. Works with any file system that is supported by Hadoop. Current version: 1.4.1 (1.5 in beta). APIs. Supported adapters. Alternatives for MLlib: R and Python (leading), Mahout, RapidMiner, SAS, SPSS, IBM, Oracle, SAP, Microsoft, …
  16. DEMO 1 Spark Zeppelin
  17. Freakonomics
  18. Hidden / backup
  19. http://www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
  20. Subfield of AI: computer science and math. Machine Learning is often defined as programming by example, based on observations or data. If a computer uses experience (examples, data, observations) to perform a certain task better or make a better prediction, it can be said to have learned. Arthur Samuel (1959): “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.” Tom Mitchell (1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” Examples of questions: What is the market value of this car? Which of these people are family/friends? → Problems that cannot be solved by numerical means alone. Strong ties to computational statistics and mathematical optimization → prediction making. Output: classification → fixed categories; (linear) regression → real value; clustering; density estimation; dimension reduction → e.g. facial recognition. Types: Supervised: “learning” is optimizing a mathematical function (hypothesis), often with many (millions?) of input variables / dimensions. Unsupervised. VERY BIG field, rapidly expanding and growing in popularity.
  21. It’s a broad field; you really have to pick tools and techniques (algorithms) carefully. Supervised: classification, regression. Unsupervised: clustering.
  22. org.apache.spark.mllib.classification/clustering/evaluation/linalg/optimization/recommendation/regression/stat/tree
  23. There are some guidelines and cheat sheets available for helping you choose an algorithm.
  24. DEMO 2 Linear Regression
  25. Same as note 3 (the four-step data science method).
  26. Virtual server: 4GB RAM, 4 CPUs, 5GB in size.
  27. https://www.playframework.com/documentation/2.1.1/ScalaJson
  28. DEMO 3 IntelliJ + code
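
The notes describe supervised learning as optimizing a mathematical function and name linear regression as the technique for predicting a real value (notes 20 and 24). As a rough illustration of that idea, here is a minimal single-feature least-squares fit in plain Python. It is a sketch under stated assumptions, not the demo code: the function names `fit_line` and `predict` are hypothetical, the closed-form solution is swapped in for whatever optimizer the MLlib regression package uses, and the sample numbers are synthetic, not taken from the survey dataset.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Synthetic points that lie exactly on y = 2x + 1 (illustrative only).
model = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(model, predict(model, 10.0))
```

With several input variables (age, number of children, and so on) the same idea generalizes to a vector of weights, which is where a library such as MLlib becomes useful.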