This document summarizes a talk on building an API that uses Apache Spark and machine learning to predict happiness from personal details. It covers formulating the problem as predicting a happiness score, gathering national health survey data, using Spark's in-memory processing and MLlib algorithms to find correlations and make predictions, and designing an API to expose the trained model.
4. DATA SCIENCE
Process:
1. Formulate a question
2. Gather data
3. Model data
4. Create data product
Source: Drew Conway, The Data Science Venn Diagram, 2013
5. DATA SCIENCE METHOD
Source: Foundational Methodology for Data Science, IBM, 2015
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
7. BUSINESS PROBLEM
Fortune Teller at the circus
Input:
• Glass ball
• Lines on hand
• Star sign
• Astrology
• Tarot cards
Output:
• Vague prediction about future
Product Owner:
“We should be able to do better than this!”
9. HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
• Life expectancy
• Disease
• Wealth
• Poverty yes or no
• Income
• “Psychological well-being”
• Enjoyment
• Stress
• Anger
• Worry
• Sadness
11. DATA SOURCES
• Gallup-Healthways Well-Being Index
• The World Bank
• Google Scholar
• www.data.gov
• Global Health Data Exchange
• World Health Organization
• Simple Online Data Archive for Population Studies (Sodapop)
• The World Factbook
• UCI Machine Learning Repository
12. WINNING DATASET
National Health Interview Survey 2012
• 43345 surveys
• 133 questions
• Well documented
• Free to download and use
13. HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
• Life expectancy
• Disease
• Wealth
• Poverty yes or no
• Income
• “Psychological well-being”
• Enjoyment
• Stress
• Anger
• Worry
• Sadness
15. APACHE SPARK
• General-purpose computing engine
• In-memory processing
• Supports streaming data, machine learning, and graphs
• (Much) faster than Hadoop MapReduce
16. • Small player in the (open-source) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, …
• It’s just a tool… no solution or holy grail
• “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”
21. BIG DATA IS OUT, ML IS IN
Source: Gartner, Hype Cycle for Emerging Technologies, 2015
22. MACHINE LEARNING
• Actually, this is… algorithms maximizing scores using a statistical approach to problem solving
• Producing… systems that can learn from and make decisions and predictions based on data
The field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
23. MACHINE LEARNING TASKS
Recommendation Using Association Rules (Similarity Matching)
• Predict items that have a high similarity to others within a given set of items.
• Example: Predicting movies or books based on someone’s historic purchase behavior.
Classification
• Predict to which class/category a certain item belongs. These categories are predefined. A
classification task can be binary or multi-class.
• Example: Determining whether a message is spam or non-spam (binary); determining
characters from a handwriting sample (multi-class).
Regression
• Focus on predicting numeric values.
• Example: Predicting the number of ice cream cones to be sold on a certain day based on
weather data.
Clustering
• Divide items into groups, but unlike in classification tasks, these groups are not previously
defined.
• Example: Grouping customers based on certain properties to discover customer segments.
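As an illustration of a clustering task, here is a minimal sketch using MLlib's KMeans (assuming the Spark 1.x "mllib" API; the points are made-up data, not the survey):

```scala
// A minimal clustering example with MLlib's KMeans (Spark 1.x "mllib" API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("clustering").setMaster("local[*]"))

    // Two obvious groups of 2-D points; no labels are given up front.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)))

    // KMeans discovers the groups itself (unsupervised).
    val model = KMeans.train(points, k = 2, maxIterations = 20)
    println(model.clusterCenters.mkString(", "))

    sc.stop()
  }
}
```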
27. API DESIGN
• Start Spark server: GET http://fortuneteller/start
• Stop Spark server: GET http://fortuneteller/stop
• Add survey records: POST http://fortuneteller/survey
• Train model: GET http://fortuneteller/train
• Correlations: GET http://fortuneteller/correlations
• Predict Health: GET http://fortuneteller/prediction/health
• Predict Wealth: GET http://fortuneteller/prediction/wealth
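In Play, this design could be captured in a conf/routes file along these lines (controller and action names are hypothetical):

```
GET     /start                 controllers.FortuneTeller.start
GET     /stop                  controllers.FortuneTeller.stop
POST    /survey                controllers.FortuneTeller.addSurvey
GET     /train                 controllers.FortuneTeller.train
GET     /correlations          controllers.FortuneTeller.correlations
GET     /prediction/health     controllers.FortuneTeller.predictHealth
GET     /prediction/wealth     controllers.FortuneTeller.predictWealth
```

Note that GET is used for state-changing calls (start, stop, train) purely for demo convenience; a production API would use POST for those.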
Goal of this talk:
Give a short intro into data science and its tools, processes, way of working
Explain and show Apache Spark in combination with Play API framework
NOT to produce ground-breaking results…
About this talk:
Some theoretic background
Design of a solution based on patterns, architecture principles, etc
Demo and source code show-off
Who believes in astrology?
In this session, we'll create an app to predict the future based on your horoscope! Let's see if we can do better than your average circus artist... Join this session for a good portion of machine learning with MLlib, live coding, and most of all: fun!
Horoscopes are nonsense… right?
Let’s have a closer look at that statement from a data-driven perspective! In this session, we’ll create an API for predicting your happiness and well-being based on your horoscope, with machine learning technology. The end-goal is to predict the future based on the alignment of the stars!
http://www.datasciencecentral.com/profiles/blogs/a-simple-introduction-to-data-science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html
Substantive Expertise = domain knowledge
Usually: hypothesis-driven.
Gather also includes cleaning, filtering (slicing&dicing), storing, … (biggest step!)
Model also includes analysis, interpret results.
Do some quick hacking with Spark shell
Structure it with a Zeppelin notebook
Build an application with an IDE (e.g. IntelliJ IDEA) and build tool (e.g. sbt)
Publish the app as jar file to a Spark cluster
Run the app
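The workflow above can be sketched as a few commands (project, class, and host names are placeholders):

```
# Quick hacking in the Spark shell
spark-shell

# Build the application jar with sbt, then submit it to a Spark cluster
sbt package
spark-submit \
  --class com.example.FortuneTeller \
  --master spark://cluster-host:7077 \
  target/scala-2.10/fortuneteller_2.10-1.0.jar
```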
Point 5 is new since my education (1998-2004). Previously data science / ML / AI was only academic: it took lots of time, lots of processing power, lots of preparation, and programming the algorithms by hand. Now the work has shifted to using readily available platforms (R, Python, Weka, SAS) and gathering data. The end product has also come into focus and gone mainstream: e.g. the Google self-driving car, speech recognition, book recommendations on Amazon, (at ING) balance forecasting.
1. Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems.
2. Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
3. Analysis – This is the part of the process where insight is to be extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
4. Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
http://www.ibmbigdatahub.com/blog/why-we-need-methodology-data-science
Input = random
Output = vague
And by the way: it costs money!
Fortune Happiness ?
Movie (The Pursuit of Happyness): happiness = money, financial success… (ends with Will Smith’s character getting a job)
Scientific:
health-wealth nexus
Model based on scientific papers
First problem: select a suitable dataset. Tons of information available, just not the right information…
Google, dive deeper, send requests to researchers, …
Also:
How to calculate FUTURE happiness???
Regression to the mean: people happier than the benchmark become less healthy / wealthy / happy over time (good → average, bad → average).
OR continuous improvement / self-fulfilling prophecy: people happier than the benchmark become even healthier / wealthier / happier (good → great, bad → worse).
Data quality is a large problem
Also: data governance, metadata, …
Why Spark?
Combination of:
Scala
Hadoop
AI / ML
Growing support and community, mature but still under heavy development
Spark is a fast and general purpose engine for large-scale data processing in memory. Spark supports streaming analytics, machine learning and graph computation. Spark's in-memory primitives provide performance up to 100 times faster than that of traditional open source big data frameworks (Hadoop MR).
Spark can be seen as a newer version of MapReduce. It uses memory (RAM) rather than the hard-drive. As memory is a thousand times faster than the hard drive, Spark can yield significant gains. Spark does not store intermediate results. Instead, it stores the path of how it got to the results, so how the results were computed. If anything has gone wrong during the process, the system can simply re-do the computation by following this path.
http://spark.apache.org/
https://github.com/apache/spark
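This recompute-from-lineage model can be seen in a minimal sketch (assuming the Spark 1.x Scala API; the file path is a placeholder):

```scala
// Minimal sketch: load a text file, cache it in memory, and count records.
// "surveys.csv" is a placeholder path.
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fortuneteller").setMaster("local[*]"))

    // An RDD remembers its lineage (how it was computed) rather than
    // storing intermediate results; lost partitions are simply recomputed.
    val lines = sc.textFile("surveys.csv").cache()
    println(s"records: ${lines.count()}")

    sc.stop()
  }
}
```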
Critical note
Works with any file system that is supported by Hadoop.
Current version: 1.4.1 (1.5 in beta)
APIs
Supported adapters
Alternatives for MLlib: R and Python (leading), Mahout, RapidMiner, SAS, SPSS, IBM, Oracle, SAP, Microsoft, …
Subfield of AI: computer science and math
Machine Learning is often defined as programming by example, based on observations or data. If a computer has learned from experience (examples, data, observations) to perform a certain task better or make a better prediction, it can be said to have learned.
Arthur Samuel (1959): “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”
Tom Mitchell (1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Examples of questions: What is the market value of this car? Which of these people are family/friends? These are problems that cannot be solved by numerical means alone.
Strong ties to computational statistics and mathematical optimization; the focus is on making predictions.
Output:
Classification → fixed categories
(Linear) regression → real value
Clustering
Density estimation
Dimension reduction → e.g. facial recognition
Types:
Supervised
“learning” is optimizing a mathematical function (hypothesis)
Often many (millions?) of input variables / dimensions
Unsupervised
VERY BIG field, rapidly expanding and growing in popularity
It’s a broad field, you really have to be careful to pick tools and techniques (algorithms).
Supervised: classification, regression
Unsupervised: clustering
There are some guidelines and cheat sheets available for helping you choose an algorithm.
DEMO 2
Linear Regression
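A sketch of what the demo's linear regression could look like with MLlib (assuming the Spark 1.x "mllib" API; labels and feature values here are made-up stand-ins for the survey columns):

```scala
// Linear regression with MLlib (Spark 1.x "mllib" API). The labels and
// feature values are made-up stand-ins for survey columns.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object RegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("regression").setMaster("local[*]"))

    // label = happiness score; features = e.g. age, partner (0/1), children
    val training = sc.parallelize(Seq(
      LabeledPoint(7.0, Vectors.dense(34.0, 1.0, 2.0)),
      LabeledPoint(5.0, Vectors.dense(52.0, 0.0, 0.0)),
      LabeledPoint(8.0, Vectors.dense(29.0, 1.0, 1.0))))

    // Fit the model, then predict a score for a new person.
    val model = LinearRegressionWithSGD.train(training, numIterations = 100)
    println(model.predict(Vectors.dense(40.0, 1.0, 3.0)))

    sc.stop()
  }
}
```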