This document summarizes a talk on building an API that uses Apache Spark and machine learning to predict happiness from personal details. It covers formulating the problem as predicting a happiness score, gathering national health survey data, using Spark's in-memory processing and MLlib algorithms to find correlations and make predictions, and designing an API to expose the trained model.
4. DATA SCIENCE
Process:
1. Formulate a question
2. Gather data
3. Model data
4. Create data product
Source: Drew Conway, The Data Science Venn Diagram, 2013
5. DATA SCIENCE METHOD
Source: Foundational Methodology for Data Science, IBM, 2015
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
7. BUSINESS PROBLEM
Fortune Teller at the circus
Input:
• Glass ball
• Lines on hand
• Star sign
• Astrology
• Tarot cards
Output:
• Vague prediction about future
Product Owner:
“We should be able to do better than this!”
9. HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
• Life expectancy
• Disease
• Wealth
• Poverty yes or no
• Income
• “Psychological well-being”
• Enjoyment
• Stress
• Anger
• Worry
• Sadness
11. DATA SOURCES
• Gallup-Healthways Well-Being Index
• The World Bank
• Google Scholar
• www.data.gov
• Global Health Data Exchange
• World Health Organization
• Simple Online Data Archive for Population Studies (Sodapop)
• The World Factbook
• UCI Machine Learning Repository
12. WINNING DATASET
National Health Interview Survey 2012
• 43345 surveys
• 133 questions
• Well documented
• Free to download and use
13. HOW TO CALCULATE HAPPINESS
Input: (personal details)
• Country of residence
• Age
• Male / female
• Partner (yes / no)
• Number of children
• Level of education
Output: (the happiness score)
• Health
• Life expectancy
• Disease
• Wealth
• Poverty yes or no
• Income
• “Psychological well-being”
• Enjoyment
• Stress
• Anger
• Worry
• Sadness
15. APACHE SPARK
• General-purpose computing engine
• In-memory processing
• Supports streaming data, machine learning, and graphs
• (Much) faster than Hadoop MapReduce
16. • Small player in the (open-source) world of Machine Learning: Python and R are leading, followed by SAS, Weka, RapidMiner, …
• It’s just a tool… no solution or holy grail
• “I predict that mean cluster size will remain very close to one until the end of humanity. The vast majority of problems are small. Honestly, the combined utility of PyData and Spark pales in comparison to the utility of Excel.”
21. BIG DATA IS OUT, ML IS IN
Source: Gartner, Hype Cycle for Emerging Technologies, 2015
22. MACHINE LEARNING
• Actually, this is… algorithms maximizing scores using a statistical approach to problem solving
• Producing… systems that can learn from and make decisions and predictions based on data
The field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
23. MACHINE LEARNING TASKS
Recommendation Using Association Rules (Similarity Matching)
• Predict items that have a high similarity to others within a given set of items.
• Example: Predicting movies or books based on someone’s historic purchase behavior.
Classification
• Predict to which class/category a certain item belongs. These categories are predefined. A
classification task can be binary or multi-class.
• Example: Determining whether a message is spam or non-spam (binary); determining
characters from a handwriting sample (multi-class).
Regression
• Focus on predicting numeric values.
• Example: Predicting the number of ice cream cones to be sold on a certain day based on
weather data.
Clustering
• Divide items into groups, but unlike in classification tasks, these groups are not previously
defined.
• Example: Grouping customers based on certain properties to discover customer segments.
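As an illustration of a clustering task, here is a minimal sketch using MLlib's KMeans (assuming the Spark 1.x "mllib" API; the points are made-up data, not the survey):

```scala
// A minimal clustering example with MLlib's KMeans (Spark 1.x "mllib" API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("clustering").setMaster("local[*]"))

    // Two obvious groups of 2-D points; no labels are given up front.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)))

    // KMeans discovers the groups itself (unsupervised).
    val model = KMeans.train(points, k = 2, maxIterations = 20)
    println(model.clusterCenters.mkString(", "))

    sc.stop()
  }
}
```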
27. API DESIGN
• Start Spark server: GET http://fortuneteller/start
• Stop Spark server: GET http://fortuneteller/stop
• Add survey records: POST http://fortuneteller/survey
• Train model: GET http://fortuneteller/train
• Correlations: GET http://fortuneteller/correlations
• Predict Health: GET http://fortuneteller/prediction/health
• Predict Wealth: GET http://fortuneteller/prediction/wealth
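In Play, this design could be captured in a conf/routes file along these lines (controller and action names are hypothetical):

```
GET     /start                 controllers.FortuneTeller.start
GET     /stop                  controllers.FortuneTeller.stop
POST    /survey                controllers.FortuneTeller.addSurvey
GET     /train                 controllers.FortuneTeller.train
GET     /correlations          controllers.FortuneTeller.correlations
GET     /prediction/health     controllers.FortuneTeller.predictHealth
GET     /prediction/wealth     controllers.FortuneTeller.predictWealth
```

Note that GET is used for state-changing calls (start, stop, train) purely for demo convenience; a production API would use POST for those.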
Goal of this talk:
Give a short intro into data science and its tools, processes, way of working
Explain and show Apache Spark in combination with Play API framework
NOT to produce ground-breaking results…
About this talk:
Some theoretic background
Design of a solution based on patterns, architecture principles, etc
Demo and source code show-off
Who believes in astrology?
In this session, we'll create an app to predict the future based on your horoscope! Let's see if we can do better than your average circus artist... Join this session for a good portion of machine learning with MLlib, live coding, and most of all: fun!
Horoscopes are nonsense… right?
Let’s have a closer look at that statement from a data-driven perspective! In this session, we’ll create an API for predicting your happiness and well-being based on your horoscope, with machine learning technology. The end-goal is to predict the future based on the alignment of the stars!
http://www.datasciencecentral.com/profiles/blogs/a-simple-introduction-to-data-science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html
Substantive Expertise = domain knowledge
Usually: hypothesis-driven.
Gather also includes cleaning, filtering (slicing&dicing), storing, … (biggest step!)
Model also includes analysis, interpret results.
Do some quick hacking with Spark shell
Structure it with a Zeppelin notebook
Build an application with an IDE (e.g. IntelliJ IDEA) and build tool (e.g. sbt)
Publish the app as jar file to a Spark cluster
Run the app
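The workflow above can be sketched as a few commands (project, class, and host names are placeholders):

```
# Quick hacking in the Spark shell
spark-shell

# Build the application jar with sbt, then submit it to a Spark cluster
sbt package
spark-submit \
  --class com.example.FortuneTeller \
  --master spark://cluster-host:7077 \
  target/scala-2.10/fortuneteller_2.10-1.0.jar
```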
Point 5 is new since my education (1998-2004). Previously data science / ML / AI was only academic: it took lots of time, lots of processing power, lots of preparation, and programming the algorithms by hand. Now the work has shifted to using readily available platforms (R, Python, Weka, SAS) and gathering data. The end product has also come into focus and gone mainstream: e.g. the Google self-driving car, speech recognition, book recommendations on Amazon, (at ING) balance forecasting.
1. Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems.
2. Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
3. Analysis – This is the part of the process where insight is to be extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
4. Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
http://www.ibmbigdatahub.com/blog/why-we-need-methodology-data-science
Input = random
Output = vague
And by the way: it costs money!
Fortune Happiness ?
Movie (The Pursuit of Happyness): happiness = money, financial success… (ends with Will Smith’s character getting a job)
Scientific:
health-wealth nexus
Model based on scientific papers
First problem: select a suitable dataset. Tons of information available, just not the right information…
Google, dive deeper, send requests to researchers, …
Also:
How to calculate FUTURE happiness???
Regression to the mean: people happier than the benchmark become less healthy / wealthy / happy over time (good → average, bad → average).
OR continuous improvement / self-fulfilling prophecy: people happier than the benchmark become even healthier / wealthier / happier (good → great, bad → worse).
Data quality is a large problem
Also: data governance, metadata, …
Why Spark?
Combination of:
Scala
Hadoop
AI / ML
Growing support and community, mature but still under heavy development
Spark is a fast and general purpose engine for large-scale data processing in memory. Spark supports streaming analytics, machine learning and graph computation. Spark's in-memory primitives provide performance up to 100 times faster than that of traditional open source big data frameworks (Hadoop MR).
Spark can be seen as a newer version of MapReduce. It uses memory (RAM) rather than the hard-drive. As memory is a thousand times faster than the hard drive, Spark can yield significant gains. Spark does not store intermediate results. Instead, it stores the path of how it got to the results, so how the results were computed. If anything has gone wrong during the process, the system can simply re-do the computation by following this path.
http://spark.apache.org/
https://github.com/apache/spark
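This recompute-from-lineage model can be seen in a minimal sketch (assuming the Spark 1.x Scala API; the file path is a placeholder):

```scala
// Minimal sketch: load a text file, cache it in memory, and count records.
// "surveys.csv" is a placeholder path.
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fortuneteller").setMaster("local[*]"))

    // An RDD remembers its lineage (how it was computed) rather than
    // storing intermediate results; lost partitions are simply recomputed.
    val lines = sc.textFile("surveys.csv").cache()
    println(s"records: ${lines.count()}")

    sc.stop()
  }
}
```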
Critical note
Works with any file system that is supported by Hadoop.
Current version: 1.4.1 (1.5 in beta)
APIs
Supported adapters
Alternatives for MLlib: R and Python (leading), Mahout, RapidMiner, SAS, SPSS, IBM, Oracle, SAP, Microsoft, …
Subfield of AI: computer science and math
Machine Learning is often defined as programming by example, based on observations or data. If a computer has learned from experience (examples, data, observations) to perform a certain task better or make a better prediction, it can be said to have learned.
Arthur Samuel (1959): “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”
Tom Mitchell (1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Examples of questions: What is the market value of this car? Which of these people are family/friends? These are problems that cannot be solved by numerical means alone.
Strong ties to computational statistics and mathematical optimization; the focus is on making predictions.
Output:
Classification → fixed categories
(Linear) regression → real value
Clustering
Density estimation
Dimension reduction → e.g. facial recognition
Types:
Supervised
“learning” is optimizing a mathematical function (hypothesis)
Often many (millions?) of input variables / dimensions
Unsupervised
VERY BIG field, rapidly expanding and growing in popularity
It’s a broad field, you really have to be careful to pick tools and techniques (algorithms).
Supervised: classification, regression
Unsupervised: clustering
There are some guidelines and cheat sheets available for helping you choose an algorithm.
DEMO 2
Linear Regression
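A sketch of what the demo's linear regression could look like with MLlib (assuming the Spark 1.x "mllib" API; labels and feature values here are made-up stand-ins for the survey columns):

```scala
// Linear regression with MLlib (Spark 1.x "mllib" API). The labels and
// feature values are made-up stand-ins for survey columns.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object RegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("regression").setMaster("local[*]"))

    // label = happiness score; features = e.g. age, partner (0/1), children
    val training = sc.parallelize(Seq(
      LabeledPoint(7.0, Vectors.dense(34.0, 1.0, 2.0)),
      LabeledPoint(5.0, Vectors.dense(52.0, 0.0, 0.0)),
      LabeledPoint(8.0, Vectors.dense(29.0, 1.0, 1.0))))

    // Fit the model, then predict a score for a new person.
    val model = LinearRegressionWithSGD.train(training, numIterations = 100)
    println(model.predict(Vectors.dense(40.0, 1.0, 3.0)))

    sc.stop()
  }
}
```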