The business world is increasingly adopting the Moneyball principle of using data to predict and gain a competitive advantage in healthcare, telecommunications, retail, media, energy, and many other industries. Some argue that organizations that do not possess strong data and the skills to create value out of it will not survive. How can companies leverage data - sometimes described as the “new gold” - for consumer insights, improved processes or new product ideas? Can data assets be leveraged effectively for the overall business?
4. LutzFinger.com
Agenda
The right Ask
Data is King
Team-Work: Discover an Ask
Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Break
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team
15. LutzFinger.com
Already Known Asks
by rgieseking under Creative Commons (CC BY 2.0)
Who should get an E-Shot?
Territory Planning for my Sales Force
Budget Planning of Marketing Spend
Online Product Recommendation
Real Time Bidding for Ad-spaces
Customer Segmentation
Social Media Influencers
Call Center Routing based on Questions
Capacity Forecasting
… and more
21. LutzFinger.com
A bad ‘so what’
300+ Million Members at LinkedIn
60,000 with a Job Title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
30. LutzFinger.com
V's OF “BIG DATA”
Volume: Data at scale (TB, PB …)
Variety: Data in many forms (structured, unstructured ...)
Velocity: Speed (streaming, real time, near time ...)
Veracity: Uncertainty (imprecise, not always up-to-date ...)
41. LutzFinger.com
Sometimes, it’s worth it.
RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence
Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.
@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
42. LutzFinger.com
But Data Is King
This will give birth to devices (i.e., the Star Trek Tricorder) that allow you, the consumer, to self-diagnose, anytime, anywhere.
45. LutzFinger.com
The media industry has changed! The retail industry has changed! The education sector is changing! The finance industry and the healthcare sector are under attack. Which industry will be next?
46. LutzFinger.com
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
A. The Ask
○ is it actionable? “So What?”
○ is it Benchmarking or is it Recommendation?
B. The Data
○ do only you have this data?
○ do you have a feedback loop?
54. LutzFinger.com
What Are The Features To Describe The Target?
• Weight: light, medium, heavy - or x gram
• Size: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
55. LutzFinger.com
Which Feature Works Best?
● The variable that carries the most information about the target variable.
● The variable that splits the group into subgroups that are as homogeneous as possible with respect to the target variable (pure vs. impure).
61. LutzFinger.com
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15
p(apple) = 8/15
Mandarins: 7 out of 15
p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple) * log2(p(apple)) - p(mandarin) * log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
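The entropy calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `entropy` and the choice to pass raw class counts are assumptions made for the example, and the logarithm is taken to base 2.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a group described by its class counts."""
    total = sum(counts)
    # Classes with zero members contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# Counts from the slide: 8 apples and 7 mandarins out of 15 fruits.
print(round(entropy([8, 7]), 3))  # 0.997 -> very impure
print(entropy([5, 5]))            # 1.0   -> a 50/50 mix is as impure as it gets
print(entropy([15, 0]))           # 0.0   -> a pure group
```

A value close to 1 means the group is almost evenly mixed; a value of 0 means every item belongs to the same class.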
62. LutzFinger.com
Color Red?
entropy = -p1 * log2(p1) - p2 * log2(p2)
ENTROPY (Split on Red='no'):
= -6/8 * log2(6/8) - 2/8 * log2(2/8)
= 0.81
ENTROPY (Split on Red='yes'):
= -6/7 * log2(6/7) - 1/7 * log2(1/7)
= 0.59
ENTROPY (After Split on Red):
= 8/15 * ENTROPY (Split on Red='no') + 7/15 * ENTROPY (Split on Red='yes')
= 0.43 + 0.28 = 0.71
INFORMATION GAIN
= Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Color Orange?
ENTROPY (Split on Orange='yes'):
= -6/6 * log2(6/6)
= 0
ENTROPY (Split on Orange='no'):
= -8/9 * log2(8/9) - 1/9 * log2(1/9)
= 0.50
ENTROPY (After Split on Orange):
= 6/15 * ENTROPY (Split on Orange='yes') + 9/15 * ENTROPY (Split on Orange='no')
= 0 + 0.30 = 0.30
INFORMATION GAIN
= Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
63. LutzFinger.com
INFORMATION GAIN (IG)
Information gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.
How important is this feature for the prediction?
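As a rough sketch of how information gain can be used to pick the best feature, the snippet below recomputes the fruit example in Python. The helper names `entropy` and `information_gain` are illustrative assumptions. The branch sizes and class proportions are taken from the worked example; which fruit forms the majority in each branch is an assumption and does not change the entropy.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a group described by its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent group minus the size-weighted entropy of its children."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

# Counts are (apples, mandarins).
parent = (8, 7)
splits = {
    "red":    [(6, 1), (2, 6)],  # red = yes, red = no
    "orange": [(0, 6), (8, 1)],  # orange = yes, orange = no
}

gains = {feature: information_gain(parent, kids) for feature, kids in splits.items()}
best = max(gains, key=gains.get)
print(best)  # orange
```

A decision tree learner repeats this comparison at every node: it splits on the feature with the highest gain and then recurses into each branch.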
87. LutzFinger.com
Social Network Info
Could social network information improve the quality of our prediction?
Who is more creditworthy?
a. Tim, whose friends are all very creditworthy
b. Tom, whose friends are not creditworthy
90. LutzFinger.com
Not Everything That Is Possible Is Legal
In the EU insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:
● young men's premiums will fall by up to 10%
● young women's premiums will rise by up to 30%
Source: BBC News: http://www.bbc.com/news/business-12608777
93. LutzFinger.com
Overfitting
Tailoring a model to the training data at the expense of its ability to generalize to previously unseen data points. The model becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
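The trade-off can be made tangible with a toy experiment. The sketch below is an illustration under stated assumptions, not material from the deck: the data is synthetic (the true rule is `x > 0.5`, with 20% of labels flipped as noise), the "simple" model is a single threshold, and the "complex" model is a 1-nearest-neighbour memorizer. All function names are made up for the example.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def accuracy(predict, data):
    """Share of points for which the prediction matches the label."""
    return sum(predict(x) == label for x, label in data) / len(data)

def fit_threshold(train):
    """Simple model: the single cut-off that classifies the training data best."""
    candidates = [i / 100 for i in range(101)]
    best = max(candidates, key=lambda t: accuracy(lambda x: x > t, train))
    return lambda x: x > best

def fit_memorizer(train):
    """Complex model: 1-nearest-neighbour, which reproduces every training label."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]

rng = random.Random(0)
train, test = make_data(200, rng), make_data(1000, rng)
simple, memorizer = fit_threshold(train), fit_memorizer(train)

for name, model in [("simple", simple), ("memorizer", memorizer)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorizer scores perfectly on the data it was trained on because it has also learned the noise, and it pays for that on the unseen test set, where the simple threshold does better.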
102. LutzFinger.com
The Issue At Yahoo
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan a 100 TB dataset on a 1000-node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
With 1000 nodes, a machine a day will break
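The scan times are simple back-of-the-envelope arithmetic, sketched below. Two assumptions are made for the example: the quoted 165 and 8 minutes correspond to 100 TB in total spread evenly over the 1000 nodes (decimal units, 1 TB = 1,000,000 MB), and "a machine a day" corresponds to each machine failing roughly once every 1000 days. Neither figure is stated on the slide, and the function names are hypothetical.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to read its even share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # decimal units: 1 TB = 1,000,000 MB
    return mb_per_node / mb_per_second / 60

def expected_failures_per_day(nodes, mean_days_between_failures):
    """Average number of machines breaking per day across the cluster."""
    return nodes / mean_days_between_failures

print(round(scan_minutes(100, 1000, 10)))     # 167 -> remote storage, roughly the quoted 165 min
print(round(scan_minutes(100, 1000, 200)))    # 8   -> local storage
print(expected_failures_per_day(1000, 1000))  # 1.0 -> about one broken machine per day
```

The 20x gap between remote and local reads is the argument for moving the computation to the data rather than the data to the computation.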
103. LutzFinger.com
The Vision
CHEAP Systems
• can run on commodity hardware
Computations are done DECENTRALIZED
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
• a break is not an issue, no matter where or when it happens
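The "dispatch and parallelize" idea is what the map/reduce pattern provides. The toy word count below is a sketch in plain Python, not the Hadoop API: the function names and the sample chunks are made up for illustration, and everything runs sequentially in one process. The point is that each `map_phase` call depends only on its own chunk, so a framework could send it to any node and simply rerun it elsewhere if that node breaks.

```python
from collections import defaultdict

def map_phase(chunk):
    """Runs independently on each chunk, so it could be dispatched to any node."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped):
    """Group all values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key; each key can be reduced independently."""
    return key, sum(values)

chunks = ["data is king", "Data is a team sport", "the ask is king"]
mapped = [map_phase(chunk) for chunk in chunks]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts["is"], counts["data"], counts["king"])  # 3 2 2
```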
106. LutzFinger.com
Via The Normal Languages
Hadoop Storage (HDFS / HBase / Solr)
MapReduce
Hive: SQL Like
Pig/Cascading: Scripting Like
Giraph: Graph Oriented
Mahout: ML Engine
107. LutzFinger.com
Pro & Con
Store
ETL: Extract / Transform / Load
DB / Key Value Store
Visualize
Pro:
Way better than traditional BI
Con:
Heavy tech involvement. 12-18 months for a non-tech company to implement a schema
108. LutzFinger.com
Pro & Con
DB / Key Value Store
Visualize
New Approaches:
● Spark
● Tez
● Flink
111. LutzFinger.com
Ingredients of Data Products
Ask
The question?
The need?
The Why?
Measure
The Data?
The features?
The algorithms?
Team
The right Skills?
Collaboration
All of them are necessary - None of them is sufficient!
112. LutzFinger.com
How To Ingest Ideas
Hack - Days & Incubator
Internal Process
External Competition
Close Collaboration between Business & Data Scientists
“All we do is Data” - Jeff Weiner
115. LutzFinger.com
Old vs. New
Criterion | Old School | Today / Big Data
Data Amount | Gigabytes & Terabytes | Petabytes & Exabytes
IT Infrastructure | Centralized | Decentralized / Parallelized
Data Types | Structured | Structured & unstructured
Schema | Stable schema | Schema on the fly
When and how is the Ask formulated? | Set ask | Ad-hoc ask
121. LutzFinger.com
You Learned
image by Mike under Creative Commons
• The Ask is the most important part - you need Domain Knowledge
• Data Science is NO Rocket Science
• Data is King & there is a Monopoly Game happening
• Data can be misleading
• Data is a Team Sport