4. LutzFinger.com
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation & Technology
16:30 Build A Team
16:45 Privacy & Ethics
18. LutzFinger.com
Data Without Action
300+ Million Members at LinkedIn
60,000 with a Job Title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
20. LutzFinger.com
How To Work With Data?
Past → Future
• What happened? Reporting, Dashboards
• What is happening? Real-Time Analytics
• What is likely to happen? Predictive Analytics
• Why did it happen? Forensics & Data Mining
• Why is it happening? Real-Time Data Mining
• What should I do about it? Prescriptive Analytics
Ref. Gartner
28. LutzFinger.com
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
What Would You Like To Do With Data?
○ Is it Actionable? “So What?”
○ Is it Reporting or Predictions?
○ Is it Sustaining, Adjunct or Disruptive?
Please Stay REAL!
29. LutzFinger.com
Agenda: 9:00 - 17:00
9:00 The right Ask
9:45 Teamwork: Discover an Ask
10:30 Coffee Break
10:45 Data is King
11:15 Decision Tree
13:00 Lunch
14:00 Pitfalls with Data
14:30 Teamwork: Which Data?
15:30 Coffee Break
15:45 Innovation
16:15 Technology
16:45 Build A Team
32. LutzFinger.com
THE 4 V's OF “BIG DATA”
• Volume: data at scale (TB, PB …)
• Variety: data in many forms (structured, unstructured ...)
• Velocity: speed (streaming, real time, near time ...)
• Veracity: uncertainty (imprecise, not always up-to-date ...)
33. LutzFinger.com
DATA
Categorical
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman
Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit
Structure:
• Structured
• Unstructured
• Semi-structured / Metadata
Read more: S. S. Stevens, “On the Theory of Scales of Measurement”, 1946
35. LutzFinger.com
The Media Industry Is One Step
Removed From The Customer
Photo by Norimutsu Nogami under the Creative Commons (CC BY 2.0)
They Do Not Know
Who Reads What &
When?
50. LutzFinger.com
Sometimes,
it’s worth it.
Source: Jeffrey Breen
RE @dave_mcgregor: Publicly
pledging to never fly @delta again.
The worst airline ever. U have lost my
patronage forever du to ur
incompetence
Completely unimpressed with
@continental or @united. Poor
communication, goofy reservations
systems and all to turn my trip into a
mess.
@SouthWestAir I know you don't
make the weather. But at least pretend
I am not a bother when I ask if the
delay will make miss my connection
57. LutzFinger.com
What Are The Features
That Describe The Target?
• Weight: light, medium, heavy - or x gram
• Size: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
58. LutzFinger.com
Which Feature Works
Best?
● The variable that carries the most information
about the target variable.
● Which variable splits the group into subsets that are as
homogeneous as possible with respect to the target variable?
(pure vs. impure)
64. LutzFinger.com
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15
p(apple) = 8/15
Mandarins: 7 out of 15
p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple)*log2(p(apple)) - p(mandarin)*log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
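The entropy calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `entropy` and the choice to pass class counts (rather than probabilities) are assumptions, and the logarithm is taken to base 2 so the result is in bits.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # an empty class contributes nothing (0 * log2(0) is treated as 0)
            p = c / total
            result -= p * math.log2(p)
    return result

# 8 apples and 7 mandarins, as on the slide
print(round(entropy([8, 7]), 4))  # 0.9968
```

A value close to 1 means the two classes are almost evenly mixed; a group containing only one class has entropy 0.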
65. LutzFinger.com
Color Red?
entropy = -p1 * log2(p1) - p2 * log2(p2)
ENTROPY (Split on Red=’no’):
= -6/8*log2(6/8) - 2/8*log2(2/8)
= 0.81
ENTROPY (Split on Red=’yes’):
= -6/7*log2(6/7) - 1/7*log2(1/7)
= 0.59
ENTROPY (After Split on Red):
= 8/15 * ENTROPY (Split on Red=’no’)
+ 7/15 * ENTROPY (Split on Red=’yes’)
= 0.43 + 0.28 = 0.71
INFORMATION GAIN
= Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Color Orange?
ENTROPY (Split on Orange=’yes’):
= -6/6*log2(6/6)
= 0
ENTROPY (Split on Orange=’no’):
= -8/9*log2(8/9) - 1/9*log2(1/9)
= 0.50
ENTROPY (After Split on Orange):
= 6/15 * ENTROPY (Split on Orange=’yes’)
+ 9/15 * ENTROPY (Split on Orange=’no’)
= 0 + 0.30 = 0.30
INFORMATION GAIN
= Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
66. LutzFinger.com
INFORMATION GAIN (IG)
Information Gain measures how much a
given feature improves (decreases) entropy
over the whole segmentation it creates.
How important is this feature for the
prediction?
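A small Python sketch of information gain, using the fruit counts from the preceding slides. The function names and the per-branch counts as `[apples, mandarins]` lists are illustrative assumptions; the counts themselves follow from the fractions on the slides (8 apples and 7 mandarins in total).

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

parent = [8, 7]                    # [apples, mandarins]
red_split = [[2, 6], [6, 1]]       # Red='no', Red='yes'
orange_split = [[0, 6], [8, 1]]    # Orange='yes', Orange='no'

print(round(information_gain(parent, red_split), 2))     # 0.29
print(round(information_gain(parent, orange_split), 2))  # 0.69
```

The split on Orange yields the larger gain, so it is the more informative feature for this toy data set.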
86. LutzFinger.com
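TRUE NEGATIVE
Specificity
= # of true negatives / # of actual negatives (truth = Did Not Buy)
also: Specificity = 1 - False positive rate
Confusion matrix (rows: Classifier, columns: Truth)
                        | Truth: Bought  | Truth: Did Not Buy
Classifier: Bought      | true positive  | false positive
Classifier: Did Not Buy | false negative | true negative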
87. LutzFinger.com
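PRECISION
= # of true positives / total # in this prediction class (classifier = Bought)
Confusion matrix (rows: Classifier, columns: Truth)
                        | Truth: Bought  | Truth: Did Not Buy
Classifier: Bought      | true positive  | false positive
Classifier: Did Not Buy | false negative | true negative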
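The two metrics from the confusion-matrix slides can be written down directly. The sketch below is illustrative only: the function names and the four counts are made up for the example and do not come from the deck.

```python
def specificity(tn, fp):
    """Share of actual negatives (did not buy) that the classifier got right."""
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    """Share of actual negatives wrongly predicted as 'bought'."""
    return fp / (tn + fp)

def precision(tp, fp):
    """Share of 'bought' predictions that were actually correct."""
    return tp / (tp + fp)

# Hypothetical counts, chosen only for illustration
tp, fp, fn, tn = 30, 10, 20, 40

print(specificity(tn, fp))  # 0.8
print(precision(tp, fp))    # 0.75
```

Specificity looks down a truth column of the matrix, while precision looks across a classifier row, which is why the two can differ widely on the same model.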
94. LutzFinger.com
Overfitting
Tailoring a model to the training data at the expense of
generalizing to previously unseen data points. The model
becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
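A toy sketch of this trade-off in Python. Everything here is an assumption made for illustration: the synthetic data (label is "x >= 50" with every tenth label flipped as noise), the 1-nearest-neighbour "memorizer" as the complex model, and a single threshold as the simple model.

```python
def noisy_label(x, flip_remainder):
    """True label is x >= 50; flip it on every 10th point to simulate noise."""
    y = x >= 50
    return (not y) if x % 10 == flip_remainder else y

train = [(x, noisy_label(x, 0)) for x in range(0, 100, 2)]  # even x values
test = [(x, noisy_label(x, 5)) for x in range(1, 100, 2)]   # odd x values

def memorizer(x):
    """Complex model: copy the label of the closest training point (1-NN)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def simple_rule(x):
    """Simple model: one threshold, ignores individual noisy points."""
    return x >= 50

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train), accuracy(memorizer, test))      # 1.0 0.6
print(accuracy(simple_rule, train), accuracy(simple_rule, test))  # 0.8 0.8
```

The memorizer is perfect on the training data because it reproduces the noise, and it pays for that on unseen points; the simpler rule scores lower in training but holds up on the test set.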
96. LutzFinger.com
The Story of MORE Data
Decision Trees are good at identifying LOCAL
patterns, but they often need more data.
Claudia Perlich et al., “Tree Induction vs. Logistic Regression: A Learning-Curve Analysis”,
Journal of Machine Learning Research 4 (2003) 211-255
98. LutzFinger.com
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
○ do only you have this data?
○ do you have a positive feedback loop?
○ is the data sustainable?
○ who else could get the data?
○ how much data is needed?
101. LutzFinger.com
Issue Of Yahoo
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan a 100 TB dataset on a 1000-node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
With 1000 nodes, expect about one machine to break per day
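The scan-time comparison is simple arithmetic and can be checked with a short sketch. The assumptions here are mine, not the deck's: a total of 100 TB (the dataset size for which the arithmetic roughly reproduces the slide's 165 min and 8 min), data spread evenly over the nodes, all nodes reading in parallel, and decimal units (1 TB = 1,000,000 MB).

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to read its equal share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # assumes 1 TB = 1,000,000 MB
    return mb_per_node / mb_per_second / 60

print(round(scan_minutes(100, 1000, 10)))   # 167 (remote storage)
print(round(scan_minutes(100, 1000, 200)))  # 8   (local storage)
```

The roughly 20x gap comes entirely from the per-node read throughput, which is the argument for moving the computation to where the data is stored.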
102. LutzFinger.com
The Vision
CHEAP Systems
• can run on commodity hardware
Computations are done DECENTRALIZED
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
• a failure, no matter where and when, is not an issue
104. LutzFinger.com
Typical Workflow
· Load data into the cluster (HDFS writes)
· Analyze the data (Map Reduce)
· Store results in the cluster (HDFS writes)
· Read the results from the cluster (HDFS reads)
Sample Scenario:
Huge file (File.txt) containing all emails sent
to customer service
How many times did our customers type the word “Refund”
into emails sent to customer service?
Ref. BradHedlund.com
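The “Refund” scenario can be imitated on a single machine to show the shape of a map and a reduce step. This is a plain-Python sketch, not Hadoop code: the sample email blocks, the function names, and the whole-word, case-insensitive matching rule are assumptions made for illustration. On a real cluster each block would live on a different node and the map step would run where the block is stored.

```python
import re
from functools import reduce

def map_count(block):
    """Map step: count occurrences of the word 'refund' in one block of text."""
    return len(re.findall(r"\brefund\b", block, flags=re.IGNORECASE))

def reduce_counts(a, b):
    """Reduce step: combine two partial counts."""
    return a + b

# Hypothetical blocks of the email file, one per node
blocks = [
    "Please refund my order. I asked for a Refund last week.",
    "Where is my parcel?",
    "REFUND now, or I will cancel.",
]

partial_counts = [map_count(block) for block in blocks]  # one result per block
total = reduce(reduce_counts, partial_counts, 0)

print(partial_counts)  # [2, 0, 1]
print(total)           # 3
```

Because each block is counted independently, the map step parallelizes trivially; only the small partial counts have to travel over the network to the reducer.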
106. LutzFinger.com
Via The Normal Languages
Storage: Hadoop Storage (HDFS / HBase / Solr)
Engine: MapReduce
Languages on top of MapReduce:
• Hive: SQL-like
• Pig/Cascading: scripting-like
• Giraph: graph-oriented
• Mahout: ML engine
107. LutzFinger.com
Pro & Con
Storage: Hadoop Storage (HDFS / HBase / Solr)
Engine: MapReduce
Languages on top of MapReduce:
• Hive: SQL-like
• Pig/Cascading: scripting-like
• Giraph: graph-oriented
• Mahout: ML engine
Steps: Store, ETL (Extract / Transform / Load), DB / Key Value Store, Visualize
Pro:
way better than traditional BI
Con:
Heavy tech involvement. 12-18 months for a non-tech company to
implement a schema
108. LutzFinger.com
Hadoop 2.0
Storage: Hadoop Storage (HDFS / HBase / Solr)
Engines: MapReduce, Spark, Tez
• On MapReduce: Hive, Pig/Cascading, Giraph, Mahout
• On Spark: Hive, Pig/Cascading, Giraph, Mahout
• On Tez: Pig/Cascading, Hive
• Also: Impala/Presto, H2O/Oryx
Tool types: SQL-like, scripting-like, graph-oriented, ML engine
Store in DB
Visualize
110. LutzFinger.com
Ingredients of Data Products
Ask
• The question?
• The need?
• The Why?
Measure
• The Data?
• The features?
• The algorithms?
Team
• The right Skills?
• Collaboration
All of them are necessary - None of them are sufficient!
111. LutzFinger.com
How To Ingest Ideas
Hack - Days & Incubator
Internal Process
External Competition
Close Collaboration between
Business & Data Scientists
“All we do is Data” - Jeff Weiner
118. LutzFinger.com
Old vs. New
                                    | Old School            | Today / Big data
Data Amount                         | Gigabytes & Terabytes | Petabytes & Exabytes
IT Infrastructure                   | Centralized           | Decentralized / Parallelized
Data Types                          | Structured            | Structured & unstructured
Schema                              | Stable schema         | Schema on the fly
When and How is the ASK formulated? | Set ask               | Ad-hoc ask
129. LutzFinger.com
In the EU, insurers will no
longer be allowed to take the
gender of their customers into
account for insurance
premiums:
● young men's premiums
will fall by up to 10%
● young women's premiums
will rise by up to 30%
by: BBC News: http://www.bbc.com/news/business-12608777
Not Everything That Is Possible Is
Legal
130. LutzFinger.com
Let me analyze your Social
Network Connections. If they
are “trustworthy”, you will
get Credit more easily.
Ethical or Not?
by: BBC News: http://www.bbc.com/news/business-12608777
How About Community Profiling?