10/24/2017
c o n f i d e n t i a l
twitter: @mich8elwu
linkedin.com/in/MichaelWuPhD
the black art of
machine learning
Michael Wu, PhD (@mich8elwu)
chief scientist @ lithium tech
2017.10.31
2017.09.28

THE POWER OF BIG DATA + DATA SCIENCE
• data → info → insight
  - buying calcium, zinc, magnesium, cotton balls, and switching to unscented lotions + soaps is a predictor of pregnancy
• decision → action
  - coupons for moms, timed to specific stages of pregnancy
• result
  - ↗ revenue: $44B (2002) → $67B (2010)
big data + analytics: "btw, did you know your daughter is pregnant?"

THE POWER OF BIG DATA + DATA SCIENCE
• data → info → insight
  - filling out a loan application in only capital or only lowercase letters is predictive of loan default
• decision → action
  - augment traditional underwriting regression model w/ thousands of variables & 10+ models
• result
  - ↘ loan default rate by 40%
  - ↗ market share by 25%

DATA ≠ INFORMATION ≠ INSIGHT
• data has a huge amount of statistical redundancy
  - duplication
  - spatial + temporal correlation
  - collinearity (causality)
• much of the info we extract from the data is not insightful
• insights must be
  - interpretable
  - relevant
  - novel (not already known)
big data that's not statistically redundant = information
information that's not already known = insight
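a minimal sketch of two of the redundancy types listed above (duplication and collinearity). the table, its column names, and the 0.99 correlation cutoff are made-up assumptions for illustration, not part of the talk.

```python
import math

# Hypothetical rows: (temp_c, temp_f, visits). temp_f is just temp_c in another unit.
columns = ["temp_c", "temp_f", "visits"]
rows = [
    (20.0, 68.0, 3),
    (25.0, 77.0, 5),
    (25.0, 77.0, 5),  # exact duplicate of the previous row
    (30.0, 86.0, 4),
    (35.0, 95.0, 8),
]

# Duplication: keep the first copy of each row, preserving order.
unique_rows = list(dict.fromkeys(rows))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Collinearity: flag column pairs that are (almost) perfectly correlated.
redundant_pairs = []
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        col_i = [row[i] for row in unique_rows]
        col_j = [row[j] for row in unique_rows]
        if abs(pearson(col_i, col_j)) > 0.99:  # assumed cutoff
            redundant_pairs.append((columns[i], columns[j]))

print(len(rows), "rows ->", len(unique_rows), "unique rows")
print("redundant column pairs:", redundant_pairs)
```

the duplicate row and the second temperature column add data but no information: everything they say is already carried by the rest of the table.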

NOT ALL DATA/INFORMATION ARE RELEVANT
• relevant data: signal vs. noise
• relevance is context specific
  - who: one man's signal is another man's noise
(diagram: data, information, insight; regions "relevant to me" and "relevant to you"; everything else is noise)

NOT ALL DATA/INFORMATION ARE RELEVANT
• relevant data: signal vs. noise
• relevance is context specific
  - who: one man's signal is another man's noise
  - when?
  - where?
  - what's relevant is determined by the problem you are trying to solve or the question you are trying to answer
(diagram: data, information, insight; region "relevant to me when I am traveling in Istanbul today"; everything else is noise)
context is usually specified in the problem/question

big data is very noisy

WHY IS BIG DATA SO NOISY?
• how did people use data before big data?
  problem/question (Q) → data collection → data
  - data is collected specifically to address the problem/question
  - data is almost always relevant

WHAT HAPPENS W/ BIG DATA TECHNOLOGIES?
• enables data capture/storage before we have a question
  data collection → problem/question (Q)
  - data is collected irrespective of any specific problem / question / purpose
  - must find the "relevant data" whenever we get a problem/question
  - most big data will be irrelevant (only a tiny % of it will be relevant)

DATA ≠ INFORMATION ≠ INSIGHT
• for all data (any data): data ≥ information ≥ insight
• for big data: data >> information >> insight
• "a single grain of rice can tip the scale"
• "1 bit of insightful info. may be the difference between victory and defeat"

WHERE DO YOU LOOK IN YOUR BIG DATA TO FIND INSIGHTS?
• look beyond what's relevant
  - look at what you thought were the irrelevant data/info
• don't look too far beyond your relevance boundary
  - it's costly and wasteful
  - hard to establish causality
• you might not find anything, but when you do, it will be insightful
  - zest finance
(diagram: data, information, insight; relevant signal vs. noise)

DO BIZ REALLY WANT BIG DATA?
• big data tech.: hadoop, hive, hbase, pig, noSQL, impala, spark, storm …
• business needs: insight
• big data → information → insight: huge gap

THE BIG DATA GAP: FROM DATA TO INSIGHTS
big data → information → insight: ?

FROM DATA TO INSIGHTS
big data → information → insight
a data scientist is currently the only way companies know how to fill this gap

so what do data scientists do?

DATA SCIENCE INPUT + OUTPUT
input → output

STEP 1: GET THE DATA
data scientist is ~50% data janitor
- normalization: type, range, unit, format, foreign key ref …
- exception handling: spam, missing data, incomplete data …
- dedupe, metadata tagging …
- POS tagging
- entity detection
- sampling + sample selection
- special handling for rich media
- …
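a small sketch of the first three janitor chores above (normalization, exception handling for missing data, dedupe). the record layout, field names, and the choice to drop incomplete records are hypothetical; a real pipeline would pick its own rules.

```python
# Hypothetical raw records: everything arrives as strings, in mixed units.
raw_records = [
    {"id": "1", "height": "180 cm", "email": "A@Example.com "},
    {"id": "2", "height": "5.9 ft", "email": "b@example.com"},
    {"id": "2", "height": "5.9 ft", "email": "b@example.com"},  # duplicate
    {"id": "3", "height": "", "email": "c@example.com"},         # missing height
]

def normalize(record):
    """Normalize type (id -> int), unit (height -> cm), and format (email)."""
    value, _, unit = record["height"].partition(" ")
    if not value:
        height_cm = None  # missing data is marked, not guessed
    elif unit == "ft":
        height_cm = round(float(value) * 30.48, 1)
    else:
        height_cm = float(value)
    return {
        "id": int(record["id"]),
        "height_cm": height_cm,
        "email": record["email"].strip().lower(),
    }

def clean(records):
    """Normalize, skip incomplete records, and dedupe on id."""
    seen_ids = set()
    cleaned = []
    for record in map(normalize, records):
        if record["height_cm"] is None:
            continue  # exception handling: here we simply drop incomplete rows
        if record["id"] in seen_ids:
            continue  # dedupe: keep the first record per id
        seen_ids.add(record["id"])
        cleaned.append(record)
    return cleaned

cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```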

RAW DATA USUALLY DON'T PERFORM WELL
raw data: text, image, sound, video, directly measured data, etc.
"Hello, how are you?"
072 101 108 108 111 044 032 104 111 119 032 097 114 101 032 121 111 117 063
can a machine tell a bird from a plane? how?
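the number string on this slide is what the greeting looks like to a machine: one ASCII code per character. a short sketch that reproduces it (the zero-padded, space-separated formatting is chosen only to match the slide):

```python
text = "Hello, how are you?"

# Encode to ASCII bytes and print each byte as a zero-padded 3-digit decimal code.
codes = " ".join(f"{byte:03d}" for byte in text.encode("ascii"))
print(codes)
```

nothing in these raw codes says "greeting" or "question"; that meaning has to be derived.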

RAW DATA USUALLY DON'T PERFORM WELL
raw data: text, image, sound, video, directly measured data, etc.
(plot: probability of bird (0) / plane (1) vs. any pixel's color/intensity; ~50% for both)
the info in a pixel is not discriminating enough for this task

RAW DATA USUALLY DON'T PERFORM WELL
raw data: text, image, sound, video, directly measured data, etc.
(scatter plot: any pixel's color/intensity vs. another pixel's color/intensity; points labeled bird (0) and plane (1))
any pixel can be part of a bird or a plane.
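a toy simulation of the point above, under a loudly stated assumption: a single pixel's intensity is drawn from the same distribution whether the image shows a bird or a plane. the data are synthetic and the threshold classifier is only an illustration, not the experiment behind the slide.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def sample_pixel():
    # Assumption: one pixel's intensity (0-255) is distributed identically
    # for both classes, since any pixel can be part of a bird or a plane.
    return int(random.random() * 256)

# 1000 synthetic "bird" (0) and 1000 synthetic "plane" (1) images, one pixel each.
data = [(sample_pixel(), label) for label in (0, 1) for _ in range(1000)]

def best_threshold_accuracy(samples):
    """Try every intensity threshold (in both directions) and keep the best accuracy."""
    best = 0.0
    for threshold in range(257):
        correct = sum((pixel >= threshold) == bool(label) for pixel, label in samples)
        accuracy = max(correct, len(samples) - correct) / len(samples)
        best = max(best, accuracy)
    return best

accuracy = best_threshold_accuracy(data)
print(f"best single-pixel accuracy: {accuracy:.3f}")
```

even the best possible threshold stays close to a coin flip, which is what "~50%" on the plot is saying.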

raw data are not only noisy and "dirty," they are bad features!

FEATURES AND FEATURE ENGINEERING
raw data: text, image, sound, video, directly measured data, etc.
  - online application: x = name, age, ID info, loan amount, income, spending + payment habit …
feature engineering: the extraction of implicit (or externally derived) information in (from) the raw data
features: any information you derive from the raw data and make explicit to the learning algorithm
(plot: normalized default rate vs. any raw data (names, age, loan amount, income); ~10% throughout)
(plot: normalized default rate vs. any feature; axis label: height)
example features:
- loan / income
- debt / income
- avg. monthly spent / income
- frequency of late (or early) payment, income (or spent) volatility (stdev), married or not, have kids or not …
- use of proper capitalization in the application
- # saves before submitting, avg. time between saving (or opening) the application, date, time, day of week when filling the application …
- hair color, eye color, height, weight …
- where did they fill out the application, sunny or rainy when filling the application …
the info doesn't even have to be in the raw data; it just has to be derivable
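a minimal sketch of feature engineering on one loan application, covering a few of the example features on this slide. the field names, the sample values, and the capitalization rule are assumptions made for illustration; this is not the lender's actual model.

```python
# Hypothetical raw application, as typed by the applicant.
application = {
    "name": "JOHN SMITH",
    "loan_amount": 20000,
    "income": 50000,
    "debt": 10000,
    "avg_monthly_spend": 2500,
}

def uses_proper_capitalization(text):
    """Assumed rule: text typed in all caps or all lowercase is not 'proper'."""
    return not (text.isupper() or text.islower())

def engineer_features(app):
    """Derive ratios and behavior flags that are only implicit in the raw fields."""
    return {
        "loan_to_income": app["loan_amount"] / app["income"],
        "debt_to_income": app["debt"] / app["income"],
        # assumes spend and income are quoted in units the modeler has reconciled
        "spend_to_income": app["avg_monthly_spend"] / app["income"],
        "proper_capitalization": uses_proper_capitalization(app["name"]),
    }

features = engineer_features(application)
print(features)
```

none of the four features is a field on the form, yet all of them are derivable from it, which is the sense of "make explicit to the learning algorithm".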

COMING UP WITH BETTER FEATURES
raw data → feature engineering → features → statistics → model
raw data: text, image, sound, video, directly measured data, etc.
features: any information you derive from the raw data and make explicit to the learning algorithm
  - birds: have beak, have eyes, have feet, have feathers …
  - planes: have stabilizers, have engines, have windows …
model: obtained by optimizing some objective function (error, likelihood, etc.) + model validation
(plot: occurrence probability of beak vs. occurrence probability of stabilizer)
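a sketch of "model = optimize an objective function + validate", using the two features from the plot. the beak/stabilizer scores are invented toy numbers, and logistic regression with plain gradient descent is just one possible choice of model and optimizer, swapped in here for illustration.

```python
import math

# Hypothetical features per image: (beak score, stabilizer score); label 0 = bird, 1 = plane.
train = [((0.9, 0.1), 0), ((0.8, 0.2), 0), ((0.7, 0.1), 0),
         ((0.1, 0.9), 1), ((0.2, 0.8), 1), ((0.1, 0.7), 1)]
valid = [((0.85, 0.15), 0), ((0.15, 0.85), 1)]  # held out for model validation

def predict(weights, bias, x):
    """Probability that x is a plane under a logistic model."""
    z = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, data):
    """Objective function: mean negative log-likelihood."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fit(data, learning_rate=0.5, steps=500):
    """Minimize the log loss with batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            error = predict(weights, bias, x) - y
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        n = len(data)
        weights = [weights[0] - learning_rate * grad_w[0] / n,
                   weights[1] - learning_rate * grad_w[1] / n]
        bias -= learning_rate * grad_b / n
    return weights, bias

initial_loss = log_loss([0.0, 0.0], 0.0, train)
weights, bias = fit(train)
final_loss = log_loss(weights, bias, train)
valid_accuracy = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in valid) / len(valid)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, validation accuracy {valid_accuracy:.2f}")
```

with features this discriminating, a very simple model is enough; the hard part was getting the features.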

COMING UP WITH BETTER FEATURES
raw data → feature engineering → features → statistics → model
machine learning: feature engineering + statistics

data science is ~25% handcrafting … of features

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." - Andrew Ng
hand crafted features are:
- domain specific,
- task specific,
- not generalizable

CAN WE LEARN "GOOD FEATURES" DIRECTLY FROM THE DATA?
raw data → feature engineering → features → statistics → model
  - birds: have beak, have eyes, have bird feet, have feathers …
  - planes: have stabilizers, have engines, have windows …

CAN WE LEARN "GOOD FEATURES" DIRECTLY FROM THE DATA?
raw data → feature engineering → features → statistics → model
features: shapes, edges, pixels

DEEP LEARNING
traditional machine learning: features handcrafted by experts; works for most (80%) of the problems in business
  raw data → feature engineering → features → statistics → model
deep learning: deep neural network; features automatically learned from the data with different levels of abstraction
  input → layer 1 → layer 2 → layer 3 → ....
  - combination of pixels → edges
  - combination of edges → object parts
  - combination of parts → the object
(figure: learned features for faces, cars, elephants, chairs, and for faces + cars + airplanes + motorbikes)
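a hand-written illustration of the first two steps of the hierarchy above (pixels → edges → a simple part). the image, the fixed difference filter, and the threshold are assumptions made for the sketch; in a deep network these combinations are learned from the data rather than coded by hand.

```python
# Hypothetical 3x6 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

def pixels_to_edges(img):
    """Layer 1 stand-in: combine neighboring pixels into edge strength."""
    return [[abs(row[c + 1] - row[c]) for c in range(len(row) - 1)] for row in img]

def edges_to_parts(edge_map, threshold=5):
    """Layer 2 stand-in: combine edge responses across rows into vertical lines."""
    width = len(edge_map[0])
    return [c for c in range(width) if all(row[c] > threshold for row in edge_map)]

edges = pixels_to_edges(image)
vertical_edge_columns = edges_to_parts(edges)
print("edge map row:", edges[0])
print("vertical line at column boundary:", vertical_edge_columns)
```

each stage only combines the outputs of the stage below it, which is the "different levels of abstraction" idea on the slide.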

DEEP LEARNING
traditional machine learning: features handcrafted by experts; works for most (80%) of the problems in business
deep learning: deep neural network; features automatically learned from the data with different levels of abstraction
google brain:
- 16,000 cpu
- 1,000,000,000+ connections
- 10,000,000 training images from youtube
extraordinarily generalizable: makes machines behave & think more like humans, but requires lots of data to train
success stories:
- computer vision: image labeling, search …
- audio signal processing: speaker ID, speech recognition (speech-text) …
- text processing: machine translation, etc.
interesting problems in the industry:
- sentiment analysis
- actionability & intention prediction
- fraud, spam detection …

WHAT DO DATA SCIENTISTS DO?
raw data → feature engineering → features → statistics → model
- plumbing, cleaning, janitoring, handcrafting
- communication: data visualization, storytelling, translation of data to business insights, decisions, and action
skills: computer science, math + statistics, domain expertise

WHAT DO DATA SCIENTISTS DO?
data science: domain expertise + math + statistics + computer science

thank you, q&a, + follow me
twitter: @mich8elwu
linkedin.com/in/MichaelWuPhD

want to dig deeper?
sos: http://pages.lithium.com/science-of-social
sos2: http://www.lithium.com/library/science-of-social-2

Netflix Ads The Game Changer in Video Ads – Who Needs YouTube.pptx (Chester Y...ChesterYang6
 

Recently uploaded (20)

marketing strategy of tanishq word PPROJECT.pdf
marketing strategy of tanishq word PPROJECT.pdfmarketing strategy of tanishq word PPROJECT.pdf
marketing strategy of tanishq word PPROJECT.pdf
 
Brand Strategy Master Class - Juntae DeLane
Brand Strategy Master Class - Juntae DeLaneBrand Strategy Master Class - Juntae DeLane
Brand Strategy Master Class - Juntae DeLane
 
pptx.marketing strategy of tanishq. pptx
pptx.marketing strategy of tanishq. pptxpptx.marketing strategy of tanishq. pptx
pptx.marketing strategy of tanishq. pptx
 
Do More with Less: Navigating Customer Acquisition Challenges for Today's Ent...
Do More with Less: Navigating Customer Acquisition Challenges for Today's Ent...Do More with Less: Navigating Customer Acquisition Challenges for Today's Ent...
Do More with Less: Navigating Customer Acquisition Challenges for Today's Ent...
 
Kraft Mac and Cheese campaign presentation
Kraft Mac and Cheese campaign presentationKraft Mac and Cheese campaign presentation
Kraft Mac and Cheese campaign presentation
 
Turn Digital Reputation Threats into Offense Tactics - Daniel Lemin
Turn Digital Reputation Threats into Offense Tactics - Daniel LeminTurn Digital Reputation Threats into Offense Tactics - Daniel Lemin
Turn Digital Reputation Threats into Offense Tactics - Daniel Lemin
 
Social Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdfSocial Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdf
 
Mastering SEO in the Evolving AI-driven World
Mastering SEO in the Evolving AI-driven WorldMastering SEO in the Evolving AI-driven World
Mastering SEO in the Evolving AI-driven World
 
Unraveling the Mystery of Roanoke Colony: What Really Happened?
Unraveling the Mystery of Roanoke Colony: What Really Happened?Unraveling the Mystery of Roanoke Colony: What Really Happened?
Unraveling the Mystery of Roanoke Colony: What Really Happened?
 
GreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web RevolutionGreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web Revolution
 
Branding strategies of new company .pptx
Branding strategies of new company .pptxBranding strategies of new company .pptx
Branding strategies of new company .pptx
 
SEO Master Class - Steve Wiideman, Wiideman Consulting Group
SEO Master Class - Steve Wiideman, Wiideman Consulting GroupSEO Master Class - Steve Wiideman, Wiideman Consulting Group
SEO Master Class - Steve Wiideman, Wiideman Consulting Group
 
BUY GMAIL ACCOUNTS PVA USA IP INDIAN IP GMAIL
BUY GMAIL ACCOUNTS PVA USA IP INDIAN IP GMAILBUY GMAIL ACCOUNTS PVA USA IP INDIAN IP GMAIL
BUY GMAIL ACCOUNTS PVA USA IP INDIAN IP GMAIL
 
Avoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG complianceAvoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG compliance
 
The Skin Games 2024 25 - Sponsorship Deck
The Skin Games 2024 25 - Sponsorship DeckThe Skin Games 2024 25 - Sponsorship Deck
The Skin Games 2024 25 - Sponsorship Deck
 
TOP DUBAI AGENCY OFFERS EXPERT DIGITAL MARKETING SERVICES.pdf
TOP DUBAI AGENCY OFFERS EXPERT DIGITAL MARKETING SERVICES.pdfTOP DUBAI AGENCY OFFERS EXPERT DIGITAL MARKETING SERVICES.pdf
TOP DUBAI AGENCY OFFERS EXPERT DIGITAL MARKETING SERVICES.pdf
 
How To Utilize Calculated Properties in your HubSpot Setup
How To Utilize Calculated Properties in your HubSpot SetupHow To Utilize Calculated Properties in your HubSpot Setup
How To Utilize Calculated Properties in your HubSpot Setup
 
Call Us ➥9654467111▻Call Girls In Delhi NCR
Call Us ➥9654467111▻Call Girls In Delhi NCRCall Us ➥9654467111▻Call Girls In Delhi NCR
Call Us ➥9654467111▻Call Girls In Delhi NCR
 
Beyond Resumes_ How Volunteering Shapes Career Trajectories by Kent Kubie
Beyond Resumes_ How Volunteering Shapes Career Trajectories by Kent KubieBeyond Resumes_ How Volunteering Shapes Career Trajectories by Kent Kubie
Beyond Resumes_ How Volunteering Shapes Career Trajectories by Kent Kubie
 
Netflix Ads The Game Changer in Video Ads – Who Needs YouTube.pptx (Chester Y...
Netflix Ads The Game Changer in Video Ads – Who Needs YouTube.pptx (Chester Y...Netflix Ads The Game Changer in Video Ads – Who Needs YouTube.pptx (Chester Y...
Netflix Ads The Game Changer in Video Ads – Who Needs YouTube.pptx (Chester Y...
 

1415 track 1 wu_using his laptop

  • 1. 10/24/2017. confidential. twitter: @mich8elwu, linkedin.com/in/MichaelWuPhD. the black art of machine learning. Michael Wu, PhD (@mich8elwu), chief scientist @ lithium tech, 2017.10.31 (second title slide dated 2017.09.28).
  • 2. THE POWER OF BIG DATA + DATA SCIENCE (big data + analytics). data → info → insight: buying calcium, zinc, magnesium, cotton balls, and switching to unscented lotions + soaps is a predictor of pregnancy. decision → action: coupons for moms, timed to specific stages of pregnancy. result: ↗ revenue, $44B (2002) → $67B (2010). “btw, did you know your daughter is pregnant?” THE POWER OF BIG DATA + DATA SCIENCE. data → info → insight: filling out a loan application with only capital or only lowercase letters is predictive of loan default. decision → action: augment the traditional underwriting regression model w/ thousands of variables & 10+ models. result: ↘ loan default rate by 40%, ↗ market share by 25%.
  • 3. DATA ≠ INFORMATION ≠ INSIGHT. data has a huge amount of statistical redundancy: duplication, spatial + temporal correlation, collinearity (causality). much of the info we extract from the data is not insightful. insights must be interpretable, relevant, and novel (not already known). big data that’s not statistically redundant = information; what’s not already known = insight.
  • 4. NOT ALL DATA/INFORMATION ARE RELEVANT. relevant data: signal vs. noise. relevance is context specific: who (one man’s signal is another man’s noise), when, where. what’s relevant is determined by the problem you are trying to solve or the question you are trying to answer. [diagram: data, information, insight; relevant to me, relevant to you, noise] [diagram: data, information, insight; relevant to me when I am traveling in Istanbul today; noise] context is usually specified in the problem/question.
  • 5. big data is very noisy. WHY IS BIG DATA SO NOISY? how did people use data before big data? [diagram: problem/question (Q), data collection, data] data is collected specifically to address the problem/question; data is almost always relevant.
  • 6. WHAT HAPPENS W/ BIG DATA TECHNOLOGIES? enables data capture/storage before we have a question. [diagram: data collection, problem/question (Q)] data is collected irrespective of any specific problem / question / purpose. must find the “relevant data” whenever we have a problem/question. most big data will be irrelevant (only a tiny % of it will be relevant). DATA ≠ INFORMATION ≠ INSIGHT. for all data (any data): data ≥ information ≥ insight. for big data: data >> information >> insight. “a single grain of rice can tip the scale”. “1 bit of insightful info. may be the difference between victory and defeat”.
  • 7. WHERE DO YOU LOOK IN YOUR BIG DATA TO FIND INSIGHTS? look beyond what’s relevant: look at what you thought were the irrelevant data/info. don’t look too far beyond your relevance boundary: it’s costly and wasteful, and it is hard to establish causality. you might not find anything, but when you do, it will be insightful (zest finance). [diagram: data, information, insight; relevant signal, noise] DO BIZ REALLY WANT BIG DATA? business needs: insight. big data tech.: hadoop, hive, hbase, pig, noSQL, impala, spark, storm … [diagram: big data, information, insight; huge gap]
  • 8. THE BIG DATA GAP: FROM DATA TO INSIGHTS. [diagram: big data, information, insight; ?] FROM DATA TO INSIGHTS. data scientist is currently the only way companies know how to fill this gap.
  • 9. so what do data scientists do? DATA SCIENCE INPUT + OUTPUT. [diagram: input, output]
  • 10. STEP 1: GET THE DATA. data scientist is ~50% data janitor. normalization: type, range, unit, format, foreign key ref … exception handling: spam, missing data, incomplete data … dedupe, metadata tagging … POS tagging, entity detection, sampling + sample selection, special handling for rich media … RAW DATA USUALLY DON’T PERFORM WELL. raw data: text, image, sound, video, directly measured data, etc. “Hello, how are you?” = 072 101 108 108 111 044 032 104 111 119 032 097 114 101 032 121 111 117 063. can a machine tell a bird from a plane? how?
  • 11. RAW DATA USUALLY DON’T PERFORM WELL. raw data: text, image, sound, video, directly measured data, etc. [plot: probability of bird (0) / plane (1) vs. any pixel’s color/intensity; ~50% / ~50%] the info in a pixel is not discriminating enough for this task.
  • 12. RAW DATA USUALLY DON’T PERFORM WELL. raw data: text, image, sound, video, directly measured data, etc. [plot: bird (0) and plane (1) points; axes: any pixel’s color/intensity vs. another pixel’s color/intensity] any pixel can be part of a bird or a plane. raw data are bad features.
  • 13. raw data are not only noisy and “dirty,” they are bad features! FEATURES AND FEATURE ENGINEERING. raw data: text, image, sound, video, directly measured data, etc. features: any information you derive from the raw data and make explicit to the learning algorithm. feature engineering: the extraction of implicit (or externally derived) information in (from) the raw data. online application: x = name, age, ID info, loan amount, income, spending + payment habit … [plot: normalized default rate vs. any raw data (name, age, loan amount, income); ~10% / ~10%] derived features: loan / income, debt / income, avg. monthly spent / income; frequency of late (or early) payment, income (or spent) volatility (stdev), married or not, have kids or not …; use of proper capitalization in the application, # saves before submitting, avg. time between saving (or opening) the application, date, time, day of week when filling the application …; hair color, eye color, height, weight …; where did they fill out the application, sunny or rainy when filling the application … [plot: normalized default rate vs. any feature] the info doesn’t even have to be in the raw data, it just has to be derivable.
  • 14. COMING UP WITH BETTER FEATURES. raw data: text, image, sound, video, directly measured data, etc. feature engineering. features: any information you derive from the raw data and make explicit to the learning algorithm. machine learning + statistics. model: obtained by optimizing some objective function (error, likelihood, etc.) + model validation. birds: have beaks, have eyes, have feet, have feathers … planes: have stabilizers, have engines, have windows … [plot: occurrence probability of beak vs. occurrence probability of stabilizer]
  • 15. data science is ~25% handcrafting … of features. “Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.” (Andrew Ng) hand crafted features are: domain specific, task specific, not generalizable.
  • 16. CAN WE LEARN “GOOD FEATURES” DIRECTLY FROM THE DATA? raw data: text, image, sound, video, directly measured data, etc. feature engineering. features: any information you derive from the raw data and make explicit to the learning algorithm. statistics. model: obtained by optimizing some objective function (error, likelihood, etc.) + model validation. birds: have beak, have eyes, have bird feet, have feathers … plane: have stabilizers, have engines, have windows …
  • 17. CAN WE LEARN “GOOD FEATURES” DIRECTLY FROM THE DATA? birds: have beak, have eyes, have bird feet, have feathers … plane: have stabilizers, have engines, have windows … pixels, edges, shapes. DEEP LEARNING. traditional machine learning: features handcrafted by experts; works for most (80%) of the problems in business. deep learning: deep neural network; features automatically learned from the data with different levels of abstraction. input, layer 1, layer 2, layer 3: combination of pixels → edges; combination of edges → object parts; combination of parts → the object. [figure: faces, cars, elephants, chairs; faces + cars + airplanes + motorbikes]
  • 18. DEEP LEARNING. traditional machine learning: features handcrafted by experts; works for most (80%) of the problems in business. deep learning: deep neural network; features automatically learned from the data with different levels of abstraction. google brain: 16,000 cpu, 1,000,000,000+ connections, 10,000,000 training images from youtube. extraordinarily generalizable: makes machines behave & think more like humans, but requires lots of data to train. success stories: computer vision (image labeling, search …), audio signal processing (speaker ID, speech recognition (speech-text) …), text processing (machine translation, etc.). interesting problems in the industry: sentiment analysis; actionability & intention prediction; fraud, spam detection … WHAT DO DATA SCIENTISTS DO? raw data, feature engineering, features, statistics, model. domain expertise. computer science. math + statistics. communication: data visualization, storytelling, translation of data to business insights, decisions, and action. plumbing, cleaning, janitoring, handcrafting.
  • 19. WHAT DO DATA SCIENTISTS DO? domain expertise. computer science. math + statistics. communication: data visualization, storytelling, translation of data to business insights, decisions, and action. plumbing, cleaning, janitoring, handcrafting. [diagram: data science; domain expertise, math + statistics, computer science] thank you, q&a, + follow me. twitter: @mich8elwu, linkedin.com/in/MichaelWuPhD.
  • 20. want to dig deeper? sos: http://pages.lithium.com/science-of-social sos2: http://www.lithium.com/library/science-of-social-2
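
The contrast the slides draw between raw data and engineered features can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the `Application` record, its field names, and the helper functions are hypothetical, and the three features are picked from the examples on the slides (loan / income, debt / income, use of proper capitalization). `raw_codes` reproduces the character-code view of “Hello, how are you?” from the raw-data slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Application:
    """Hypothetical raw loan-application record (field names are assumptions)."""
    name: str
    loan_amount: float
    income: float
    debt: float
    free_text: str  # text exactly as the applicant typed it


def raw_codes(text: str) -> list[int]:
    """Raw view of text: one character code per character."""
    return [ord(ch) for ch in text]


def uses_proper_capitalization(text: str) -> bool:
    """True if the text mixes upper and lower case letters.

    All-caps or all-lowercase text returns False, mirroring the
    'only capital or lower case letters' signal mentioned in the slides.
    """
    letters = [ch for ch in text if ch.isalpha()]
    has_upper = any(ch.isupper() for ch in letters)
    has_lower = any(ch.islower() for ch in letters)
    return has_upper and has_lower


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Safe ratio: None when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else None


def engineer_features(app: Application) -> dict:
    """Make implicit information explicit to a learning algorithm."""
    return {
        "loan_to_income": ratio(app.loan_amount, app.income),
        "debt_to_income": ratio(app.debt, app.income),
        "proper_capitalization": uses_proper_capitalization(app.free_text),
    }


if __name__ == "__main__":
    app = Application("A. Applicant", 20000.0, 50000.0, 10000.0, "I NEED A LOAN")
    print(raw_codes("Hello, how are you?"))
    print(engineer_features(app))
```

None of the derived values appear as a field in the raw record; they are computed from it, which is the point of the feature-engineering slides. Whether any of them is actually predictive is an empirical question that the sketch does not address.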