SlideShare a Scribd company logo
1 of 44
The How and Why of
Feature Engineering
Alice Zheng, Dato
March 29, 2016
Strata + Hadoop World, San Jose
1
2
My journey so far
Shortage of expertise and
good tools in the market.
Applied machine learning/
data science
Build ML tools
Write a book
3
Machine learning is great!
Model data.
Make predictions.
Build intelligent
applications.
Play chess and go!
4
The machine learning pipeline
I fell in love the instant I laid
my eyes on that puppy. His
big eyes and playful tail, his
soft furry paws, …
Raw data
Features
Models
Predictions
Deploy in
production
5
If machine learning were hairstyles
Images courtesy of “A visual history of ancient hairdos” and “An animated history of 20th century hairstyles.”
Models
Magnificent, ornate, high-maintenance
Feature engineering
Street smart, ad-hoc, hacky
6
Making sense of feature engineering
• Feature generation
• Feature cleaning and transformation
• How well do they work?
• Why?
Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
8
Representing natural text
It is a puppy and it is
extremely cute.
What’s important?
Phrases? Specific
words? Ordering?
Subject, object, verb?
Classify:
puppy or not?
Raw Text
{“it”:2,
“is”:2,
“a”:1,
“puppy”:1,
“and”:1,
“extremely”:1,
“cute”:1 }
Bag of Words
9
Representing natural text
It is a puppy and it is
extremely cute.
Classify:
puppy or not?
Raw Text Bag of Words
it 2
they 0
I 0
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
Sparse vector
representation
10
Representing images
Image source: “Recognizing and learning object categories,”
Li Fei-Fei, Rob Fergus, Anthony Torralba, ICCV 2005—2009.
Raw image:
millions of RGB triplets,
one for each pixel
Classify:
person or animal?
Raw Image Bag of Visual Words
11
Representing images
Classify:
person or animal?
Raw Image Deep learning features
3.29
-15
-5.24
48.3
1.36
47.1
-
1.92
36.5
2.83
95.4
-19
-89
5.09
37.8
Dense vector
representation
12
Representing audio
Raw Audio
Spectrogram
features
Classify:
Music or voice?
Type of instrument
t=0 t=1 t=2
6.1917 -0.3411 1.2418
0.2205 0.0214 0.4503
1.0423 0.2214 -1.0017
-0.2340 -0.0392 -0.2617
0.2750 0.0226 0.1229
0.0653 0.0428 -0.4721
0.3169 0.0541 -0.1033
-0.2970 -0.0627 0.1960
Time series of
dense vectors
13
Feature generation for audio, image, text
I fell in love the instant I
laid my eyes on that
puppy. His big eyes and
playful tail, his soft furry
paws, …
“Human native” Conceptually abstract
Low Semantic content in data High
Higher Difficulty of feature generation Lower
Feature Cleaning and Transformation
15
Auto-generated features are noisy
Rank Word Doc Count Rank Word Doc Count
1 the 1,416,058 11 was 929,703
2 and 1,381,324 12 this 844,824
3 a 1,263,126 13 but 822,313
4 i 1,230,214 14 my 786,595
5 to 1,196,238 15 that 777,045
6 it 1,027,835 16 with 775,044
7 of 1,025,638 17 on 735,419
8 for 993,430 18 they 720,994
9 is 988,547 19 you 701,015
10 in 961,518 20 have 692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
16
Auto-generated features are noisy
Rank Word Doc Count Rank Word Doc Count
357,480 cmtk8xyqg 1 357,470 attractif 1
357,479 tangified 1 357,469 chappagetti 1
357,478 laaaaaaasts 1 357,468 herdy 1
357,477 bailouts 1 357,467 csmpus 1
357,476 feautred 1 357,466 costoso 1
357,475 résine 1 357,465 freebased 1
357,474 chilyl 1 357,464 tikme 1
357,473 cariottis 1 357,463 traditionresort 1
357,472 enfeebled 1 357,462 jallisco 1
357,471 sparklely 1 357,461 zoawan 1
Least popular words in Yelp reviews dataset (~ 6M reviews).
17
Feature cleaning
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
18
Feature cleaning
• Frequency-based pruning
19
Stopwords vs. frequency filters
No training required
Stopwords Frequency filters
Can be exhaustive
Inflexible
Adapts to data
Also deals with rare words
Needs tuning, hard to control
Both require manual attention
20
Tf-Idf: Automatic “soft” filter
• Tf-idf = term frequency x inverse document
frequency
• Tf = Number of times a terms appears in a
document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
21
Visualizing bag-of-words
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
I have a dog
and I have a pen
1
22
Visualizing tf-idf
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
I have a dog
and I have a pen
1
23
Visualizing tf-idf
puppy
cat1
have
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
I have a dog
and I have a pen,
I have a kitten
1
log 4
log 4
I have a cat
I have a puppy
24
Algebraically, tf-idf = column scaling
w1 w2 w3 w4 … wM
d1
d2
d3
d4
…
dN
idf = log (N/L0 norm of word column)
25
w1 w2 w3 w4 … wM
d1
d2
d3
d4
…
dN
Algebraically, tf-idf = column scaling
Multiply word column with scalar (idf of word)
26
w1 w2 w3 w4 … wM
d1
d2
d3
d4
…
dN
Algebraically, tf-idf = column scaling
Multiply word column with scalar (idf of word)
27
Algebraically, tf-idf = column scaling
w1 w2 w3 w4 … wM
d1
d2
d3
d4
…
dN
Multiply word column with scalar (idf of word)
28
Other types of column scaling
• L2 scaling = divide column by L2 norm
How well do they work?
30
Classify reviews using logistic regression
• Classify business category of Yelp reviews
• Bag-of-words vs. L2 normalization vs. tf-idf
• Model: logistic regression
31
Observations
• l2 regularization made no difference (with proper tuning)
• L2 normalization made no difference on accuracy
• Tf-idf did better, but barely
• But they are both column scaling methods! Why the
difference?
A Peek Under the Hood
33
Linear classification
Feature 2
Feature 1
Find the best line to separate two classes
Algebraically–solve linear systems
Data matrix
Weightvector
Labels
How a matrix works
Any matrix
Left
singular
vectors
Singular values Right
singular
vectors
How a matrix works
Any matrix
Project
ScaleProject
How a matrix works
Null space
Singular value = 0
Null space = part of the input space that is
squashed by the matrix
Column space
Singular value ≠ 0
Column space = the non-zero part of the
output space
Effect of column scaling
Scaled columns
Effect of column scaling
Scaled columns Singular values change
(but zeros stay zero)
Singular vectors may also change
42
Effect of column scaling
• Changes the singular values and vectors, but not the rank
of the null space or column space
• … unless the scaling factor is zero
- Could only happen with tf-idf
• L2 scaling improves the condition number (therefore the
solver converges faster)
43
Mystery resolved
• Tf-idf can emphasize some columns while zeroing out
others—the uninformative features
• L2 normalization makes all features equal in “size”
- Improves the condition number of the matrix
- Solver converges faster
44
Take-away points
• Many tricks for feature generation and transformation
• Features interact with models, making their effects difficult
to predict
• But so much fun to play with!
• New book coming out: Mastering feature engineering
- More tricks, intuition, analysis
@RainyData

More Related Content

What's hot

Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science CompetitionsJeong-Yoon Lee
 
Understanding Feature Space in Machine Learning
Understanding Feature Space in Machine LearningUnderstanding Feature Space in Machine Learning
Understanding Feature Space in Machine LearningAlice Zheng
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization MachinesPavel Kalaidin
 
Beyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingBeyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingPierre Gutierrez
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark mldatamantra
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineSrivatsan Srinivasan
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Concept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsConcept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsLightbend
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsAmy Hodler
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Learning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseLearning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseHasan H Topcu
 
Feature selection
Feature selectionFeature selection
Feature selectiondkpawar
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 

What's hot (20)

Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Understanding Feature Space in Machine Learning
Understanding Feature Space in Machine LearningUnderstanding Feature Space in Machine Learning
Understanding Feature Space in Machine Learning
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization Machines
 
Beyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingBeyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modeling
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning Pipeline
 
Shap
ShapShap
Shap
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Concept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsConcept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML Applications
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Learning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseLearning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwise
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 

Similar to The How and Why of Feature Engineering

Feature engineering for diverse data types
Feature engineering for diverse data typesFeature engineering for diverse data types
Feature engineering for diverse data typesAlice Zheng
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation enginelucenerevolution
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLPSatyam Saxena
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLPAnuj Gupta
 
Replication in Data Science - A Dance Between Data Science & Machine Learning...
Replication in Data Science - A Dance Between Data Science & Machine Learning...Replication in Data Science - A Dance Between Data Science & Machine Learning...
Replication in Data Science - A Dance Between Data Science & Machine Learning...June Andrews
 
Large Components in the Rearview Mirror
Large Components in the Rearview MirrorLarge Components in the Rearview Mirror
Large Components in the Rearview MirrorMichelle Brush
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluationkrisztianbalog
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Miningebelani
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big FunHilary Mason
 
Intuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondIntuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondC4Media
 
Dialogare con agenti artificiali
Dialogare con agenti artificiali  Dialogare con agenti artificiali
Dialogare con agenti artificiali Agnese Augello
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data ScientistNarong Intiruk
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 
¿Qué es real? Cuando la IA intenta engañar al ojo humano
¿Qué es real? Cuando la IA intenta engañar al ojo humano¿Qué es real? Cuando la IA intenta engañar al ojo humano
¿Qué es real? Cuando la IA intenta engañar al ojo humanoPlain Concepts
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Object Oriented Paradigm
Object Oriented ParadigmObject Oriented Paradigm
Object Oriented ParadigmHüseyin Ergin
 
Deep learning introduction
Deep learning introductionDeep learning introduction
Deep learning introductionAdwait Bhave
 
Facets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationFacets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationRoberto García
 

Similar to The How and Why of Feature Engineering (20)

Feature engineering for diverse data types
Feature engineering for diverse data typesFeature engineering for diverse data types
Feature engineering for diverse data types
 
Ml3
Ml3Ml3
Ml3
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
Replication in Data Science - A Dance Between Data Science & Machine Learning...
Replication in Data Science - A Dance Between Data Science & Machine Learning...Replication in Data Science - A Dance Between Data Science & Machine Learning...
Replication in Data Science - A Dance Between Data Science & Machine Learning...
 
Large Components in the Rearview Mirror
Large Components in the Rearview MirrorLarge Components in the Rearview Mirror
Large Components in the Rearview Mirror
 
On Entities and Evaluation
On Entities and EvaluationOn Entities and Evaluation
On Entities and Evaluation
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Mining
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big Fun
 
Intuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondIntuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyond
 
Dialogare con agenti artificiali
Dialogare con agenti artificiali  Dialogare con agenti artificiali
Dialogare con agenti artificiali
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data Scientist
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
¿Qué es real? Cuando la IA intenta engañar al ojo humano
¿Qué es real? Cuando la IA intenta engañar al ojo humano¿Qué es real? Cuando la IA intenta engañar al ojo humano
¿Qué es real? Cuando la IA intenta engañar al ojo humano
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Object Oriented Paradigm
Object Oriented ParadigmObject Oriented Paradigm
Object Oriented Paradigm
 
Deep learning introduction
Deep learning introductionDeep learning introduction
Deep learning introduction
 
Facets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data ExplorationFacets and Pivoting for Flexible and Usable Linked Data Exploration
Facets and Pivoting for Flexible and Usable Linked Data Exploration
 

Recently uploaded

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 

Recently uploaded (20)

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 

The How and Why of Feature Engineering

  • 1. The How and Why of Feature Engineering Alice Zheng, Dato March 29, 2016 Strata + Hadoop World, San Jose 1
  • 2. 2 My journey so far Shortage of expertise and good tools in the market. Applied machine learning/ data science Build ML tools Write a book
  • 3. 3 Machine learning is great! Model data. Make predictions. Build intelligent applications. Play chess and go!
  • 4. 4 The machine learning pipeline I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, … Raw data Features Models Predictions Deploy in production
  • 5. 5 If machine learning were hairstyles Images courtesy of “A visual history of ancient hairdos” and “An animated history of 20th century hairstyles.” Models Magnificent, ornate, high-maintenance Feature engineering Street smart, ad-hoc, hacky
  • 6. 6 Making sense of feature engineering • Feature generation • Feature cleaning and transformation • How well do they work? • Why?
  • 7. Feature Generation Feature: An individual measurable property of a phenomenon being observed. ⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
  • 8. 8 Representing natural text It is a puppy and it is extremely cute. What’s important? Phrases? Specific words? Ordering? Subject, object, verb? Classify: puppy or not? Raw Text {“it”:2, “is”:2, “a”:1, “puppy”:1, “and”:1, “extremely”:1, “cute”:1 } Bag of Words
  • 9. 9 Representing natural text It is a puppy and it is extremely cute. Classify: puppy or not? Raw Text Bag of Words it 2 they 0 I 0 am 0 how 0 puppy 1 and 1 cat 0 aardvark 0 cute 1 extremely 1 … … Sparse vector representation
  • 10. 10 Representing images Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Anthony Torralba, ICCV 2005—2009. Raw image: millions of RGB triplets, one for each pixel Classify: person or animal? Raw Image Bag of Visual Words
  • 11. 11 Representing images Classify: person or animal? Raw Image Deep learning features 3.29 -15 -5.24 48.3 1.36 47.1 - 1.92 36.5 2.83 95.4 -19 -89 5.09 37.8 Dense vector representation
  • 12. 12 Representing audio Raw Audio Spectrogram features Classify: Music or voice? Type of instrument t=0 t=1 t=2 6.1917 -0.3411 1.2418 0.2205 0.0214 0.4503 1.0423 0.2214 -1.0017 -0.2340 -0.0392 -0.2617 0.2750 0.0226 0.1229 0.0653 0.0428 -0.4721 0.3169 0.0541 -0.1033 -0.2970 -0.0627 0.1960 Time series of dense vectors
  • 13. 13 Feature generation for audio, image, text I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, … “Human native” Conceptually abstract Low Semantic content in data High Higher Difficulty of feature generation Lower
  • 14. Feature Cleaning and Transformation
  • 15. 15 Auto-generated features are noisy Rank Word Doc Count Rank Word Doc Count 1 the 1,416,058 11 was 929,703 2 and 1,381,324 12 this 844,824 3 a 1,263,126 13 but 822,313 4 i 1,230,214 14 my 786,595 5 to 1,196,238 15 that 777,045 6 it 1,027,835 16 with 775,044 7 of 1,025,638 17 on 735,419 8 for 993,430 18 they 720,994 9 is 988,547 19 you 701,015 10 in 961,518 20 have 692,749 Most popular words in Yelp reviews dataset (~ 6M reviews).
  • 16. 16 Auto-generated features are noisy Rank Word Doc Count Rank Word Doc Count 357,480 cmtk8xyqg 1 357,470 attractif 1 357,479 tangified 1 357,469 chappagetti 1 357,478 laaaaaaasts 1 357,468 herdy 1 357,477 bailouts 1 357,467 csmpus 1 357,476 feautred 1 357,466 costoso 1 357,475 résine 1 357,465 freebased 1 357,474 chilyl 1 357,464 tikme 1 357,473 cariottis 1 357,463 traditionresort 1 357,472 enfeebled 1 357,462 jallisco 1 357,471 sparklely 1 357,461 zoawan 1 Least popular words in Yelp reviews dataset (~ 6M reviews).
  • 17. 17 Feature cleaning • Popular words and rare words are not helpful • Manually defined blacklist – stopwords a b c d e f g h i able be came definitely each far get had ie about became can described edu few gets happens if above because cannot despite eg fifth getting hardly ignored according become cant did eight first given has immediately accordingly becomes cause different either five gives have in across becoming causes do else followed go having inasmuch … … … … … … … … …
  • 19. 19 Stopwords vs. frequency filters No training required Stopwords Frequency filters Can be exhaustive Inflexible Adapts to data Also deals with rare words Needs tuning, hard to control Both require manual attention
  • 20. 20 Tf-Idf: Automatic “soft” filter • Tf-idf = term frequency x inverse document frequency • Tf = Number of times a terms appears in a document • Idf = log(# total docs / # docs containing word w) • Large for uncommon words, small for popular words • Discounts popular words, highlights rare words
  • 21. 21 Visualizing bag-of-words puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten I have a dog and I have a pen 1
  • 22. 22 Visualizing tf-idf puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten idf(puppy) = log 4 idf(cat) = log 4 idf(have) = log 1 = 0 I have a dog and I have a pen 1
  • 23. 23 Visualizing tf-idf puppy cat1 have tfidf(puppy) = log 4 tfidf(cat) = log 4 tfidf(have) = 0 I have a dog and I have a pen, I have a kitten 1 log 4 log 4 I have a cat I have a puppy
  • 24. 24 Algebraically, tf-idf = column scaling w1 w2 w3 w4 … wM d1 d2 d3 d4 … dN idf = log (N/L0 norm of word column)
  • 25. 25 w1 w2 w3 w4 … wM d1 d2 d3 d4 … dN Algebraically, tf-idf = column scaling Multiply word column with scalar (idf of word)
  • 26. 26 w1 w2 w3 w4 … wM d1 d2 d3 d4 … dN Algebraically, tf-idf = column scaling Multiply word column with scalar (idf of word)
  • 27. 27 Algebraically, tf-idf = column scaling w1 w2 w3 w4 … wM d1 d2 d3 d4 … dN Multiply word column with scalar (idf of word)
  • 28. 28 Other types of column scaling • L2 scaling = divide column by L2 norm
  • 29. How well do they work?
  • 30. 30 Classify reviews using logistic regression • Classify business category of Yelp reviews • Bag-of-words vs. L2 normalization vs. tf-idf • Model: logistic regression
  • 31. 31 Observations • l2 regularization made no difference (with proper tuning) • L2 normalization made no difference on accuracy • Tf-idf did better, but barely • But they are both column scaling methods! Why the difference?
  • 32. A Peek Under the Hood
  • 33. 33 Linear classification Feature 2 Feature 1 Find the best line to separate two classes
  • 34. Algebraically–solve linear systems Data matrix Weightvector Labels
  • 35. How a matrix works Any matrix Left singular vectors Singular values Right singular vectors
  • 36. How a matrix works Any matrix Project ScaleProject
  • 37. How a matrix works
  • 38. Null space Singular value = 0 Null space = part of the input space that is squashed by the matrix
  • 39. Column space Singular value ≠ 0 Column space = the non-zero part of the output space
  • 40. Effect of column scaling Scaled columns
  • 41. Effect of column scaling Scaled columns Singular values change (but zeros stay zero) Singular vectors may also change
  • 42. 42 Effect of column scaling • Changes the singular values and vectors, but not the rank of the null space or column space • … unless the scaling factor is zero - Could only happen with tf-idf • L2 scaling improves the condition number (therefore the solver converges faster)
  • 43. 43 Mystery resolved • Tf-idf can emphasize some columns while zeroing out others—the uninformative features • L2 normalization makes all features equal in “size” - Improves the condition number of the matrix - Solver converges faster
  • 44. 44 Take-away points • Many tricks for feature generation and transformation • Features interact with models, making their effects difficult to predict • But so much fun to play with! • New book coming out: Mastering feature engineering - More tricks, intuition, analysis @RainyData

Editor's Notes

  1. Features sit between raw data and model. They can make or break an application.