2. Ramesh Sampath
● Data Science Engineer
○ Builds some Machine Learning Models
○ Does a lot of Pre-Processing
○ Deploys them as API Services
@sampathweb (github / twitter / linkedin)
3. What’s the Problem
● Data Scientists Want to -
○ Build Models
○ Tune Models
○ Spend time in Algorithm Land
But real-world data is Messy, so most of the
time is spent in Features Land
4. Audience
● Built some ML Models with Scikit-Learn
● Familiar with Python
● Experienced pains of cleaning data
5. Agenda
● Data is Messy
● Preprocessing Options
● End to End Pipeline
7. ML is Easy (to get started)
1. Instantiate the Model: model = LogisticRegression()
2. Train the Model: model.fit(X_train, y_train)
3. Evaluate: model.score(X_test, y_test) / model.predict(X_test)
One Gotcha -
Data needs to be a numeric vector for Matrix Manipulation.
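The three steps above can be sketched end to end; the tiny synthetic dataset here is illustrative, not from the talk:

```python
# Minimal sketch of the three-step scikit-learn workflow on a
# purely numeric dataset (the "gotcha": X must already be numbers).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)                    # numeric feature matrix
y_train = (X_train[:, 0] > 0.5).astype(int)   # synthetic binary target
X_test = rng.rand(20, 3)
y_test = (X_test[:, 0] > 0.5).astype(int)

model = LogisticRegression()                  # 1. instantiate the model
model.fit(X_train, y_train)                   # 2. train the model
accuracy = model.score(X_test, y_test)        # 3a. evaluate
preds = model.predict(X_test)                 # 3b. predict
```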
11. Train
fit(X_train, y_train)
Build Model
Feature Union
Pipeline
Pclass, Sex, Embarked -
Dummy values
Age, Fare -
● Impute Missing values
● Standardize to zero mean
SibSp, Parch -
No transformation
Test
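The Feature Union layout above can be sketched with scikit-learn's Pipeline and FeatureUnion; the ColumnSelector helper and the column order are assumptions for illustration, not code from the talk:

```python
# Hedged sketch: route column groups through different transformations,
# as on the slide -- Age/Fare get standardized, SibSp/Parch pass through.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick a subset of columns from a 2-D array (illustrative helper)."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[:, self.cols]

# Assumed column order: 0=Age, 1=Fare, 2=SibSp, 3=Parch
union = FeatureUnion([
    ("numeric", Pipeline([                 # Age, Fare:
        ("select", ColumnSelector([0, 1])),
        ("scale", StandardScaler()),       # standardize to zero mean
    ])),
    ("counts", ColumnSelector([2, 3])),    # SibSp, Parch: no transformation
])

X = np.array([[22.0, 7.25, 1, 0],
              [38.0, 71.28, 1, 0],
              [26.0, 7.92, 0, 0]])
X_out = union.fit_transform(X)
```

The same union would sit in front of the model inside one Pipeline, so fit(X_train, y_train) runs preprocessing and training in a single call.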
12. Preprocessing
Column | Transformation Required | Scikit-Learn Methods
Pclass | Convert 1, 2, 3 to three one-hot columns | OneHotEncoder
Sex | Convert Male / Female to binary | LabelBinarizer
Age | Impute null values; standardize to zero mean | Imputer, StandardScaler
SibSp | Counts; no pre-processing required | -
Embarked | Impute null values (most common); encode embarkation stations to one-hot 1/0 values | Custom imputer, LabelBinarizer (LabelEncoder + OneHotEncoder)
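The Age row above chains two steps; a minimal sketch follows. Note the table's Imputer was renamed SimpleImputer in scikit-learn 0.20, which is what current versions ship:

```python
# Sketch of the Age column's preprocessing: impute nulls, then scale
# to zero mean. Uses SimpleImputer (the modern name for Imputer).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

age = np.array([[22.0], [38.0], [np.nan], [26.0]])  # one null value

age_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill the null
    ("scale", StandardScaler()),                    # zero mean, unit variance
])
age_out = age_pipe.fit_transform(age)
```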
19. One Problem
● OneHotEncoder requires ALL categorical columns to be converted to numeric first
○ Fixed in the next Scikit-Learn version, 0.19 (issue # 7327)
Categorical Encoders -
● DictVectorizer
● Label Encoder + OneHotEncoder
● Label Binarizer
20. Alternatives
● Preprocess in Pandas and convert to Numeric
● Create our own Custom Transformers
● Use SKLearn-Pandas
○ Original code by Ben Hamner (Kaggle CTO) and Paul Butler (Google NY), 2013
○ Recent version 1.2, Oct 2016
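The first alternative, preprocessing in pandas before handing scikit-learn a fully numeric frame, can be sketched as follows; the column names are illustrative Titanic-style columns, not the talk's exact code:

```python
# Sketch of the pandas alternative: impute and one-hot encode in pandas,
# so scikit-learn only ever sees numeric columns.
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, None, 26.0],          # one null value
})

df["Age"] = df["Age"].fillna(df["Age"].median())        # impute nulls
numeric = pd.get_dummies(df, columns=["Pclass", "Sex"])  # one-hot encode
```

The trade-off versus custom transformers or sklearn-pandas is that these steps live outside the Pipeline, so they must be reapplied to test data by hand.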