Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
BigML, Inc 2
Basic Transformations
Making Data Machine Learning Ready
Poul Petersen
CIO, BigML, Inc
In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…

The Dream
CSV → Dataset → Model → Profit!
The Reality
CRM
Web Accounts
Transactions
ML Ready?

Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Addressed, respectively, by: Data Transformation, Feature Engineering, Feature Selection

The Process
• Define a clear idea of the goal.
  • Sometimes this comes later…
• Understand what ML tasks will achieve the goal.
• Transform the data
  • where is it, how is it stored?
  • what are the features?
  • can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: Try it on a small scale
• Accept that you might have to start over…
• But when it works, automate it!!!!
Data Transformations

BigML Tasks
Goal | ML Task
Will this customer default on a loan? | Classification
How many customers will apply for a loan next month? | Regression
Is the consumption of this product unusual? | Anomaly Detection
Is the behavior of the customers similar? | Cluster Analysis
Are these products purchased together? | Association Discovery

Classification
[Figure labels: Categorical; Training, Testing, Predicting]

Regression
[Figure labels: Numeric; Training, Testing, Predicting]

Anomaly Detection

Cluster Analysis

Association Discovery

ML Ready Data
[Figure: a table whose rows are Instances and whose columns are Fields (Features)]
Tabular Data (rows and columns):
• Each row
  • is one instance.
  • contains all the information about that one instance.
• Each column
  • is a field that describes a property of the instance.

Data Labeling
Unsupervised Learning
• Anomaly Detection
• Clustering
• Association Discovery
Supervised Learning
• Classification
• Regression
The only difference, in terms of ML-Ready structure, is the presence of a "label".

Data Labeling
Data is often not labeled. Create labels with a transformation.

Unlabelled data
Name | Month - 3 | Month - 2 | Month - 1
Joe Schmo | 123.23 | 0 | 0
Jane Plain | 0 | 0 | 0
Mary Happy | 0 | 55.22 | 243.33
Tom Thumb | 12.34 | 8.34 | 14.56

Labelled data
Name | Month - 3 | Month - 2 | Month - 1 | Default
Joe Schmo | 123.23 | 0 | 0 | FALSE
Jane Plain | 0 | 0 | 0 | TRUE
Mary Happy | 0 | 55.22 | 243.33 | FALSE
Tom Thumb | 12.34 | 8.34 | 14.56 | FALSE

This can be done at the Feature Engineering step as well.
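
The labeling transformation in the tables above can be sketched in a few lines of Python. This is only an illustration: it assumes that "default" means all three monthly payments are zero (which is what the labelled table suggests), and the names `payments` and `is_default` are made up for the example.

```python
# Payments for the last three months, copied from the unlabelled table.
payments = {
    "Joe Schmo": [123.23, 0, 0],
    "Jane Plain": [0, 0, 0],
    "Mary Happy": [0, 55.22, 243.33],
    "Tom Thumb": [12.34, 8.34, 14.56],
}

def is_default(months):
    """Assumed rule: a customer defaults when every monthly payment is zero."""
    return all(amount == 0 for amount in months)

# The new "Default" column, keyed by customer name.
labels = {name: is_default(months) for name, months in payments.items()}
```

Only Jane Plain missed all three payments, so she is the only row labelled as a default.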

SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
create database sf_restaurants;
use sf_restaurants;
create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table violations (business_id int, vdate varchar(8), description varchar(1000));
load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
Transformations Demo #1

Data Cleaning
Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

Original data
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | - | Rock | 139
Blues alive | 1990/03/01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 5:32s | Techno | 42
Dance, dance | 02/23/1983 | 312 | Disco | N/A
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 4 minutes | Techno | 895
The alchemist | 2001-11-21 | 418 | Bluesss | 178
Bring me down | 18-10-98 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734

Cleaned data
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | | Rock | 139
Blues alive | 1990-03-01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 332 | Techno | 42
Dance, dance | 1983-02-23 | 312 | Disco |
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 240 | Techno | 895
The alchemist | 2001-11-21 | 418 | Blues | 178
Bring me down | 1998-10-18 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734

update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
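
As a rough illustration of the date and duration fixes shown in the cleaned table, the Python sketch below normalizes both columns. The list of accepted date layouts and the helper names `clean_date` and `clean_duration` are assumptions made for this example; real data usually needs a project-specific rule for ambiguous dates.

```python
import re
from datetime import datetime

# Date layouts seen in the example table (an assumption, not an exhaustive list).
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y"]

def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def clean_duration(raw):
    """Return the duration in whole seconds, or None for missing values."""
    raw = raw.strip().lower()
    if raw in {"", "-", "n/a"}:
        return None
    match = re.fullmatch(r"(\d+):(\d+)s?", raw)      # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", raw)   # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return int(raw) if raw.isdigit() else None
```

Returning `None` for unparseable values keeps missing data explicit instead of silently guessing.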
Transformations Demo #2

Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Based on business profile:
  • Description: kitchen, cafe, etc.
  • Location: zip, latitude, longitude

Denormalizing
Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single dataset.
[Figure labels: business, inspections, violations, scores; join; Instances (millions) by Features]

Denormalize:
create table scores select * from businesses left join inspections using (business_id);
create table scores_last select a.* from scores as a JOIN (select business_id,max(idate) as idate from scores group by business_id) as b where a.business_id=b.business_id and a.idate=b.idate;
ML-Ready: Each row contains all the information about that one instance.

Add Label:
create table scores_last_label select scores_last.*, Description as score_label from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;

Transformations Demo #3
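
The denormalizing and labeling steps from the previous slide can be tried without a MySQL server. The sketch below uses Python's built-in `sqlite3` module with a few made-up rows: the business names, scores, dates, and the score ranges in `legend` are all hypothetical, and only the rating names come from the slides. It joins each business to its most recent inspection and then to the legend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE businesses (business_id INT, name TEXT);
    CREATE TABLE inspections (business_id INT, score INT, idate TEXT);
    CREATE TABLE legend (Minimum_Score INT, Maximum_Score INT, Description TEXT);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO businesses VALUES (1, 'Cafe A'), (2, 'Kitchen B');
    INSERT INTO inspections VALUES
        (1, 85, '20130110'), (1, 95, '20140305'), (2, 65, '20131201');
    INSERT INTO legend VALUES
        (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
        (86, 90, 'Adequate'), (91, 100, 'Good');
""")

# One row per business: latest inspection plus its textual label.
rows = conn.execute("""
    SELECT b.name, i.score, l.Description AS score_label
    FROM businesses AS b
    JOIN inspections AS i ON i.business_id = b.business_id
    JOIN (SELECT business_id, MAX(idate) AS idate
          FROM inspections GROUP BY business_id) AS last
      ON last.business_id = i.business_id AND last.idate = i.idate
    JOIN legend AS l
      ON i.score BETWEEN l.Minimum_Score AND l.Maximum_Score
    ORDER BY b.name
""").fetchall()
```

Because `idate` is stored as a `YYYYMMDD` string, `MAX(idate)` picks the most recent inspection, which is the same trick the MySQL version relies on.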

Structuring Output
After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
• Each row must contain the same number of fields
  • but they can be null
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
select * from scores_last_label into outfile "./scores_last_label.csv";
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, score_label from scores_last_label into outfile "./scores_last_label_headers.csv";
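
For comparison, here is a small Python sketch of the quoting rules listed above, using the standard `csv` module. The rows are invented for the example; the point is that the writer quotes fields containing commas, doubles embedded quotes, and writes a missing value as an empty field.

```python
import csv
import io

header = ["name", "address", "score_label"]
rows = [
    # Hypothetical rows: one with a comma and embedded quotes, one with a null.
    ["Joe's \"Famous\" Diner", "1 Main St, Suite 2", "Good"],
    ["Cafe A", None, "Poor"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(header)   # optional first line with the field names
writer.writerows(rows)
csv_text = buffer.getvalue()

# To write a real file, open it with encoding="utf-8" and newline="" and
# pass the file object to csv.writer instead of the in-memory buffer.
```

Reading `csv_text` back with `csv.reader` returns the original field values, which is a quick way to check that a generated file obeys the rules.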
Transformations Demo #4

Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations

Aggregating
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.

Original data (list of playbacks)
Content | Genre | Duration | Play Time | User | Device
Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV
Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV
Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet
The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone
Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet
The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone

Aggregated data (list of users)
User | Num.Playbacks | Total Time | Pref.Device
User001 | 3 | 830 | Tablet
User002 | 1 | 218 | Smartphone
User003 | 3 | 1019 | TV
User005 | 2 | 521 | Tablet

create table violations_aggregated select business_id,count(*) as violation_num,group_concat(description) as violation_txt from violations group by business_id;
create table scores_last_label_violations select * from scores_last_label left join violations_aggregated USING (business_id);
tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
SET @@group_concat_max_len = 15000
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'violation_num', 'violation_txt', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, violation_num, violation_txt, score_label from scores_last_label_violations into outfile "./scores_last_label_violations_headers.csv";
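
The playback-to-user aggregation shown in the tables above can also be sketched in plain Python. The function name `aggregate_by_user` and the output keys are illustrative choices; "preferred device" is assumed here to mean the device with the most playbacks.

```python
from collections import Counter, defaultdict

# (content, duration in seconds, user, device) from the playback table.
playbacks = [
    ("Highway star", 190, "User001", "TV"),
    ("Blues alive", 281, "User005", "Tablet"),
    ("Lonely planet", 332, "User003", "TV"),
    ("Dance, dance", 312, "User001", "Tablet"),
    ("The wall", 218, "User002", "Smartphone"),
    ("Offside down", 240, "User005", "Tablet"),
    ("The alchemist", 418, "User003", "TV"),
    ("Bring me down", 328, "User001", "Tablet"),
    ("The scarecrow", 269, "User003", "Smartphone"),
]

def aggregate_by_user(rows):
    """Collapse one row per playback into one row per user."""
    devices = defaultdict(Counter)   # user -> playbacks per device
    total_time = Counter()           # user -> summed duration
    for _content, duration, user, device in rows:
        devices[user][device] += 1
        total_time[user] += duration
    return {
        user: {
            "num_playbacks": sum(counts.values()),
            "total_time": total_time[user],
            "pref_device": counts.most_common(1)[0][0],
        }
        for user, counts in devices.items()
    }

users = aggregate_by_user(playbacks)
```

The result has one entry per user, matching the aggregated table row for row.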
Transformations Demo #5

Pivoting
Different values of a feature are pivoted to new columns in the result dataset.

Original data
Content | Genre | Duration | Play Time | User | Device
Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV
Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV
Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet
The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone
Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet
The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone

Aggregated data with pivoted columns
User | Num.Playbacks | Total Time | Pref.Device | NP_TV | NP_Tablet | NP_Smartphone | TT_TV | TT_Tablet | TT_Smartphone
User001 | 3 | 830 | Tablet | 1 | 2 | 0 | 190 | 640 | 0
User002 | 1 | 218 | Smartphone | 0 | 0 | 1 | 0 | 0 | 218
User003 | 3 | 1019 | TV | 2 | 0 | 1 | 750 | 0 | 269
User005 | 2 | 521 | Tablet | 0 | 2 | 0 | 0 | 521 | 0
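
A minimal Python sketch of the pivot above: each device value becomes a pair of per-user columns, `NP_<device>` for the number of playbacks and `TT_<device>` for the total time. The function name `pivot_by_device` is illustrative, and the device list is assumed to be known in advance.

```python
DEVICES = ["TV", "Tablet", "Smartphone"]

# (duration in seconds, user, device) from the playback table.
playbacks = [
    (190, "User001", "TV"), (281, "User005", "Tablet"),
    (332, "User003", "TV"), (312, "User001", "Tablet"),
    (218, "User002", "Smartphone"), (240, "User005", "Tablet"),
    (418, "User003", "TV"), (328, "User001", "Tablet"),
    (269, "User003", "Smartphone"),
]

def pivot_by_device(rows):
    """Return one row per user with a count and a time column per device."""
    table = {}
    for duration, user, device in rows:
        # Start every user with zeros so absent devices still get a column.
        row = table.setdefault(
            user, {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in DEVICES}
        )
        row[f"NP_{device}"] += 1
        row[f"TT_{device}"] += duration
    return table

pivoted = pivot_by_device(playbacks)
```

Initializing every column to zero matters: a model needs the same set of fields on every row, even for users who never used a given device.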

Time Windows
Create new features using values over different periods of time
[Figure labels: Instances by Features tables at t=1, t=2, t=3 along a Time axis; (millions), (thousands)]
create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013 from inspections as a JOIN ( select business_id, max(idate) as idate from inspections where substr(idate,1,4) = "2013" group by business_id) as b where a.business_id = b.business_id and a.idate = b.idate;
create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
Transformations Demo #6

Updates
Need a current view of the data, but new data only comes in batches of changes
[Figure labels: day 1, day 2, day 3; Instances by Features]

Streaming
Data only comes in single changes
[Figure labels: data stream; Stream; Batch (Kafka, etc); Instances by Features]

Prosper Loan Life Cycle
[Figure: life-cycle states. Listings: Submit, Bids, Funded, Cancelled, Withdraw, Expired. Loans: Current, Late, Paid, Defaulted]
Goal | ML Task
Q: Which new listings make it to funded? | Classification
Q: Which funded loans make it to paid? | Classification
Q: If funded, what will be the rate? | Regression

Prosper Example
Data provided in XML updates!!
[Figure labels: daily pipeline with fetch.sh ("curl"), XML, import.py, export.sh, bigml.sh; Model, Predict, Share in gallery; fields Status, LoanStatus, BorrowerRate; Denormalization with join]

Prosper Example
Tidbits and Lessons Learned…
• XML… yuck!
• MongoDB has CSV export and is record based so it is easy to handle changing data structure.
• Feature Engineering
  • There are 5 different classes of "bad" loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML… yuck!

Tools
• Command Line?
  • join, cut, awk, sed, sort, uniq
• Automation
  • Shell, Python, crontab, etc
  • Talend
  • BigML: bindings, bigmler, API, WhizzML
• Relational DB
  • MySQL
• Non-Relational DB
  • MongoDB

Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example

Summary
• Data is awful
  • Requires clean-up
  • Transformations
  • Consumes an enormous part of the effort in applying ML
• Techniques:
  • Denormalizing
  • Aggregating / Pivoting
  • Time windows / Streaming
• What a real Workflow looks like and the tools required

Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc

What is Feature Engineering?
Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
• This is really, really important - more than algorithm selection!
  • In fact, so important that BigML often does it automatically
• ML Algorithms have no deeper understanding of data
  • Numerical: have a natural order, can be scaled, etc
  • Categorical: have discrete values, etc.
  • The "magic" is the ability to find patterns quickly and efficiently
• ML Algorithms only know what you tell/show them with data
  • Medical: kg and m, but BMI = kg/m² is better
  • Lending: Debt and Income, but DTI (debt-to-income ratio) is better
  • Intuition can be risky: remember to prove it with an evaluation!

Built-in Transformations
Date-Time Fields
Input (DATE-TIME): 2013-09-25 10:02
… | year | month | day | hour | minute | …
… | 2013 | Sep | 25 | 10 | 2 | …
… | NUM | CAT | NUM | NUM | NUM | …
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
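
To make the date-time expansion concrete, here is a small Python sketch that splits a timestamp into the components shown above. It is not BigML's implementation; the function name `split_datetime` and the input format string are assumptions for this example.

```python
from datetime import datetime

# Explicit month names avoid depending on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def split_datetime(raw, fmt="%Y-%m-%d %H:%M"):
    """Expand one date-time string into separate numeric/categorical fields."""
    moment = datetime.strptime(raw, fmt)
    return {
        "year": moment.year,                 # numeric
        "month": MONTHS[moment.month - 1],   # categorical
        "day": moment.day,                   # numeric
        "hour": moment.hour,                 # numeric
        "minute": moment.minute,             # numeric
    }

parts = split_datetime("2013-09-25 10:02")
```

Each component becomes its own field, so a model can pick up, say, an hour-of-day effect that is invisible in the raw timestamp.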

Built-in Transformations
Categorical Fields for Clustering/LR
Input (CAT):
… | alchemy_category | …
… | business | …
… | recreation | …
… | health | …
Output (NUM, NUM, NUM):
… | business | health | recreation | …
… | 1 | 0 | 0 | …
… | 0 | 0 | 1 | …
… | 0 | 1 | 0 | …
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
*Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
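
The table above is a one-hot encoding. A minimal Python sketch of that idea follows; it only illustrates the encoding and is not a description of what BigML does internally (as the note says, clustering uses k-prototypes and the LR encoding is configurable). The function name `one_hot` is made up for the example.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 vector per input value."""
    categories = sorted(set(values))
    vectors = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, vectors

columns, encoded = one_hot(["business", "recreation", "health"])
```

Here `columns` holds the new field names in alphabetical order and `encoded` holds one row per instance, reproducing the output table.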

Built-in Transformations
Text Fields
Input (TEXT): Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
… | great | afraid | born | achieve | …
… | 4 | 1 | 1 | 1 | …
… | NUM | NUM | NUM | NUM | …
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
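
A toy bag-of-words sketch in Python shows where the counts in the table come from. The stop-word list and the `crude_stem` helper (which only strips a trailing "ness") are deliberate simplifications invented for this example; real text analysis uses proper tokenizers and stemmers.

```python
import re
from collections import Counter

# Tiny, hand-picked stop-word list (an assumption for this example).
STOPWORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}

def crude_stem(token):
    """Toy stemmer: map 'greatness' to 'great' by dropping a trailing 'ness'."""
    return token[:-4] if token.endswith("ness") and len(token) > 6 else token

def bag_of_words(text):
    """Count stemmed, non-stop-word tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)

quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
```

With stemming, the three occurrences of "greatness" and the single "great" collapse into one token with a count of 4, as in the table.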

Help ML to Work Better
When text is not actually unstructured
Input (TEXT):
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
Output (TEXT, TEXT):
title | body
Breaking News… | news covering…
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
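
When the text field holds JSON, as in the example above, pulling the keys out into separate fields takes only a few lines. This Python sketch assumes the field is valid JSON and that `title` and `body` are the keys of interest; the function name `extract_text_fields` is illustrative.

```python
import json

def extract_text_fields(raw, keys=("title", "body")):
    """Parse a JSON string and return the selected keys as separate features."""
    record = json.loads(raw)
    # Missing keys become empty strings so every row has the same fields.
    return {key: record.get(key, "").strip() for key in keys}

raw = json.dumps({
    "url": "cbsnews",
    "title": "Breaking News Headlines Business Entertainment World News ",
    "body": " news covering all the latest breaking national and world news "
            "headlines, including politics, sports, entertainment, business and more.",
})
features = extract_text_fields(raw)
```

Each extracted key can then be analyzed as its own text field rather than as one blob that also contains the JSON punctuation and key names.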
FE Demo #1

Help ML to Work at all
When the pattern does not exist
Goal: Predict principal direction from highway number
Highway Number | Direction | Is Long
2 | East-West | FALSE
4 | East-West | FALSE
5 | North-South | TRUE
8 | East-West | FALSE
10 | East-West | TRUE
… | … | …
( = (mod (field "Highway Number") 2) 0)
FE Demo #2

Feature Engineering
Discretization
Total Spend (NUM) | Spend Category (CAT)
7,342.99 | Top 33%
304.12 | Bottom 33%
4.56 | Bottom 33%
345.87 | Middle 33%
8,546.32 | Top 33%
With the numeric field: "Predict will spend $3,521 with error $1,232"
With the categorical field: "Predict customer will be Top 33% in spending"
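
Discretization boils down to comparing each value with a set of cut points. The Python sketch below assumes the tercile boundaries have already been computed over the full customer base; the two values in `CUT_POINTS` are hypothetical and were chosen only so that the five sample rows land in the categories shown above.

```python
from bisect import bisect_right

LABELS = ["Bottom 33%", "Middle 33%", "Top 33%"]
# Hypothetical tercile boundaries; in practice they come from the full dataset.
CUT_POINTS = [320.00, 5000.00]

def spend_category(total_spend, cut_points=CUT_POINTS):
    """Map a numeric spend to its categorical bucket."""
    return LABELS[bisect_right(cut_points, total_spend)]

spends = [7342.99, 304.12, 4.56, 345.87, 8546.32]
categories = [spend_category(s) for s in spends]
```

The trade-off is the one the slide points at: the categorical target loses precision but turns a noisy regression into a simpler classification question.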
FE Demo #3

Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc
• Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
  • Types: recomputes field types. Ex: #classes > 1000
  • Preferred: recomputes preferred status

Flatline Add Fields
Computing with Existing Features
(/ (field "Debt") (field "Income"))
Debt (NUM) | Income (NUM) | Debt to Income Ratio = Debt / Income (NUM)
10,134 | 100,000 | 0.10
85,234 | 134,000 | 0.64
8,112 | 21,500 | 0.38
0 | 45,900 | 0
17,534 | 52,000 | 0.34
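
The same computation in Python, as a quick check of the ratio column. Returning `None` for a zero or missing income is a choice made for this sketch, not something the slides specify.

```python
def debt_to_income(debt, income):
    """Return debt / income, or None when income is zero or missing."""
    if not income:
        return None
    return debt / income

# (Debt, Income) pairs from the table.
rows = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]
ratios = [round(debt_to_income(debt, income), 2) for debt, income in rows]
```

Rounded to two decimals, `ratios` reproduces the Debt to Income Ratio column.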
FE Demo #4

What is Flatline?
Flatline: a domain specific language for feature engineering and programmatic filtering
• DSL:
  • Invented by BigML - Programmatic / Optimized for speed
  • Transforms datasets into new datasets
    • Adding new fields / Filtering
  • Transformations are written in lisp-style syntax
• Feature Engineering
  • Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
  • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.

Flatline
• Lisp style syntax: Operators come first
  • Correct: (+ 1 2) => NOT Correct: (1 + 2)
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Limited programming language structures
  • let, cond, if, map, list operators, */+-, etc.
• Built-in transformations
  • statistics, strings, timestamps, windows

Flatline s-expressions
Adding Simple Labels to Data
Define "default" as missing three payments in a row
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))

Unlabelled data
Name | Month - 3 | Month - 2 | Month - 1
Joe Schmo | 123.23 | 0 | 0
Jane Plain | 0 | 0 | 0
Mary Happy | 0 | 55.22 | 243.33
Tom Thumb | 12.34 | 8.34 | 14.56

Labelled data
Name | Month - 3 | Month - 2 | Month - 1 | Default
Joe Schmo | 123.23 | 0 | 0 | FALSE
Jane Plain | 0 | 0 | 0 | TRUE
Mary Happy | 0 | 55.22 | 243.33 | FALSE
Tom Thumb | 12.34 | 8.34 | 14.56 | FALSE
FE Demo #5

Flatline s-expressions
Shock: Deviations from a Trend
Shock = (Current - (4-day avg)) / std dev

date | volume | price
1 | 34353 | 314
2 | 44455 | 315
3 | 22333 | 315
4 | 52322 | 321
5 | 28000 | 320
6 | 31254 | 319
7 | 56544 | 323
8 | 44331 | 324
9 | 81111 | 287
10 | 65422 | 294
11 | 59999 | 300
12 | 45556 | 302
13 | 19899 | 301

date | day-4 | day-3 | day-2 | day-1 | 4davg
1 | - | - | - | - | -
2 | - | - | - | 314 | -
3 | - | - | 314 | 315 | -
4 | - | 314 | 315 | 315 | -
5 | 314 | 315 | 315 | 321 | 316.25
6 | 315 | 315 | 321 | 320 | 317.75
7 | 315 | 321 | 320 | 319 | 318.75

Current: (field "price")
4-day avg: (avg-window "price" -4 -1)
std dev: (standard-deviation "price")
(/ (- (f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))

FE Demo #6
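
The shock feature from the preceding slides can be reproduced in Python to check the arithmetic. This is a sketch under two assumptions: rows without four full days of history get no value, and the standard deviation is taken over the whole price column as a population statistic (the slides do not say which variant Flatline's `standard-deviation` uses).

```python
from statistics import mean, pstdev

# The price column from the table (dates 1 to 13).
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]

def shock(series, window=4):
    """(current - mean of the previous `window` values) / std dev of the series."""
    sigma = pstdev(series)   # assumption: population standard deviation
    result = []
    for i, current in enumerate(series):
        if i < window:
            result.append(None)              # not enough history yet
        else:
            trend = mean(series[i - window:i])
            result.append((current - trend) / sigma)
    return result

shocks = shock(prices)
```

The largest negative shock falls on date 9, where the price drops from 324 to 287 against a four-day average of 321.5.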

Advanced s-expressions
Moon Phase%
(/ (mod (- (/ (epoch (field "date-field")) 1000) 621300) 2551443) 2551442)
Highway isEven?
(= (mod (field "Highway Number") 2) 0)
Distance Lat/Long <=> Ref (Haversine)
(let (R 6371000 latA (to-radians {lat-ref}) latB (to-radians (field "LATITUDE")) latD (- latB latA) longD (to-radians (- (field "LONGITUDE") {long-ref})) a (+ (square (sin (/ latD 2))) (* (cos latA) (cos latB) (square (sin (/ longD 2))))) c (* 2 (asin (min (list 1 (sqrt a)))))) (* R c))
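
For readers who find the nested s-expression hard to follow, here is the same Haversine computation written out in Python, step by step. The variable names mirror the Flatline bindings; the function name `haversine_m` and the argument order are choices made for this sketch.

```python
import math

EARTH_RADIUS_M = 6371000  # same R as in the Flatline expression, in meters

def haversine_m(lat_ref, long_ref, lat, long):
    """Great-circle distance in meters between a reference point and (lat, long)."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    long_d = math.radians(long - long_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(long_d / 2) ** 2)
    # min(1, ...) guards against rounding errors pushing sqrt(a) above 1.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c

one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
```

As a sanity check, one degree of latitude comes out at roughly 111.2 km with this radius.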

WhizzML + Flatline
[Figure labels: WhizzML script (Gallery) wrapping the Haversine Flatline expression; inputs: Input Dataset, LAT Ref, LONG Ref; output: Output Dataset]

Feature Engineering
Fix Missing Values in a "Meaningful" Way
[Figure labels: workflow steps Filter Zeros, Model insulin, Predict insulin, Select insulin; datasets Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset]
( if ( = (field "insulin") 0) (field "predicted insulin") (field "insulin"))
FE Demo #7

Feature Selection
• Model Summary
  • Field Importance
• Algorithmic
  • Best-First Feature Selection
  • Boruta
• Leakage
  • Tight Correlations (AD, Plot, Correlations)
  • Test Data
  • Perfect future knowledge
cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
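
The shell one-liner above counts lines that appear more than once across the two files, which is a quick test for training rows leaking into the test set. Note that it also counts the shared header line and any duplicates inside a single file. A Python sketch that reports only rows present in both sets might look like this; the sample rows are invented for illustration.

```python
def shared_rows(train_lines, test_lines):
    """Return the rows that occur in both the training and the test data."""
    return set(train_lines) & set(test_lines)

# Hypothetical CSV rows (header already removed).
train = ["6,148,72,true", "1,85,66,false", "8,183,64,true"]
test = ["1,85,66,false", "5,116,74,false"]
leaked = shared_rows(train, test)
```

Any row in `leaked` would let a model score well on the test set simply by memorizing the training data.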

Feature Selection
Leakage
• Sales pipeline where step n-1 has no other outcome than step n.
• Stock close predicts stock open
• Churn retention: the worst rep is actually the best (correlation != causation)
• Cancer prediction where one input is a doctor-ordered test for the condition
• Account ID predicts fraud (because only new accounts are fraudsters)

Evaluate & Automate
• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don't want to hand code that every time, right?
  • Consider tools that are easy to automate
    • Scripting interface
    • APIs
  • Ability to maintain is important

The Process
[Figure: flowchart with the steps Define Goal, Data Transform, Feature Engineer & Selection, Model & Evaluate, and the decision "Goal Met?". Yes leads to Automate; no leads to Better Data, Better Features, Tune Algorithm, or Not Possible]

Summary
• Feature Engineering: what is it / why it is important
• Automatic transformations: date-time, text, etc
• Built-in functions: filtering and feature engineering
  • Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
  • Structure
  • Examples: Adding fields / filtering
• When building features it is important to watch for leakage
• The critical importance of automating

VSSML17 L5. Basic Data Transformations and Feature Engineering

More Related Content

What's hot

VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1BigML, Inc
Ā 
BigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML, Inc
Ā 
BSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBigML, Inc
Ā 
VSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data TransformationsVSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data TransformationsBigML, Inc
Ā 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBigML, Inc
Ā 
VSSML17 L2. Ensembles and Logistic Regressions
VSSML17 L2. Ensembles and Logistic RegressionsVSSML17 L2. Ensembles and Logistic Regressions
VSSML17 L2. Ensembles and Logistic RegressionsBigML, Inc
Ā 
VSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionVSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionBigML, Inc
Ā 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2BigML, Inc
Ā 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsBigML, Inc
Ā 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering BigML, Inc
Ā 
BSSML16 L2. Ensembles and Logistic Regressions
BSSML16 L2. Ensembles and Logistic RegressionsBSSML16 L2. Ensembles and Logistic Regressions
BSSML16 L2. Ensembles and Logistic RegressionsBigML, Inc
Ā 
BSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBigML, Inc
Ā 
VSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly DetectionVSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly DetectionBigML, Inc
Ā 
BSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, EvaluationsBSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, EvaluationsBigML, Inc
Ā 
BSSML17 - Time Series
BSSML17 - Time SeriesBSSML17 - Time Series
BSSML17 - Time SeriesBigML, Inc
Ā 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
Ā 
BSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBigML, Inc
Ā 
VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
Ā 
BSSML17 - Clusters
BSSML17 - ClustersBSSML17 - Clusters
BSSML17 - ClustersBigML, Inc
Ā 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature EngineeringBigML, Inc
Ā 

What's hot (20)

VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
Ā 
BigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with FlatlineBigML Education - Feature Engineering with Flatline
BigML Education - Feature Engineering with Flatline
Ā 
BSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data TransformationsBSSML16 L6. Basic Data Transformations
BSSML16 L6. Basic Data Transformations
Ā 
VSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data TransformationsVSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data Transformations
Ā 
BSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 SessionsBSSML16 L5. Summary Day 1 Sessions
BSSML16 L5. Summary Day 1 Sessions
Ā 
VSSML17 L2. Ensembles and Logistic Regressions
VSSML17 L2. Ensembles and Logistic RegressionsVSSML17 L2. Ensembles and Logistic Regressions
VSSML17 L2. Ensembles and Logistic Regressions
Ā 
VSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionVSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic Regression
Ā 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2
Ā 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 Sessions
Ā 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering
Ā 
BSSML16 L2. Ensembles and Logistic Regressions
BSSML16 L2. Ensembles and Logistic RegressionsBSSML16 L2. Ensembles and Logistic Regressions
BSSML16 L2. Ensembles and Logistic Regressions
Ā 
BSSML17 - API and WhizzML
BSSML17 - API and WhizzMLBSSML17 - API and WhizzML
BSSML17 - API and WhizzML
Ā 
VSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly DetectionVSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly Detection
Ā 
BSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, EvaluationsBSSML17 - Introduction, Models, Evaluations
BSSML17 - Introduction, Models, Evaluations
Ā 
BSSML17 - Time Series
BSSML17 - Time SeriesBSSML17 - Time Series
BSSML17 - Time Series
Ā 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
Ā 
BSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic Modeling
Ā 
VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly Detection
Ā 
BSSML17 - Clusters
BSSML17 - ClustersBSSML17 - Clusters
BSSML17 - Clusters
Ā 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature Engineering
Ā 







VSSML17 L5. Basic Data Transformations and Feature Engineering

  • 1. Valencian Summer School in Machine Learning 3rd edition September 14-15, 2017
  • 2. Basic Transformations: Making Data Machine Learning Ready. Poul Petersen, CIO, BigML, Inc
  • 3. In a Perfect World…
      Q: How does a physicist milk a cow?
      A: Well, first let us consider a spherical cow...
      Q: How does a data scientist build a model?
      A: Well, first let us consider perfectly formatted data…
  • 4. The Dream: CSV, Dataset, Model, Profit!
  • 5. The Reality: CRM, Web Accounts, Transactions. ML Ready?
  • 6. Obstacles
      • Data Structure
          • Scattered across systems
          • Wrong "shape"
          • Unlabelled data
      • Data Value
          • Format: spelling, units
          • Missing values
          • Non-optimal correlation
          • Non-existent correlation
      • Data Significance
          • Unwanted: PII, Non-Preferred
          • Expensive to collect
          • Insidious: Leakage, obviously correlated
      Data Transformation, Feature Engineering, Feature Selection
  • 7. The Process
      • Define a clear idea of the goal.
          • Sometimes this comes later…
      • Understand what ML tasks will achieve the goal.
      • Transform the data
          • where is it, how is it stored?
          • what are the features?
          • can you access it programmatically?
      • Feature Engineering: transform the data you have into the data you actually need.
      • Evaluate: try it on a small scale.
      • Accept that you might have to start over…
      • But when it works, automate it!
  • 8. Data Transformations
  • 9. BigML Tasks
      Goal                                                     ML Task
      Will this customer default on a loan?                    Classification
      How many customers will apply for a loan next month?     Regression
      Is the consumption of this product unusual?              Anomaly Detection
      Is the behavior of the customers similar?                Cluster Analysis
      Are these products purchased together?                   Association Discovery
  • 10. Classification: Categorical. Training, Testing, Predicting
  • 11. Regression: Numeric. Training, Testing, Predicting
  • 12. Anomaly Detection
  • 13. Cluster Analysis
  • 14. Association Discovery
  • 15. ML Ready Data: Instances, Fields (Features)
      Tabular Data (rows and columns):
      • Each row
          • is one instance.
          • contains all the information about that one instance.
      • Each column
          • is a field that describes a property of the instance.
  • 16. Data Labeling
      • Unsupervised Learning: Anomaly Detection, Clustering, Association Discovery
      • Supervised Learning: Classification, Regression
      The only difference, in terms of ML-Ready structure, is the presence of a "label".
  • 17. Data Labelling. Data is often not labeled. Create labels with a transformation. Can be done at the Feature Engineering step as well.
      Un-labelled data:
        Name         Month - 3   Month - 2   Month - 1
        Joe Schmo    123.23      0           0
        Jane Plain   0           0           0
        Mary Happy   0           55.22       243.33
        Tom Thumb    12.34       8.34        14.56
      Labelled data:
        Name         Month - 3   Month - 2   Month - 1   Default
        Joe Schmo    123.23      0           0           FALSE
        Jane Plain   0           0           0           TRUE
        Mary Happy   0           55.22       243.33      FALSE
        Tom Thumb    12.34       8.34        14.56       FALSE
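The labelling transformation on slide 17 can be sketched in Python. The slide does not state the rule behind the Default column, so the rule used here (no payment in any of the three months) is an assumption chosen only because it reproduces the slide's table; the field names `month_3`, `month_2`, `month_1` are likewise hypothetical.

```python
# Rows copied from the un-labelled table on slide 17.
rows = [
    {"name": "Joe Schmo", "month_3": 123.23, "month_2": 0.0, "month_1": 0.0},
    {"name": "Jane Plain", "month_3": 0.0, "month_2": 0.0, "month_1": 0.0},
    {"name": "Mary Happy", "month_3": 0.0, "month_2": 55.22, "month_1": 243.33},
    {"name": "Tom Thumb", "month_3": 12.34, "month_2": 8.34, "month_1": 14.56},
]


def add_default_label(row):
    """Return a copy of the row with a boolean 'default' label.

    Assumed rule (not stated on the slide): a customer defaults when
    no payment was recorded in any of the three months.
    """
    payments = (row["month_3"], row["month_2"], row["month_1"])
    return {**row, "default": all(p == 0 for p in payments)}


labelled = [add_default_label(r) for r in rows]
for r in labelled:
    print(r["name"], r["default"])
```

In a real project the labelling rule would come from the business definition of the target, not from the data itself.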
  • 18. SF Restaurants Example
      https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
        create database sf_restaurants;
        use sf_restaurants;
        create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
        load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
        create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
        load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
        create table violations (business_id int, vdate varchar(8), description varchar(1000));
        load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
        create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
        load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
  • 19. Transformations Demo #1
  • 20. Data Cleaning. Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
      Original data:
        Name            Date         Duration (s)   Genre     Plays
        Highway star    1984-05-24   -              Rock      139
        Blues alive     1990/03/01   281            Blues     239
        Lonely planet   2002-11-19   5:32s          Techno    42
        Dance, dance    02/23/1983   312            Disco     N/A
        The wall        1943-01-20   218            Reggae    83
        Offside down    1965-02-19   4 minutes      Techno    895
        The alchemist   2001-11-21   418            Bluesss   178
        Bring me down   18-10-98     328            Classic   21
        The scarecrow   1994-10-12   269            Rock      734
      Cleaned data:
        Name            Date         Duration (s)   Genre     Plays
        Highway star    1984-05-24                  Rock      139
        Blues alive     1990-03-01   281            Blues     239
        Lonely planet   2002-11-19   332            Techno    42
        Dance, dance    1983-02-23   312            Disco
        The wall        1943-01-20   218            Reggae    83
        Offside down    1965-02-19   240            Techno    895
        The alchemist   2001-11-21   418            Blues     178
        Bring me down   1998-10-18   328            Classic   21
        The scarecrow   1994-10-12   269            Rock      734
      Cleaning the violation descriptions in the SF restaurants example:
        update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
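The cleaning of the song table on slide 20 can be sketched with a few small Python helpers. This is an illustrative sketch, not the method used in the talk: the list of accepted date formats, the duration patterns and the `GENRE_FIXES` mapping are assumptions derived only from the values visible on the slide, and a different dataset would need its own rules.

```python
import re
from datetime import datetime

# Assumed formats, tried in order; ambiguous inputs resolve to the first match.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")

# Hypothetical lookup of known misspellings.
GENRE_FIXES = {"Bluesss": "Blues"}


def clean_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(text):
    """Return the duration in seconds, or None for a missing value."""
    text = text.strip().lower()
    if text.isdigit():
        return int(text)
    match = re.fullmatch(r"(\d+):(\d{2})s?", text)  # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+)\s*minutes?", text)  # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return None  # "-", "N/A", empty, ...


def clean_plays(text):
    """Return the play count as an int, or None for a missing value."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def clean_genre(text):
    """Map known misspellings to their canonical genre name."""
    return GENRE_FIXES.get(text.strip(), text.strip())


print(clean_date("18-10-98"), clean_duration("5:32s"), clean_genre("Bluesss"))
```

Returning `None` for every flavour of missing value ("-", "N/A") is the homogenizing step: downstream code then has a single representation to handle.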
  • 21. Transformations Demo #2
  • 22. Define a Goal
      • Predict rating: Poor / Needs Improvement / Adequate / Good
      • This is a classification problem
      • Based on business profile:
          • Description: kitchen, cafe, etc.
          • Location: zip, latitude, longitude
  • 23. Denormalizing. Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single dataset.
      [Diagram labels: business, inspections, violations, scores, join; Instances; Features; (millions)]
      Denormalize:
        create table scores select * from businesses left join inspections using (business_id);
        create table scores_last select a.* from scores as a JOIN (select business_id,max(idate) as idate from scores group by business_id) as b where a.business_id=b.business_id and a.idate=b.idate;
      ML-Ready: each row contains all the information about that one instance.
      Add Label:
        create table scores_last_label select scores_last.*, Description as score_label from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
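The three steps on slide 23 (join, keep the latest inspection per business, attach the label from the legend) can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: it uses SQLite rather than MySQL, the tables are cut down to a few columns, and every row, including the legend score ranges, is made-up sample data rather than the San Francisco dataset. It also uses inner joins, so a business without any inspection is dropped.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table businesses (business_id int, name text);
create table inspections (business_id int, score int, idate text);
create table legend (Minimum_Score int, Maximum_Score int, Description text);

-- Made-up sample rows; idate is YYYYMMDD text so max() picks the latest.
insert into businesses values (1, 'Cafe A'), (2, 'Kitchen B');
insert into inspections values
    (1, 80, '20130101'), (1, 95, '20130601'), (2, 70, '20130301');
-- Hypothetical score ranges; only the four label names come from the slides.
insert into legend values
    (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
    (86, 90, 'Adequate'), (91, 100, 'Good');
""")

rows = con.execute("""
select b.name, i.score, l.Description as score_label
from businesses as b
join inspections as i on i.business_id = b.business_id
join (select business_id, max(idate) as idate
      from inspections group by business_id) as last
  on last.business_id = i.business_id and last.idate = i.idate
join legend as l
  on i.score between l.Minimum_Score and l.Maximum_Score
order by b.business_id
""").fetchall()

for row in rows:
    print(row)
```

Each output row now carries everything known about one business, which is the ML-Ready shape the slide describes.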
  • 24. Transformations Demo #3
  • 25. Structuring Output. After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
      • A CSV file uses plain text to store tabular data.
      • In a CSV file, each row of the file is an instance.
      • Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
      • Each row must contain the same number of fields
          • but they can be null
      • Fields can be quoted using double quotes (").
      • Fields that contain commas or line separators must be quoted.
      • Quotes (") in fields must be doubled ("").
      • The character encoding must be UTF-8
      • Optionally, a CSV file can use the first line as a header to provide the names of each field.
        select * from scores_last_label into outfile "./scores_last_label.csv";
        select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, score_label from scores_last_label into outfile "./scores_last_label_headers.csv" ;
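The CSV rules on slide 25 (quote fields containing commas, double embedded quotes, same number of fields per row, optional header) can be illustrated with Python's standard `csv` module. The column names and the two sample rows are hypothetical; the sketch writes to an in-memory buffer, and a real export would instead open the file with `encoding="utf-8"` and `newline=""`.

```python
import csv
import io

header = ["name", "address", "score_label"]
rows = [
    # Contains double quotes and a comma, so both fields must be quoted.
    ['Joe\'s "Famous" Diner', "1 Main St, Suite 2", "Good"],
    # A null field is written as an empty string; the field count stays 3.
    ["Plain Cafe", None, "Adequate"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(header)  # optional first line naming each field
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

`QUOTE_MINIMAL` quotes only the fields that need it, which keeps the file readable while still following the quoting rules above.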
  • 26. Transformations Demo #4
  • 27. Define a Goal • Predict rating: Poor / Needs Improvement / Adequate / Good • This is a classification problem • Based on business profile: • Description: kitchen, restaurant, etc. • Location: zip code, latitude, longitude • Number of violations, text of violations
  • 28. Aggregating: When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
      Original data (list of playbacks):
        Content        Genre    Duration  Play Time            User     Device
        Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
        Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
        Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
        Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
        The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
        Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
        The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
        Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
        The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data (list of users):
        User     Num.Playbacks  Total Time  Pref.Device
        User001  3              830         Tablet
        User002  1              218         Smartphone
        User003  3              1019        TV
        User005  2              521         Tablet
      Command line:
        tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
        tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
      SQL:
        SET @@group_concat_max_len = 15000;
        create table violations_aggregated select business_id, count(*) as violation_num, group_concat(description) as violation_txt from violations group by business_id;
        create table scores_last_label_violations select * from scores_last_label left join violations_aggregated USING (business_id);
        select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'violation_num', 'violation_txt', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, violation_num, violation_txt, score_label from scores_last_label_violations into outfile "./scores_last_label_violations_headers.csv";
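The playback-to-user aggregation in the two tables can be reproduced with a few lines of Python. This is only a sketch: the function name and the tie-breaking rule for the preferred device (the device seen first wins a tie) are assumptions, not something the slides specify.

```python
from collections import Counter, defaultdict

# (content, genre, duration, play_time, user, device), copied from the slide.
playbacks = [
    ("Highway star", "Rock", 190, "2015-05-12 16:29:33", "User001", "TV"),
    ("Blues alive", "Blues", 281, "2015-05-13 12:31:21", "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "2015-05-13 14:26:04", "User003", "TV"),
    ("Dance, dance", "Disco", 312, "2015-05-13 18:12:45", "User001", "Tablet"),
    ("The wall", "Reggae", 218, "2015-05-14 09:02:55", "User002", "Smartphone"),
    ("Offside down", "Techno", 240, "2015-05-14 11:26:32", "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "2015-05-14 21:44:15", "User003", "TV"),
    ("Bring me down", "Classic", 328, "2015-05-15 06:59:56", "User001", "Tablet"),
    ("The scarecrow", "Rock", 269, "2015-05-15 12:37:05", "User003", "Smartphone"),
]


def aggregate_by_user(rows):
    """Return {user: (num_playbacks, total_time, preferred_device)}."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    devices = defaultdict(Counter)
    for _content, _genre, duration, _played, user, device in rows:
        counts[user] += 1
        totals[user] += duration
        devices[user][device] += 1
    return {
        user: (counts[user], totals[user], devices[user].most_common(1)[0][0])
        for user in sorted(counts)
    }


users = aggregate_by_user(playbacks)
for user, summary in users.items():
    print(user, *summary)
```

Working on parsed tuples rather than splitting raw lines on commas avoids miscounting titles such as "Dance, dance", which contain the separator.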
  • 29. Transformations Demo #5
  • 30. Pivoting: Different values of a feature are pivoted to new columns in the result dataset.
      Original data:
        Content        Genre    Duration  Play Time            User     Device
        Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
        Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
        Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
        Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
        The wall       Reggae   218       2015-05-14 09:02:55  User002  Smartphone
        Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
        The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
        Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
        The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data with pivoted columns (NP = number of playbacks, TT = total time):
        User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
        User001  3              830         Tablet       1      2          0              190    640        0
        User002  1              218         Smartphone   0      0          1              0      0          218
        User003  3              1019        TV           2      0          1              750    0          269
        User005  2              521         Tablet       0      2          0              0      521        0
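Pivoting can be sketched in plain Python as follows. The input keeps only the three columns the pivot needs (user, device, duration), and the function name and the fixed device list are illustrative assumptions.

```python
from collections import defaultdict

# (user, device, duration) for each playback in the slide's table.
playbacks = [
    ("User001", "TV", 190), ("User005", "Tablet", 281), ("User003", "TV", 332),
    ("User001", "Tablet", 312), ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240), ("User003", "TV", 418),
    ("User001", "Tablet", 328), ("User003", "Smartphone", 269),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(rows, devices=DEVICES):
    """One row per user, with NP_<device> and TT_<device> columns."""
    def empty_row():
        return {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in devices}

    table = defaultdict(empty_row)
    for user, device, duration in rows:
        table[user][f"NP_{device}"] += 1         # number of playbacks
        table[user][f"TT_{device}"] += duration  # total time
    return dict(table)


pivoted = pivot_by_device(playbacks)
for user in sorted(pivoted):
    print(user, pivoted[user])
```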
  • 31. Time Windows: Create new features using values over different periods of time. (Diagram: a dataset of instances, in the millions, by features, in the thousands, repeated along a time axis at t=1, t=2, t=3.)
      create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013 from inspections as a JOIN (select business_id, max(idate) as idate from inspections where substr(idate,1,4) = "2013" group by business_id) as b where a.business_id = b.business_id and a.idate = b.idate;
      create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
      (The scores_2014 table is not shown on the slide; presumably it is built the same way for 2014.)
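The same per-year windowing can be sketched in Python. The inspection rows below are invented, and the sketch assumes, as the substr(idate,1,4) call suggests, that dates are strings starting with the four-digit year.

```python
# (business_id, score, idate) with idate written as YYYYMMDD. Sample data only.
inspections = [
    (1, 90, "20130105"),
    (1, 85, "20130930"),
    (1, 96, "20140210"),
    (2, 72, "20131120"),
]


def last_score_per_year(rows, years):
    """Return {business_id: {"score_<year>": last score in that year or None}}."""
    latest = {}  # (business_id, year) -> (idate, score) of the latest inspection
    for business_id, score, idate in rows:
        year = idate[:4]
        if year not in years:
            continue
        key = (business_id, year)
        if key not in latest or idate > latest[key][0]:
            latest[key] = (idate, score)

    features = {}
    for (business_id, year), (_idate, score) in latest.items():
        row = features.setdefault(business_id, {f"score_{y}": None for y in years})
        row[f"score_{year}"] = score
    return features


features = last_score_per_year(inspections, years=("2013", "2014"))
print(features)
```

Unlike the left join from businesses on the slide, this sketch only returns businesses that have at least one inspection in the requested years.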
  • 32. Transformations Demo #6
  • 33. Updates: Need a current view of the data, but new data only comes in batches of changes. (Diagram: batches for day 1, day 2, and day 3 applied to a dataset of instances by features.)
  • 34. Streaming: Data only comes in single changes. (Diagram labels: data stream; Stream; Batch (Kafka, etc.); Instances; Features.)
  • 35. Prosper Loan Life Cycle. Listing states: Submit, Bids, Funded, Cancelled, Withdraw, Expired. Loan states: Current, Late, Paid, Defaulted.
      Goal                                          ML Task
      Q: Which new listings make it to funded?      Classification
      Q: Which funded loans make it to paid?        Classification
      Q: If funded, what will be the rate?          Regression
  • 36. Prosper Example: Data provided in XML, with updates!! (Diagram: fetch.sh, a daily “curl” of the XML → import.py → export.sh → bigml.sh → Model, Predict, Share in gallery. Other labels: Denormalization with join; Status; LoanStatus; BorrowerRate.)
  • 37. Prosper Example: Tidbits and Lessons Learned… • XML… yuck! • MongoDB has CSV export and is record based, so it is easy to handle changing data structure. • Feature Engineering: • There are 5 different classes of “bad” loans • Date cleanup • Type casting: floats and ints • Would be better to track over time: • number of late payments • compare predictions and actuals • XML… yuck!
  • 38. Tools
  • 39. Tools • Command Line? • join, cut, awk, sed, sort, uniq • Automation • Shell, Python, crontab, etc. • Talend • BigML: bindings, bigmler, API, WhizzML • Relational DB • MySQL • Non-Relational DB • MongoDB
  • 40. Talend: Denormalization Example. https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
  • 41. Summary • Data is awful • Requires clean-up • Transformations • Consumes an enormous part of the effort in applying ML • Techniques: • Denormalizing • Aggregating / Pivoting • Time windows / Streaming • What a real workflow looks like and the tools required
  • 42. Feature Engineering: Creating Features that Make Machine Learning Work. Poul Petersen, CIO, BigML, Inc
  • 43. What is Feature Engineering? Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all. • This is really, really important, more than algorithm selection! • In fact, so important that BigML often does it automatically • ML algorithms have no deeper understanding of data • Numerical: have a natural order, can be scaled, etc. • Categorical: have discrete values, etc. • The "magic" is the ability to find patterns quickly and efficiently • ML algorithms only know what you tell/show them with data • Medical: kg and m, but BMI = kg/m² is better • Lending: Debt and Income, but DTI (debt-to-income ratio) is better • Intuition can be risky: remember to prove it with an evaluation!
  • 44. Built-in Transformations: Date-Time Fields • Date-time fields have a lot of information "packed" into them • Splitting out the time components allows ML algorithms to discover time-based patterns. Example: the DATE-TIME value 2013-09-25 10:02 is split into the fields year = 2013, month = Sep, day = 25, hour = 10, minute = 2, each typed as numeric (NUM) or categorical (CAT).
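Outside BigML, the same split can be sketched with Python's datetime module. The function name and the input format string are assumptions chosen to match the example value on the slide; the month is returned as a number here to avoid locale-dependent names such as "Sep".

```python
from datetime import datetime


def split_datetime(value, fmt="%Y-%m-%d %H:%M"):
    """Unpack a date-time string into separate numeric components."""
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
    }


parts = split_datetime("2013-09-25 10:02")
print(parts)
```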
  • 45. Built-in Transformations: Categorical Fields for Clustering/LR • Clustering and Logistic Regression require numeric fields for inputs • Categorical values are transformed to numeric vectors automatically* • *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
      alchemy_category (CAT)   business (NUM)  health (NUM)  recreation (NUM)
      business                 1               0             0
      recreation               0               0             1
      health                   0               1             0
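The table above is a one-hot encoding. A minimal Python sketch of that encoding follows; it is a generic illustration, not BigML's internal implementation, and the alphabetical column order is an assumption that happens to match the slide.

```python
def one_hot(values):
    """Return (categories, rows) where each row is a 0/1 vector."""
    categories = sorted(set(values))  # one output column per distinct value
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


columns, encoded = one_hot(["business", "recreation", "health"])
print(columns)
for row in encoded:
    print(row)
```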
  • 46. Built-in Transformations: Text Fields • Unstructured text contains a lot of potentially interesting patterns • Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text • Another option is Topic Modeling to extract thematic meaning. Example (TEXT): “Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ‘em.” Resulting token counts (NUM): great = 4, afraid = 1, born = 1, achieve = 1, …
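A rough bag-of-words sketch that reproduces the counts on the slide is shown below. The stopword list and the one-rule "stemmer" (strip a trailing "ness") are deliberately crude assumptions made for this single sentence; real tokenizers and stemmers, including whatever BigML uses, are more sophisticated.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list, sufficient only for this example.
STOPWORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}


def crude_stem(token):
    """Map e.g. 'greatness' to 'great'. A stand-in for a real stemmer."""
    if token.endswith("ness") and len(token) > 6:
        return token[:-4]
    return token


def bag_of_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
print(counts.most_common())
```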
  • 47. Help ML to Work Better: When text is not actually unstructured • In this case, the text field has structure (key/value pairs) • Extracting the structure as new features may allow the ML algorithm to work better. Example (TEXT): { "url": "cbsnews", "title": "Breaking News Headlines Business Entertainment World News", "body": "news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more." } is split into two new TEXT fields: title = "Breaking News…" and body = "news covering…".
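When the embedded structure is JSON, as in this example, the extraction can be sketched with Python's json module. The helper name and the choice to return None for missing or unparseable input are assumptions.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News", '
       '"body": "news covering all the latest breaking national and world '
       'news headlines, including politics, sports, entertainment, business '
       'and more."}')


def extract_fields(text, keys=("title", "body")):
    """Pull selected keys out of a JSON text field as separate features."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        record = {}
    if not isinstance(record, dict):
        record = {}
    return {key: record.get(key) for key in keys}


fields = extract_fields(raw)
print(fields["title"])
print(fields["body"])
```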
  • 48. FE Demo #1
  • 49. Help ML to Work at All: When the pattern does not exist. Goal: Predict principal direction from highway number.
      Highway Number  Direction    Is Long
      2               East-West    FALSE
      4               East-West    FALSE
      5               North-South  TRUE
      8               East-West    FALSE
      10              East-West    TRUE
      …               …            …
      (= (mod (field "Highway Number") 2) 0)
  • 50. FE Demo #2
  • 51. Feature Engineering: Discretization
      Total Spend (NUM)   Spend Category (CAT)
      7,342.99            Top 33%
      304.12              Bottom 33%
      4.56                Bottom 33%
      345.87              Middle 33%
      8,546.32            Top 33%
      Numeric target: “Predict will spend $3,521 with error $1,232”
      Categorical target: “Predict customer will be Top 33% in spending”
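One simple way to discretize a numeric field into three equal-frequency bins is by rank. The sketch below is an assumption about how such bins could be computed (using the midpoint rank of each value); it is not necessarily how BigML's built-in discretization chooses its boundaries.

```python
def discretize_terciles(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, index in enumerate(order):
        # Midpoint percentile rank, between 0 and 1.
        fraction = (rank + 0.5) / len(values)
        if fraction < 1 / 3:
            labels[index] = "Bottom 33%"
        elif fraction < 2 / 3:
            labels[index] = "Middle 33%"
        else:
            labels[index] = "Top 33%"
    return labels


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = discretize_terciles(total_spend)
for amount, category in zip(total_spend, spend_category):
    print(amount, category)
```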
  • 52. FE Demo #3
  • 53. Built-ins for FE • Discretize: Converts a numeric value to categorical • Replace missing values: fixed/max/mean/median/etc. • Normalize: Adjust a numeric value to a specific range of values while preserving the distribution • Math: Exponentiation, Logarithms, Squares, Roots, etc. • Types: Force a field value to categorical, integer, or real • Random: Create random values for introducing noise • Statistics: Mean, Population • Refresh Fields: • Types: recomputes field types. Ex: #classes > 1000 • Preferred: recomputes preferred status
  • 54. Flatline Add Fields: Computing with Existing Features. Debt / Income: (/ (field "Debt") (field "Income"))
      Debt (NUM)  Income (NUM)  Debt to Income Ratio (NUM)
      10,134      100,000       0.10
      85,234      134,000       0.64
      8,112       21,500        0.38
      0           45,900        0
      17,534      52,000        0.34
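For comparison, the same derived field can be sketched in Python. Returning None when income is zero or missing is an assumption; the slide does not say how that case is handled.

```python
def debt_to_income(debt, income):
    """Debt-to-income ratio, or None when income is zero or missing."""
    if not income:
        return None
    return debt / income


# (debt, income) pairs from the slide's table.
loans = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]
ratios = [round(debt_to_income(debt, income), 2) for debt, income in loans]
print(ratios)
```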
  • 55. FE Demo #4
  • 56. What is Flatline? Flatline: a domain-specific language for feature engineering and programmatic filtering • DSL: • Invented by BigML - Programmatic / Optimized for speed • Transforms datasets into new datasets • Adding new fields / Filtering • Transformations are written in Lisp-style syntax • Feature Engineering • Computing new fields: (/ (field "Debt") (field "Income")) • Programmatic Filtering: • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.
  • 57. Flatline • Lisp-style syntax: Operators come first • Correct: (+ 1 2) => NOT Correct: (1 + 2) • Dataset fields are first-class citizens • (field "diabetes pedigree") • Limited programming language structures • let, cond, if, map, list operators, *, /, +, -, etc. • Built-in transformations • statistics, strings, timestamps, windows
  • 58. Flatline s-expressions: Adding Simple Labels to Data. Define "default" as missing three payments in a row.
      (= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
      Unlabelled data:
        Name        Month - 3  Month - 2  Month - 1
        Joe Schmo   123.23     0          0
        Jane Plain  0          0          0
        Mary Happy  0          55.22      243.33
        Tom Thumb   12.34      8.34       14.56
      Labelled data:
        Name        Month - 3  Month - 2  Month - 1  Default
        Joe Schmo   123.23     0          0          FALSE
        Jane Plain  0          0          0          TRUE
        Mary Happy  0          55.22      243.33     FALSE
        Tom Thumb   12.34      8.34       14.56      FALSE
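The labelling rule reads naturally in Python as well. This sketch mirrors the s-expression (the sum of absolute payments over the three months equals zero); the function and variable names are illustrative.

```python
def is_default(payments):
    """True when every payment in the window is zero (all payments missed)."""
    return sum(abs(p) for p in payments) == 0


# Name -> payments for Month - 3, Month - 2, Month - 1, from the slide.
customers = {
    "Joe Schmo": (123.23, 0, 0),
    "Jane Plain": (0, 0, 0),
    "Mary Happy": (0, 55.22, 243.33),
    "Tom Thumb": (12.34, 8.34, 14.56),
}
labels = {name: is_default(p) for name, p in customers.items()}
print(labels)
```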
  • 59. FE Demo #5
  • 60. Flatline s-expressions: Shock: Deviations from a Trend. Shock = (Current - (4-day avg)) / std dev
      date  volume  price  |  day-4  day-3  day-2  day-1  4davg
      1     34353   314    |                              -
      2     44455   315    |                       314    -
      3     22333   315    |                314    315    -
      4     52322   321    |         314    315    315    -
      5     28000   320    |  314    315    315    321    316.25
      6     31254   319    |  315    315    321    320    317.75
      7     56544   323    |  315    321    320    319    318.75
      8     44331   324    |
      9     81111   287    |
      10    65422   294    |
      11    59999   300    |
      12    45556   302    |
      13    19899   301    |
  • 61. Flatline s-expressions: Shock: Deviations from a Trend. Shock = (Current - (4-day avg)) / std dev
      Current:   (field "price")
      4-day avg: (avg-window "price" -4 -1)
      std dev:   (standard-deviation "price")
      (/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
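The shock feature can be sketched in Python to check the window arithmetic against the table of 4-day averages. Two assumptions are made here: rows without four previous days get None, and the standard deviation is the population standard deviation of the whole price column. Flatline's exact behavior on both points is not stated in the slides.

```python
from statistics import mean, pstdev

# Daily prices for dates 1 to 13, from the slide.
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]


def trailing_average(series, index, window=4):
    """Average of the `window` values before `index`, or None if too few."""
    if index < window:
        return None
    return mean(series[index - window:index])


def shock(series, window=4):
    """(current - trailing average) / std dev of the whole series."""
    spread = pstdev(series)
    result = []
    for index, current in enumerate(series):
        average = trailing_average(series, index, window)
        result.append(None if average is None else (current - average) / spread)
    return result


averages = [trailing_average(prices, i) for i in range(len(prices))]
shocks = shock(prices)
for day, (avg, value) in enumerate(zip(averages, shocks), start=1):
    print(day, avg, value)
```

On this data the most negative shock falls on date 9, where the price drops from 324 to 287.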
  • 62. FE Demo #6
  • 63. Advanced s-expressions
      Moon Phase %:
        (/ (mod (- (/ (epoch (field "date-field")) 1000) 621300) 2551443) 2551442)
      Highway isEven?:
        (= (mod (field "Highway Number") 2) 0)
      Distance Lat/Long <=> Ref (Haversine):
        (let (R 6371000
              latA (to-radians {lat-ref})
              latB (to-radians (field "LATITUDE"))
              latD (- latB latA)
              longD (to-radians (- (field "LONGITUDE") {long-ref}))
              a (+ (square (sin (/ latD 2))) (* (cos latA) (cos latB) (square (sin (/ longD 2)))))
              c (* 2 (asin (min (list 1 (sqrt a))))))
          (* R c))
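The Haversine expression translates almost term by term into Python, which makes it easy to sanity-check. The function and parameter names below are illustrative; R = 6371000 m is the Earth radius used on the slide.

```python
import math

EARTH_RADIUS_M = 6371000  # same R as in the Flatline expression


def haversine_m(lat, lon, lat_ref, lon_ref):
    """Great-circle distance in meters between (lat, lon) and a reference point."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    lon_d = math.radians(lon - lon_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(lon_d / 2) ** 2)
    # min(1, ...) guards against rounding pushing sqrt(a) slightly above 1.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c


# A quarter of the equator: from longitude 0 to longitude 90 at latitude 0.
quarter_equator = haversine_m(0.0, 90.0, 0.0, 0.0)
print(quarter_equator)
```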
  • 64. WhizzML + Flatline (diagram): a WhizzML script from the gallery takes an INPUT DATASET plus LAT Ref and LONG Ref, applies the Haversine Flatline expression, and produces an OUTPUT DATASET.
  • 65. Feature Engineering: Fix Missing Values in a “Meaningful” Way. (Diagram: Original Dataset → Filter Zeros → Clean Dataset → Model insulin → Predict insulin → Amended Dataset → Select insulin → Fixed Dataset.)
      (if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
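The final "select" step of this workflow can be sketched in Python as follows. It assumes each row already carries a "predicted insulin" value produced by a model trained on the rows where insulin was not zero; the sample rows are invented.

```python
def fix_insulin(rows):
    """Replace zero insulin readings with the model's predicted value."""
    fixed = []
    for row in rows:
        value = row["predicted insulin"] if row["insulin"] == 0 else row["insulin"]
        fixed.append({**row, "insulin": value})
    return fixed


# Hypothetical amended dataset: original readings plus model predictions.
amended = [
    {"insulin": 0, "predicted insulin": 112.5},
    {"insulin": 94, "predicted insulin": 101.0},
]
fixed_rows = fix_insulin(amended)
print([row["insulin"] for row in fixed_rows])
```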
  • 66. FE Demo #7
  • 67. Feature Selection
  • 68. Feature Selection • Model Summary: Field Importance • Algorithmic: Best-First Feature Selection, Boruta • Leakage: Tight Correlations (AD, Plot, Correlations); Test Data; Perfect future knowledge
      cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
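The shell pipeline counts distinct lines that occur more than once across the two files, which flags test rows that also appear in the training data. A Python sketch of the same check follows; it works on lists of lines so it runs without the actual files, and the sample rows are invented. Note that if both files start with the same header line, the shell version counts that header as a duplicate too.

```python
from collections import Counter


def count_duplicated_lines(*line_groups):
    """Equivalent of `cat files | sort | uniq -d | wc -l` on lists of lines."""
    counts = Counter(line for group in line_groups for line in group)
    return sum(1 for n in counts.values() if n > 1)


def count_shared_rows(train_lines, test_lines):
    """Distinct data rows present in both sets, skipping the header lines."""
    return len(set(train_lines[1:]) & set(test_lines[1:]))


train = ["age,insulin,diabetes", "50,0,true", "31,94,false", "32,88,true"]
test = ["age,insulin,diabetes", "31,94,false", "21,168,false"]

duplicated = count_duplicated_lines(train, test)
shared = count_shared_rows(train, test)
print(duplicated, shared)
```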
  • 69. Feature Selection: Leakage • Sales pipeline where step n-1 has no other outcome than step n. • Stock close predicts stock open • Churn retention: the worst rep is actually the best (correlation != causation) • Cancer prediction where one input is a doctor-ordered test for the condition • Account ID predicts fraud (because only new accounts are fraudsters)
  • 70. Evaluate & Automate
  • 71. Evaluate & Automate • Evaluate • Did you meet the goal? • If not, did you discover something else useful? • If not, start over • If you did… • Automate - You don’t want to hand code that every time, right? • Consider tools that are easy to automate • Scripting interface • APIs • Ability to maintain is important
  • 72. The Process (flowchart): Define Goal → Data Transform → Feature Engineer & Selection → Model & Evaluate → Goal Met? If yes: Automate. If no: Tune Algorithm, Better Features, or Better Data, then try again; otherwise: Not Possible.
  • 73. Summary • Feature Engineering: what it is / why it is important • Automatic transformations: date-time, text, etc. • Built-in functions: filtering and feature engineering • Discretization / Normalization / etc. • Flatline: programmatic feature engineering / filtering • Structure • Examples: Adding fields / filtering • When building features it is important to watch for leakage • The critical importance of automating