Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 5: Basic Data Transformations and Feature Engineering. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
2. BigML, Inc 2
Basic Transformations
Making Data Machine Learning Ready
Poul Petersen
CIO, BigML, Inc
3. BigML, Inc 3Basic Transformations
In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
4. BigML, Inc 4Basic Transformations
The Dream
CSV → Dataset → Model → Profit!
5. BigML, Inc 5Basic Transformations
The Reality
CRM
Web Accounts
Transactions
ML Ready?
6. BigML, Inc 6Basic Transformations
Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Data Transformation
Feature Engineering
Feature Selection
7. BigML, Inc 7Basic Transformations
The Process
• Define a clear idea of the goal.
  • Sometimes this comes later…
• Understand what ML tasks will achieve the goal.
• Transform the data
  • where is it, how is it stored?
  • what are the features?
  • can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: Try it on a small scale
• Accept that you might have to start over…
• But when it works, automate it!!!!
9. BigML, Inc 9Basic Transformations
BigML Tasks
Goal | ML Task
Will this customer default on a loan? | Classification
How many customers will apply for a loan next month? | Regression
Is the consumption of this product unusual? | Anomaly Detection
Is the behavior of the customers similar? | Cluster Analysis
Are these products purchased together? | Association Discovery
15. BigML, Inc 15Basic Transformations
ML Ready Data
[Diagram: a table whose rows are Instances and whose columns are Fields (Features)]
Tabular Data (rows and columns):
• Each row
  • is one instance.
  • contains all the information about that one instance.
• Each column
  • is a field that describes a property of the instance.
16. BigML, Inc 16Basic Transformations
Data Labeling
Unsupervised Learning
• Anomaly Detection
• Clustering
• Association Discovery
Supervised Learning
• Classification
• Regression
The only difference, in terms of ML-Ready structure, is the presence of a "label".
17. BigML, Inc 17Basic Transformations
Data Labeling
Data is often not labeled
Create labels with a transformation
Unlabeled data
Name | Month - 3 | Month - 2 | Month - 1
Joe Schmo | 123.23 | 0 | 0
Jane Plain | 0 | 0 | 0
Mary Happy | 0 | 55.22 | 243.33
Tom Thumb | 12.34 | 8.34 | 14.56
Labeled data
Name | Month - 3 | Month - 2 | Month - 1 | Default
Joe Schmo | 123.23 | 0 | 0 | FALSE
Jane Plain | 0 | 0 | 0 | TRUE
Mary Happy | 0 | 55.22 | 243.33 | FALSE
Tom Thumb | 12.34 | 8.34 | 14.56 | FALSE
Can be done at Feature Engineering step as well
18. BigML, Inc 18Basic Transformations
SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
create database sf_restaurants;
use sf_restaurants;
-- One table per CSV file; each load skips the header row.
create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table violations (business_id int, vdate varchar(8), description varchar(1000));
load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
20. BigML, Inc 20Basic Transformations
Data Cleaning
Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
Original data
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | - | Rock | 139
Blues alive | 1990/03/01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 5:32s | Techno | 42
Dance, dance | 02/23/1983 | 312 | Disco | N/A
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 4 minutes | Techno | 895
The alchemist | 2001-11-21 | 418 | Bluesss | 178
Bring me down | 18-10-98 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734
Cleaned data
Name | Date | Duration (s) | Genre | Plays
Highway star | 1984-05-24 | | Rock | 139
Blues alive | 1990-03-01 | 281 | Blues | 239
Lonely planet | 2002-11-19 | 332 | Techno | 42
Dance, dance | 1983-02-23 | 312 | Disco |
The wall | 1943-01-20 | 218 | Reagge | 83
Offside down | 1965-02-19 | 240 | Techno | 895
The alchemist | 2001-11-21 | 418 | Blues | 178
Bring me down | 1998-10-18 | 328 | Classic | 21
The scarecrow | 1994-10-12 | 269 | Rock | 734
update violations set description = substr(description, 1, instr(description, ' [ date violation corrected:') - 1) where instr(description, ' [ date violation corrected:') > 0;
22. BigML, Inc 22Basic Transformations
Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Based on business profile:
  • Description: kitchen, cafe, etc.
  • Location: zip, latitude, longitude
23. BigML, Inc 23Basic Transformations
Denormalizing
Data is usually normalized in relational databases; ML-Ready datasets need the information denormalized into a single dataset.
[Diagram: the business, inspections, violations, and scores tables combined with a join into one table of Instances (millions) by Features]
create table scores select * from businesses left join inspections using (business_id);
-- Keep only the most recent inspection for each business.
create table scores_last select a.* from scores as a join (select business_id, max(idate) as idate from scores group by business_id) as b on a.business_id = b.business_id and a.idate = b.idate;
Denormalize
ML-Ready: Each row contains all the information about that one instance.
create table scores_last_label select scores_last.*, Description as score_label from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
Add Label
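As a rough illustration of the denormalize and add-label steps above, the sketch below runs equivalent queries with Python's standard-library sqlite3 module instead of MySQL. The table and column names follow the slides; the sample rows and the score ranges in legend are made up for the example.

```python
import sqlite3

def make_sample_db():
    """Create an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE businesses (business_id INT, name TEXT);
        CREATE TABLE inspections (business_id INT, score INT, idate TEXT);
        CREATE TABLE legend (Minimum_Score INT, Maximum_Score INT, Description TEXT);
        INSERT INTO businesses VALUES (1, 'Cafe A'), (2, 'Kitchen B');
        INSERT INTO inspections VALUES
            (1, 80, '20130101'), (1, 95, '20140101'), (2, 60, '20131201');
        -- Hypothetical score ranges, not the real SF legend.
        INSERT INTO legend VALUES
            (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
            (86, 90, 'Adequate'), (91, 100, 'Good');
    """)
    return conn

def build_labeled_scores(conn):
    """Denormalize, keep the latest inspection per business, attach a label."""
    cur = conn.cursor()
    # Denormalize: one row per (business, inspection) pair.
    cur.execute("""
        CREATE TABLE scores AS
        SELECT b.business_id, b.name, i.score, i.idate
        FROM businesses AS b
        LEFT JOIN inspections AS i ON b.business_id = i.business_id
    """)
    # Keep only the most recent inspection of each business.
    cur.execute("""
        CREATE TABLE scores_last AS
        SELECT a.* FROM scores AS a
        JOIN (SELECT business_id, MAX(idate) AS idate
              FROM scores GROUP BY business_id) AS b
          ON a.business_id = b.business_id AND a.idate = b.idate
    """)
    # Add the label by matching the score against the legend ranges.
    cur.execute("""
        SELECT s.name, s.score, l.Description
        FROM scores_last AS s
        JOIN legend AS l
          ON s.score BETWEEN l.Minimum_Score AND l.Maximum_Score
        ORDER BY s.name
    """)
    return cur.fetchall()

labeled = build_labeled_scores(make_sample_db())
```

Each returned row carries everything known about one business plus its label, which is the ML-Ready shape the slide describes.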
25. BigML, Inc 25Basic Transformations
Structuring Output
After all the data transformations, a CSV (Comma-Separated Values) file has to be generated, following the rules below:
• A CSV file uses plain text to store tabular data.
• In a CSV file, each row of the file is an instance.
• Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
• Each row must contain the same number of fields
  • but they can be null
• Fields can be quoted using double quotes (").
• Fields that contain commas or line separators must be quoted.
• Quotes (") in fields must be doubled ("").
• The character encoding must be UTF-8
• Optionally, a CSV file can use the first line as a header to provide the names of each field.
-- Note: by default MySQL writes tab-separated output; add
-- fields terminated by ',' optionally enclosed by '"' to get comma-separated fields.
select * from scores_last_label into outfile "./scores_last_label.csv";
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, score_label from scores_last_label into outfile "./scores_last_label_headers.csv";
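The CSV rules above (quoting fields that contain commas, doubling embedded quotes, an optional header, UTF-8 encoding) can be sketched with Python's standard csv module. The field names and sample values here are invented for illustration.

```python
import csv
import io

def write_csv(header, rows):
    """Serialize rows to CSV text, header first, quoting only where needed."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

header = ["name", "city", "violation_txt"]
rows = [
    # Contains a comma and embedded double quotes, so both fields get quoted.
    ["Joe's Cafe, Inc", "San Francisco", 'Sign says "closed"'],
    # None is written as an empty (null) field; the row keeps three fields.
    ["Kitchen B", "San Francisco", None],
]
csv_text = write_csv(header, rows)
csv_bytes = csv_text.encode("utf-8")  # the encoding required by the rules above
```

Reading csv_text back with csv.reader recovers the original values, including the comma and the quotes.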
27. BigML, Inc 27Basic Transformations
Define a Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations
28. BigML, Inc 28Basic Transformations
Aggregating
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
Original data (list of playbacks)
Content | Genre | Duration | Play Time | User | Device
Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV
Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV
Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet
The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone
Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet
The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone
Aggregated data (list of users)
User | Num.Playbacks | Total Time | Pref.Device
User001 | 3 | 830 | Tablet
User002 | 1 | 218 | Smartphone
User003 | 3 | 1019 | TV
User005 | 2 | 521 | Tablet
tail -n+2 playlist.csv | cut -d',' -f5 | sort | uniq -c
tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
-- Raise the group_concat limit first so long violation texts are not truncated.
SET @@group_concat_max_len = 15000;
create table violations_aggregated select business_id, count(*) as violation_num, group_concat(description) as violation_txt from violations group by business_id;
create table scores_last_label_violations select * from scores_last_label left join violations_aggregated USING (business_id);
select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'violation_num', 'violation_txt', 'score_label' UNION select name, address, city, state, postal_code, latitude, longitude, violation_num, violation_txt, score_label from scores_last_label_violations into outfile "./scores_last_label_violations_headers.csv";
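The playback-to-user aggregation on this slide can be sketched in plain Python. The rows are the ones shown in the playback table; treating the preferred device as the most frequently used one is an assumption that happens to match the aggregated table.

```python
from collections import Counter, defaultdict

# (content, genre, duration_s, play_time, user, device), copied from the slide.
PLAYBACKS = [
    ("Highway star", "Rock", 190, "2015-05-12 16:29:33", "User001", "TV"),
    ("Blues alive", "Blues", 281, "2015-05-13 12:31:21", "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "2015-05-13 14:26:04", "User003", "TV"),
    ("Dance, dance", "Disco", 312, "2015-05-13 18:12:45", "User001", "Tablet"),
    ("The wall", "Reagge", 218, "2015-05-14 09:02:55", "User002", "Smartphone"),
    ("Offside down", "Techno", 240, "2015-05-14 11:26:32", "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "2015-05-14 21:44:15", "User003", "TV"),
    ("Bring me down", "Classic", 328, "2015-05-15 06:59:56", "User001", "Tablet"),
    ("The scarecrow", "Rock", 269, "2015-05-15 12:37:05", "User003", "Smartphone"),
]

def aggregate_by_user(playbacks):
    """Return {user: (num_playbacks, total_time, preferred_device)}."""
    counts = Counter()
    totals = defaultdict(int)
    devices = defaultdict(Counter)
    for _content, _genre, duration, _played_at, user, device in playbacks:
        counts[user] += 1
        totals[user] += duration
        devices[user][device] += 1
    return {
        # Assumption: the preferred device is the one used most often.
        user: (counts[user], totals[user], devices[user].most_common(1)[0][0])
        for user in sorted(counts)
    }

users = aggregate_by_user(PLAYBACKS)
```

The result has one entry per user rather than one per playback, so the user becomes the instance to model.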
30. BigML, Inc 30Basic Transformations
Pivoting
Different values of a feature are pivoted to new columns in the result dataset.
Original data
Content | Genre | Duration | Play Time | User | Device
Highway star | Rock | 190 | 2015-05-12 16:29:33 | User001 | TV
Blues alive | Blues | 281 | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno | 332 | 2015-05-13 14:26:04 | User003 | TV
Dance, dance | Disco | 312 | 2015-05-13 18:12:45 | User001 | Tablet
The wall | Reagge | 218 | 2015-05-14 09:02:55 | User002 | Smartphone
Offside down | Techno | 240 | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues | 418 | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328 | 2015-05-15 06:59:56 | User001 | Tablet
The scarecrow | Rock | 269 | 2015-05-15 12:37:05 | User003 | Smartphone
Aggregated data with pivoted columns
User | Num.Playbacks | Total Time | Pref.Device | NP_TV | NP_Tablet | NP_Smartphone | TT_TV | TT_Tablet | TT_Smartphone
User001 | 3 | 830 | Tablet | 1 | 2 | 0 | 190 | 640 | 0
User002 | 1 | 218 | Smartphone | 0 | 0 | 1 | 0 | 0 | 218
User003 | 3 | 1019 | TV | 2 | 0 | 1 | 750 | 0 | 269
User005 | 2 | 521 | Tablet | 0 | 2 | 0 | 0 | 521 | 0
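A minimal sketch of the pivot shown above: each device value becomes its own pair of columns, NP_<device> (number of playbacks) and TT_<device> (total time). Only the user, device, and duration columns from the slide's table are kept, to keep the example short.

```python
DEVICES = ["TV", "Tablet", "Smartphone"]

# (user, device, duration_s) taken from the playback table on the slide.
PLAYS = [
    ("User001", "TV", 190), ("User005", "Tablet", 281), ("User003", "TV", 332),
    ("User001", "Tablet", 312), ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240), ("User003", "TV", 418),
    ("User001", "Tablet", 328), ("User003", "Smartphone", 269),
]

def pivot_by_device(plays):
    """Return {user: {"NP_<device>": count, "TT_<device>": total seconds}}."""
    rows = {}
    for user, device, duration in plays:
        # Start every user with a zero in each pivoted column.
        row = rows.setdefault(
            user,
            {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in DEVICES},
        )
        row[f"NP_{device}"] += 1
        row[f"TT_{device}"] += duration
    return rows

pivoted = pivot_by_device(PLAYS)
```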
31. BigML, Inc 31Basic Transformations
Time Windows
Create new features using values over different periods of time
[Diagram: Instances by Features tables stacked along a Time axis at t=1, t=2, t=3; size labels "(millions)" and "(thousands)"]
create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013 from inspections as a join (select business_id, max(idate) as idate from inspections where substr(idate,1,4) = "2013" group by business_id) as b on a.business_id = b.business_id and a.idate = b.idate;
-- scores_2014 is not shown on the slide; presumably it is built the same way with "2014".
create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
33. BigML, Inc 33Basic Transformations
Updates
Need a current view of the data, but new data only comes in batches of changes
[Diagram: batches arriving on day 1, day 2, day 3 are applied to the Instances by Features table]
34. BigML, Inc 34Basic Transformations
Streaming
Data only comes in single changes
[Diagram: a data stream (kafka, etc) feeding the Instances by Features table; labels "Stream" and "Batch"]
35. BigML, Inc 35Basic Transformations
Prosper Loan Life Cycle
[Diagram states. Listings: Submit, Bids, Funded, Cancelled, Withdraw, Expired. Loans: Current, Late, Paid, Defaulted.]
Goal | ML Task
Q: Which new listings make it to funded? | Classification
Q: Which funded loans make it to paid? | Classification
Q: If funded, what will be the rate? | Regression
36. BigML, Inc 36Basic Transformations
Prosper Example
Data Provided in XML
[Pipeline diagram labels: fetch.sh ("curl", daily), XML, import.py, export.sh, bigml.sh, Model, Predict, Share in gallery. Annotations: updates!!; Denormalization with join; fields Status, LoanStatus, BorrowerRate]
37. BigML, Inc 37Basic Transformations
Prosper Example
Tidbits and Lessons Learned…
• XML… yuck!
• MongoDB has CSV export and is record based, so it is easy to handle changing data structure.
• Feature Engineering
  • There are 5 different classes of "bad" loans
  • Date cleanup
  • Type casting: floats and ints
• Would be better to track over time
  • number of late payments
  • compare predictions and actuals
• XML… yuck!
40. BigML, Inc 40Basic Transformations
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example
41. BigML, Inc 41Basic Transformations
Summary
• Data is awful
  • Requires clean-up
  • Transformations
  • Consumes an enormous part of the effort in applying ML
• Techniques:
  • Denormalizing
  • Aggregating / Pivoting
  • Time windows / Streaming
• What a real Workflow looks like and the tools required
42. BigML, Inc 2
Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc
43. BigML, Inc 3Feature Engineering
What is Feature Engineering?
• This is really, really important - more than algorithm selection!
  • In fact, so important that BigML often does it automatically
• ML Algorithms have no deeper understanding of data
  • Numerical: have a natural order, can be scaled, etc.
  • Categorical: have discrete values, etc.
  • The "magic" is the ability to find patterns quickly and efficiently
• ML Algorithms only know what you tell/show them with data
  • Medical: Kg and M, but BMI = Kg/M^2 is better
  • Lending: Debt and Income, but DTI (debt-to-income ratio) is better
• Intuition can be risky: remember to prove it with an evaluation!
Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
44. BigML, Inc 4Feature Engineering
Built-in Transformations
Date-Time Fields
DATE-TIME: 2013-09-25 10:02
… | year | month | day | hour | minute | …
… | 2013 | Sep | 25 | 10 | 2 | …
… | NUM | CAT | NUM | NUM | NUM | …
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
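The date-time split above can be sketched with Python's datetime module. The function name and the output keys are illustrative choices that mirror the slide's columns; this is not how BigML implements the transformation.

```python
from datetime import datetime

# Fixed abbreviations, so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def split_datetime(text, fmt="%Y-%m-%d %H:%M"):
    """Unpack one date-time string into separate component features."""
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,                 # numeric
        "month": MONTHS[moment.month - 1],   # categorical
        "day": moment.day,                   # numeric
        "hour": moment.hour,                 # numeric
        "minute": moment.minute,             # numeric
    }

components = split_datetime("2013-09-25 10:02")
```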
45. BigML, Inc 5Feature Engineering
Built-in Transformations
Categorical Fields for Clustering/LR
Input field (CAT)
… | alchemy_category | …
… | business | …
… | recreation | …
… | health | …
Encoded fields (NUM, NUM, NUM)
… | business | health | recreation | …
… | 1 | 0 | 0 | …
… | 0 | 0 | 1 | …
… | 0 | 1 | 0 | …
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
*Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
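The table above corresponds to a one-hot encoding, sketched below in plain Python. This only illustrates the idea; as the note says, BigML's own handling differs by algorithm (k-prototypes for clustering, configurable encodings for LR).

```python
def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

columns, encoded = one_hot(["business", "recreation", "health"])
```

Here columns lists the new numeric fields in alphabetical order, and each row of encoded has a single 1 marking the original category.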
46. BigML, Inc 6Feature Engineering
Built-in Transformations
Text Fields
TEXT: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
… | great | afraid | born | achieve | …
… | 4 | 1 | 1 | 1 | …
… | NUM | NUM | NUM | NUM | …
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
Help ML to Work Better
When text is not actually unstructured
TEXT:
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
title (TEXT) | body (TEXT)
Breaking News… | news covering…
… | …
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
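A small sketch of extracting that structure: the JSON text field is parsed and its title and body values become separate text features. The function name and the whitespace stripping are illustrative choices.

```python
import json

RAW = (
    '{"url": "cbsnews", '
    '"title": "Breaking News Headlines Business Entertainment World News ", '
    '"body": " news covering all the latest breaking national and world news '
    'headlines, including politics, sports, entertainment, business and more."}'
)

def extract_fields(raw, keys=("title", "body")):
    """Parse a JSON text field and return the chosen keys as new features."""
    record = json.loads(raw)
    # Missing keys become empty strings so every row has the same fields.
    return {key: record.get(key, "").strip() for key in keys}

features = extract_fields(RAW)
```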
49. BigML, Inc 9Feature Engineering
Help ML to Work at all
When the pattern does not exist
Goal: Predict principal direction from highway number
Highway Number | Direction | Is Long
2 | East-West | FALSE
4 | East-West | FALSE
5 | North-South | TRUE
8 | East-West | FALSE
10 | East-West | TRUE
… | … | …
( = (mod (field "Highway Number") 2) 0)
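The Flatline expression above computes whether the highway number is even. A Python equivalent, checked against the rows on the slide, might look like this; the function names are illustrative.

```python
# (highway number, direction) rows from the slide.
HIGHWAYS = [(2, "East-West"), (4, "East-West"), (5, "North-South"),
            (8, "East-West"), (10, "East-West")]

def is_even(highway_number):
    """Engineered feature: same test as (= (mod (field "Highway Number") 2) 0)."""
    return highway_number % 2 == 0

def predict_direction(highway_number):
    """With the parity feature, the direction follows from a single split."""
    return "East-West" if is_even(highway_number) else "North-South"

predictions = [predict_direction(number) for number, _ in HIGHWAYS]
```

The raw number gives a model no usable ordering to split on, while the parity feature separates the two directions in these rows exactly.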
51. BigML, Inc 11Feature Engineering
Feature Engineering
Discretization
Total Spend (NUM) | Spend Category (CAT)
7,342.99 | Top 33%
304.12 | Bottom 33%
4.56 | Bottom 33%
345.87 | Middle 33%
8,546.32 | Top 33%
With the numeric field: "Predict will spend $3,521 with error $1,232"
With the categorical field: "Predict customer will be Top 33% in spending"
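One way to sketch this discretization is to rank each value among all values and cut the ranks into thirds. The midpoint percentile rule used here is an assumption chosen for the example; the slide does not say how the cut points are computed.

```python
def discretize_terciles(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    ordered = sorted(values)
    labels = []
    for value in values:
        # Midpoint percentile rank in (0, 1); tied values share the lowest rank.
        percentile = (ordered.index(value) + 0.5) / len(ordered)
        if percentile < 1 / 3:
            labels.append("Bottom 33%")
        elif percentile < 2 / 3:
            labels.append("Middle 33%")
        else:
            labels.append("Top 33%")
    return labels

total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = discretize_terciles(total_spend)
```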
53. BigML, Inc 13Feature Engineering
Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc.
• Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc.
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
  • Types: recomputes field types. Ex: #classes > 1000
  • Preferred: recomputes preferred status
54. BigML, Inc 14Feature Engineering
Flatline Add Fields
Computing with Existing Features
(/ (field "Debt") (field "Income"))
Debt (NUM) | Income (NUM) | Debt to Income Ratio = Debt / Income (NUM)
10,134 | 100,000 | 0.10
85,234 | 134,000 | 0.64
8,112 | 21,500 | 0.38
0 | 45,900 | 0
17,534 | 52,000 | 0.34
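For comparison, the same computed field can be sketched in Python using the Debt and Income values from the table. Returning None for a zero income is an assumption added for the example; the slide does not cover that case.

```python
# (debt, income) pairs from the slide.
ROWS = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]

def debt_to_income(debt, income):
    """Same computation as (/ (field "Debt") (field "Income")), rounded to 2 places."""
    if income == 0:
        return None  # assumption: treat the ratio as missing, not as an error
    return round(debt / income, 2)

ratios = [debt_to_income(debt, income) for debt, income in ROWS]
```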
56. BigML, Inc 16Feature Engineering
What is Flatline?
Flatline: a domain-specific language for feature engineering and programmatic filtering
• DSL:
  • Invented by BigML - Programmatic / Optimized for speed
  • Transforms datasets into new datasets
    • Adding new fields / Filtering
  • Transformations are written in lisp-style syntax
• Feature Engineering
  • Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
  • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.
57. BigML, Inc 17Feature Engineering
Flatline
• Lisp style syntax: Operators come first
  • Correct: (+ 1 2) => NOT Correct: (1 + 2)
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Limited programming language structures
  • let, cond, if, map, list operators, * / + -, etc.
• Built-in transformations
  • statistics, strings, timestamps, windows
58. BigML, Inc 18Feature Engineering
Flatline s-expressions
Adding Simple Labels to Data
Define "default" as missing three payments in a row
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
Unlabeled data
Name | Month - 3 | Month - 2 | Month - 1
Joe Schmo | 123.23 | 0 | 0
Jane Plain | 0 | 0 | 0
Mary Happy | 0 | 55.22 | 243.33
Tom Thumb | 12.34 | 8.34 | 14.56
Labeled data
Name | Month - 3 | Month - 2 | Month - 1 | Default
Joe Schmo | 123.23 | 0 | 0 | FALSE
Jane Plain | 0 | 0 | 0 | TRUE
Mary Happy | 0 | 55.22 | 243.33 | FALSE
Tom Thumb | 12.34 | 8.34 | 14.56 | FALSE
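The labeling rule on this slide, a customer defaults when the absolute payments of the last three months sum to zero, can be sketched in Python with the four rows from the table. The function and variable names are illustrative.

```python
# name -> payments for Month - 3, Month - 2, Month - 1 (from the slide).
PAYMENTS = {
    "Joe Schmo": (123.23, 0, 0),
    "Jane Plain": (0, 0, 0),
    "Mary Happy": (0, 55.22, 243.33),
    "Tom Thumb": (12.34, 8.34, 14.56),
}

def is_default(payments):
    """True when every payment is zero, i.e. the absolute values sum to 0."""
    return sum(abs(amount) for amount in payments) == 0

labels = {name: is_default(months) for name, months in PAYMENTS.items()}
```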
65. BigML, Inc 25Feature Engineering
Feature Engineering
Fix Missing Values in a "Meaningful" Way
[Workflow diagram. Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin. Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset]
( if ( = (field "insulin") 0) (field "predicted insulin") (field "insulin"))
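The Flatline expression above keeps the recorded insulin value unless it is zero, in which case it falls back to the model's prediction. A Python sketch of that final selection step follows; the sample rows and predicted values are invented for illustration.

```python
# Invented rows: a 0 in "insulin" stands for a missing measurement.
ROWS = [
    {"insulin": 94, "predicted insulin": 101.5},
    {"insulin": 0, "predicted insulin": 130.2},
    {"insulin": 168, "predicted insulin": 155.0},
]

def fix_insulin(row):
    """Use the prediction only where the original value is zero."""
    if row["insulin"] == 0:
        return row["predicted insulin"]
    return row["insulin"]

fixed = [fix_insulin(row) for row in ROWS]
```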
69. BigML, Inc 29Feature Engineering
Feature Selection
Leakage
• Sales pipeline where step n-1 has no other outcome than step n.
• Stock close predicts stock open
• Churn retention: the worst rep is actually the best (correlation != causation)
• Cancer prediction where one input is a doctor-ordered test for the condition
• Account ID predicts fraud (because only new accounts are fraudsters)
71. BigML, Inc 31Feature Engineering
Evaluate & Automate
• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don't want to hand code that every time, right?
  • Consider tools that are easy to automate
    • Scripting interface
    • APIs
  • Ability to maintain is important
72. BigML, Inc 32Feature Engineering
The Process
[Flowchart: Define Goal, Data Transform, Feature Engineer & Selection, Model & Evaluate, Goal Met? If yes: Automate. If no: Better Data, Better Features, Tune Algorithm, or Not Possible]
73. BigML, Inc 33Feature Engineering
Summary
• Feature Engineering: what it is / why it is important
• Automatic transformations: date-time, text, etc.
• Built-in functions: filtering and feature engineering
  • Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
  • Structure
  • Examples: Adding fields / filtering
• When building features it is important to watch for leakage
• The critical importance of automating