Introduction of FeatureHashing
Create a Model Matrix via Feature Hashing with a Formula Interface
Wush Wu
Taiwan R User Group
Wush Chi-Hsuan Wu
Ph.D. Student
Cofounder of Taiwan R User Group
· Display Advertising and Real Time Bidding
· Large-Scale Machine Learning
· Machine Learning and Data Mining Monday
· R Workshop in Data Science Conference in Taiwan, 2014 and 2015
2/66
source: http://www.nbcnews.com
Taiwan, a Mountainous Island
· Highest Mountain: Yushan (3,952 m)
3/66
Taipei City and Taipei 101
source: https://c2.staticflickr.com/6/5208/5231215951_c0e0036b17_b.jpg
4/66
Taiwan R User Group
http://www.meetup.com/Taiwan-R
5/66
An Example of FeatureHashing
6/66
Sentiment Analysis
· Provided by Lewis Crouch
· Shows the pros of using FeatureHashing
7/66
Dataset: IMDB
· Predict the rating of a movie from the text of its review.
· Training dataset: 25,000 reviews.
· Binary response: positive or negative.
## [1] 1
· Cleaned review:
## [1] "kurosawa is a proved humanitarian this movie is totally about people living in"
## [2] "poverty you will see nothing but angry in this movie it makes you feel bad but"
## [3] "still worth all those who s too comfortable with materialization should spend 2"
## [4] "5 hours with this movie"
8/66
Word Segmentation
## [1] "" "kurosawa" "is"
## [4] "a" "proved" "humanitarian"
## [7] "" "this" "movie"
## [10] "is" "totally" "about"
## [13] "people" "living" "in"
## [16] "poverty" "" "you"
## [19] "will" "see" "nothing"
## [22] "but" "angry" "in"
## [25] "this" "movie" ""
## [28] "it" "makes" "you"
## [31] "feel" "bad" "but"
## [34] "still" "worth" ""
## [37] "all" "those" "who"
## [40] "s" "too" "comfortable"
## [43] "with" "materialization" "should"
## [46] "spend" "2" "5"
## [49] "hours" "with" "this"
## [52] "movie" ""
9/66
Word Segmentation and Feature Extraction
ID MESSAGE HATE BAD LIKE GRATUITOUS
"5814_8" TRUE TRUE TRUE TRUE FALSE
"2381_9" FALSE FALSE FALSE TRUE FALSE
"7759_3" FALSE FALSE FALSE TRUE FALSE
"3630_4" FALSE FALSE FALSE FALSE FALSE
"9495_8" FALSE FALSE FALSE FALSE TRUE
"8196_8" FALSE FALSE TRUE TRUE TRUE
10/66
FeatureHashing
· FeatureHashing implements split in the formula interface.
hash_size <- 2^16
m.train <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"),
data = imdb.train,
hash.size = hash_size,
signed.hash = FALSE)
m.valid <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"),
data = imdb.valid,
hash.size = hash_size,
signed.hash = FALSE)
11/66
Type of split
· existence
· count
· tf-idf, contributed by Michaël Benesty, to be announced in v0.9.1
12/66
Gradient Boosted Decision Tree with xgboost
dtrain <- xgb.DMatrix(m.train, label = imdb.train$sentiment)
dvalid <- xgb.DMatrix(m.valid, label = imdb.valid$sentiment)
watch <- list(train = dtrain, valid = dvalid)
g <- xgb.train(booster = "gblinear", nrounds = 100, eta = 0.0001, max.depth = 2,
data = dtrain, objective = "binary:logistic",
watchlist = watch, eval_metric = "auc")
[0] train-auc:0.969895 valid-auc:0.914488
[1] train-auc:0.969982 valid-auc:0.914621
[2] train-auc:0.970069 valid-auc:0.914766
...
[97] train-auc:0.975616 valid-auc:0.922895
[98] train-auc:0.975658 valid-auc:0.922952
[99] train-auc:0.975700 valid-auc:0.923014
13/66
Performance
· The AUC of g is 0.90110 on the public leaderboard.
- It outperforms the benchmark in Kaggle.
14/66
The Purpose of FeatureHashing
· Make our life easier.
15/66
Formula Interface in R
16/66
Algorithm in the Textbook
Regression
Real Data
SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
y = Xβ + ε
· How do we convert the real data to X?
17/66
Feature Vectorization and Formula Interface
(INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA
1 1 5.1 3.5 1.4 0.2 0 0
2 1 4.9 3.0 1.4 0.2 0 0
51 1 7.0 3.2 4.7 1.4 1 0
52 1 6.4 3.2 4.5 1.5 1 0
101 1 6.3 3.3 6.0 2.5 0 1
102 1 5.8 2.7 5.1 1.9 0 1
· X is usually constructed via model.matrix in R.
model.matrix(~ ., iris.demo)
18/66
Formula Interface
· y ~ a + b
- y is the response
- a and b are predictors
19/66
Formula Interface: +
(INTERCEPT) A B
1 1 5.1 3.5
2 1 4.9 3.0
51 1 7.0 3.2
52 1 6.4 3.2
101 1 6.3 3.3
102 1 5.8 2.7
· + is the operator for combining linear predictors.
model.matrix(~ a + b, data.demo)
20/66
Formula Interface: :
(INTERCEPT) A B A:B
1 1 5.1 3.5 17.85
2 1 4.9 3.0 14.70
51 1 7.0 3.2 22.40
52 1 6.4 3.2 20.48
101 1 6.3 3.3 20.79
102 1 5.8 2.7 15.66
· : is the interaction operator.
model.matrix(~ a + b + a:b, data.demo)
21/66
Formula Interface: *
(INTERCEPT) A B A:B
1 1 5.1 3.5 17.85
2 1 4.9 3.0 14.70
51 1 7.0 3.2 22.40
52 1 6.4 3.2 20.48
101 1 6.3 3.3 20.79
102 1 5.8 2.7 15.66
· * is the cross product operator.
# a + b + a:b
model.matrix(~ a * b, data.demo)
22/66
Formula Interface: (
(INTERCEPT) A:C B:C
1 1 7.14 4.90
2 1 6.86 4.20
51 1 32.90 15.04
52 1 28.80 14.40
101 1 37.80 19.80
102 1 29.58 13.77
· : and * are distributive over +.
# a:c + b:c
model.matrix(~ (a + b):c, data.demo)
23/66
Formula Interface: .
(INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA
1 1 5.1 3.5 1.4 0.2 0 0
2 1 4.9 3.0 1.4 0.2 0 0
51 1 7.0 3.2 4.7 1.4 1 0
52 1 6.4 3.2 4.5 1.5 1 0
101 1 6.3 3.3 6.0 2.5 0 1
102 1 5.8 2.7 5.1 1.9 0 1
· . means all columns of the data.
# ~ Sepal.Length + Sepal.Width + Petal.Length +
# Petal.Width + Species
model.matrix(~ ., iris.demo)
24/66
· Please check ?formula.
25/66
Categorical Features
26/66
Categorical Feature in R
Dummy Coding
· A categorical variable of K categories is transformed to a (K − 1)-dimensional vector.
· There are many coding systems; the most commonly used is Dummy Coding.
- The first category is transformed to the zero vector 0⃗.
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
27/66
Categorical Feature in Machine Learning
· Predictive analysis
· Regularization
· A categorical variable of K categories is transformed to a K-dimensional vector.
- Missing data are transformed to the zero vector 0⃗.
contr.treatment(levels(iris.demo$Species), contrasts = FALSE)
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
28/66
Motivation of the FeatureHashing
29/66
Kaggle: Display Advertising Challenge
· Given a user and the page he is visiting, what is the probability that he will click on a given ad?
30/66
The Size of the Dataset
· 13 integer features and 26 categorical features
· 7 × 10^7 instances
· Download the dataset via this link
31/66
Vectorize These Features in R
· After binning the integer features, I got 3 × 10^7 categories.
· A sparse matrix was required:
- A dense matrix would require about 14 PB of memory...
- A sparse matrix requires about 40 GB of memory...
32/66
Estimating Computing Resources
· Dense matrix: nrow (7 × 10^7) × ncol (3 × 10^7) × 8 bytes
· Sparse matrix: nrow (7 × 10^7) × (13 + 26) × (4 + 4 + 8) bytes
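These estimates can be sanity-checked with a quick back-of-the-envelope computation in base R. The 39 non-zero entries per row come from the 13 integer plus 26 categorical features; the 4 + 4 + 8 bytes are assumed to be a triplet-format row index, column index, and double value:

```r
n_rows <- 7e7  # instances
n_cols <- 3e7  # categories after binning

# Dense: one 8-byte double per cell
dense_bytes <- n_rows * n_cols * 8

# Sparse (triplet format): 39 non-zeros per row, each a
# 4-byte row index + 4-byte column index + 8-byte value
sparse_bytes <- n_rows * (13 + 26) * (4 + 4 + 8)

dense_bytes / 2^50   # petabytes: about 14.9
sparse_bytes / 2^30  # gigabytes: about 40.7
```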
33/66
Vectorize These Features in R
· sparse.model.matrix is similar to model.matrix but returns a sparse matrix.
34/66
Fit Model with Limited Memory
· The post "Beat the benchmark with less than 200MB of memory" describes how to fit a model with limited resources:
- online logistic regression
- the feature hashing trick
- an adaptive learning rate
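A minimal sketch of how these three ingredients fit together, in base R. All names here (hash_feature, update_w) are hypothetical: the byte-sum hash is only a stand-in for the MurmurHash3 that FeatureHashing actually uses, and the learning rate follows an AdaGrad-style schedule:

```r
hash_size <- 2^16
w  <- numeric(hash_size)  # model weights
g2 <- numeric(hash_size)  # accumulated squared gradients (AdaGrad-style)

# Hypothetical toy hash mapping a feature string to a bucket
# (stand-in for the MurmurHash3 used by FeatureHashing)
hash_feature <- function(x) sum(as.integer(charToRaw(x))) %% hash_size + 1

# One online update with logistic loss and an adaptive learning rate
update_w <- function(features, y, eta = 0.1) {
  idx  <- vapply(features, hash_feature, numeric(1))
  p    <- 1 / (1 + exp(-sum(w[idx])))  # predicted probability
  grad <- p - y                        # gradient for each active feature
  g2[idx] <<- g2[idx] + grad^2
  w[idx]  <<- w[idx] - eta * grad / sqrt(g2[idx] + 1e-8)
  p
}

p1 <- update_w(c("ad=3", "url=foo.com"), y = 1)  # first prediction: 0.5
```

Each update touches only the weights of the active features, so memory stays proportional to the hash size, not the number of instances.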
35/66
Online Machine Learning
· In online learning, the model parameters are updated after the arrival of every new datapoint.
- Only the aggregated information and the incoming datapoint are used.
· In batch learning, the model parameters are updated after access to the entire dataset.
36/66
Memory Requirement of Online Learning
· Proportional to the number of model parameters.
- Requires little memory even for a large number of instances.
37/66
Limitation of model.matrix
· The vectorization requires knowing all categories in advance.
contr.treatment(levels(iris$Species))
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
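The limitation can be seen with base R alone: a model matrix built from one chunk of data has no column for a category that chunk never contained (hypothetical two-row example):

```r
# A chunk of data that happens to contain only two species
chunk <- data.frame(Species = factor(c("setosa", "versicolor")))

m <- model.matrix(~ Species, chunk)
colnames(m)
# "(Intercept)" "Speciesversicolor" -- no column for virginica,
# so a later row with Species == "virginica" cannot be encoded
```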
38/66
My Workaround
· Scan all the data to retrieve all categories
· Vectorize features in an online fashion
· The overhead of exploring features increases
39/66
Observations
· Mapping features to {0, 1, 2, . . . , K} is one method of vectorizing a feature.
- setosa => e1⃗
- versicolor => e2⃗
- virginica => e3⃗
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
40/66
Observations
· contr.treatment ranks all categories to map each feature to an integer.
· What if we do not know all categories?
- Digital features are integers.
- We use a function that maps ℤ to {0, 1, 2, . . . , K}.
charToRaw("setosa")
## [1] 73 65 74 6f 73 61
41/66
What is Feature Hashing?
· A method of feature vectorization that uses a hash function.
· For example, the modulo operator %% gives a family of hash functions.
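As a toy illustration of such a hash function (base R only; FeatureHashing itself uses MurmurHash3), a string can be mapped to one of K buckets by summing its raw bytes and taking the remainder:

```r
# Toy hash: sum of a string's raw bytes, modulo the hash size.
# This is only an illustration, not the MurmurHash3 used by FeatureHashing.
hash_bucket <- function(x, hash_size) {
  sum(as.integer(charToRaw(x))) %% hash_size
}

hash_bucket("setosa", 16)      # byte sum 655 -> bucket 15
hash_bucket("versicolor", 16)
```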
42/66
Feature Hashing
· Choose a hash function and use it to hash all the categorical features.
· The hash function does not require global information.
43/66
An Example of Feature Hashing of Criteo's Data
V15 V16 V17
68fd1e64 80e26c9b fb936136
68fd1e64 f0cf0024 6f67f7e5
287e684f 0a519c5c 02cf9876
68fd1e64 2c16a946 a9a87e68
8cf07265 ae46a29d c81688bb
05db9164 6c9c9cf3 2730ec9c
· The categorical variables have been hashed onto 32 bits for anonymization purposes.
· Let us use the last 4 bits as the hash result, i.e. the hash function is function(x) x %% 16.
· The size of the vector is 2^4 = 16, which is called the hash size.
44/66
An Example of Feature Hashing of Criteo's Data
V15 V16 V17
68fd1e64 80e26c9b fb936136
68fd1e64 f0cf0024 6f67f7e5
287e684f 0a519c5c 02cf9876
68fd1e64 2c16a946 a9a87e68
8cf07265 ae46a29d c81688bb
05db9164 6c9c9cf3 2730ec9c
· 68fd1e64, 80e26c9b, fb936136 => 4, b, 6
- (0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
· 68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5
- (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
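This row-to-vector step can be sketched in base R: the last hex digit of each 32-bit value is exactly the value modulo 16, and tabulate counts how many features fall into each bucket. Here bucket h fills position h of the vector; a hash value of 0 would need an explicit offset, which this sketch omits:

```r
# One Criteo-style row: three 32-bit hashed categorical features
row <- c("68fd1e64", "80e26c9b", "fb936136")

# Last hex digit == value %% 16, i.e. our 4-bit hash result
h <- strtoi(substring(row, 8, 8), base = 16L)
h  # 4 11 6

# Count how many features land in each of the 16 buckets
v <- tabulate(h, nbins = 16)
```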
45/66
Hash Collision
68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5
- (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
· A hash function is a many-to-one mapping, so different features might be mapped to the same index. This is called a collision.
· From the perspective of statistics, a collision makes the effects of the colliding features confounded.
46/66
Choosing Hash Function
· Low collision rate
- Real features, such as text features, are not uniformly distributed in ℤ.
· High throughput
· FeatureHashing uses the MurmurHash3 algorithm implemented in digest.
47/66
Pros of FeatureHashing
A Good Companion of Online Algorithm
library(FeatureHashing)
library(data.table) # provides fread

hash_size <- 2^16
w <- numeric(hash_size)
for (i in 1:1000) {
  data <- fread(paste0("criteo", i))
  X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
  y <- data$V1
  update_w(w, X, y) # user-supplied online update
}
48/66
Pros of FeatureHashing
A Good Companion of Distributed Algorithm
library(pbdMPI)
library(FeatureHashing)
library(data.table) # provides fread

hash_size <- 2^16
w <- numeric(hash_size)
i <- comm.rank() # each MPI worker reads its own shard
data <- fread(paste0("criteo", i))
X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
y <- data$V1
# ...
49/66
Pros of FeatureHashing
Simple Training and Testing
library(FeatureHashing)
hash_size <- 2^16
model <- is_click ~ ad * (url + ip)
m_train <- hashed.model.matrix(model, data_train, hash_size)
m_test <- hashed.model.matrix(model, data_test, hash_size)
50/66
Cons of FeatureHashing
Hash Size
51/66
Cons of FeatureHashing
Lose Interpretation
· Collision makes interpretation harder.
· It is inconvenient to map the indices back to features.
m <- hashed.model.matrix(~ Species, iris, hash.size = 2^4, create.mapping = TRUE)
hash.mapping(m) %% 2^4
## Speciessetosa Speciesvirginica Speciesversicolor
## 7 13 8
52/66
The Result of the Competition...
· No. 1: Field-aware Factorization Machines
· No. 3, no. 4 and no. 9 (us): Neural Networks
- The hashing trick did not contribute any improvement on the leaderboard. They applied it only because it made feature generation easier.
53/66
Extending Formula Interface in R
54/66
· The formula is the most R-style interface.
55/66
Tips
· terms.formula and its specials argument
tf <- terms.formula(~ Plant * Type + conc * split(Treatment), specials = "split",
data = CO2)
attr(tf, "factors")
## Plant Type conc split(Treatment) Plant:Type
## Plant 1 0 0 0 1
## Type 0 1 0 0 1
## conc 0 0 1 0 0
## split(Treatment) 0 0 0 1 0
## conc:split(Treatment)
## Plant 0
## Type 0
## conc 1
## split(Treatment) 1
56/66
Specials
· attr(tf, "specials") tells which rows of attr(tf, "factors") need to be parsed further.
rownames(attr(tf, "factors"))
## [1] "Plant" "Type" "conc"
## [4] "split(Treatment)"
attr(tf, "specials")
## $split
## [1] 4
57/66
Parse
· parse extracts the information from the specials.
options(keep.source = TRUE)
p <- parse(text = rownames(attr(tf, "factors"))[4])
getParseData(p)
## line1 col1 line2 col2 id parent token terminal text
## 9 1 1 1 16 9 0 expr FALSE
## 1 1 1 1 5 1 3 SYMBOL_FUNCTION_CALL TRUE split
## 3 1 1 1 5 3 9 expr FALSE
## 2 1 6 1 6 2 9 '(' TRUE (
## 4 1 7 1 15 4 6 SYMBOL TRUE Treatment
## 6 1 7 1 15 6 9 expr FALSE
## 5 1 16 1 16 5 9 ')' TRUE )
58/66
Efficiency
· Rcpp
- The core functions are implemented in C++
· Invoking external C functions
- digest exposes the C function:
  - Register the C function
  - Add the helper header file
- FeatureHashing imports the C function:
  - Import the C function
59/66
Let's Discuss the Implementation Later
source: http://www.commpraxis.com/wp-content/uploads/2015/01/commpraxis-audience-confusion.png
60/66
Upcoming Features
61/66
FeatureHashing + jiebaR
library(FeatureHashing)
df <- data.frame(title = c(
" 33 ",
" 4 ",
" …// "
))
init_jiebaR_callback()
m <- hashed.model.matrix(~ jiebaR(title), df, create.mapping = TRUE)
· Requires jiebaR-devel and FeatureHashing-devel:
- devtools::install_github('qinwf/jiebaR')
- devtools::install_github('wush978/FeatureHashing', ref = 'dev/0.10')
62/66
Customized Formula Special Function
· Define the special function in R or C++ (through Rcpp)
· Register the function with FeatureHashing
· Call these functions through the formula interface of FeatureHashing
· Might be useful for n-grams
63/66
Summary
When should I use FeatureHashing?
Short answer: predictive analysis with a large number of categories.
· Pros of FeatureHashing
- Makes it easier to do feature vectorization for predictive analysis.
- Makes it easier to tokenize text data.
· Cons of FeatureHashing
- Decreases prediction accuracy if the hash size is too small.
- Interpretation becomes harder.
64/66
Q&A
65/66
Thank You
66/66

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Recently uploaded (20)

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Introduction of Feature Hashing

  • 2. Wush Chi-Hsuan Wu Ph.D. Student Cofounder of Taiwan R User Group · Display Advertising and Real Time Bidding Large-Scale Machine Learning - - · Machine Learning and Data Mining Monday R Workshop in Data Science Conference in Taiwan 2014 and 2015 - - 2/66
  • 3. source: http://www.nbcnews.com Taiwan, a Mountainous Island Highest Mountain: Yushan (3952m)· 3/66
  • 4. Taipei City and Taipei 101 source: https://c2.staticflickr.com/6/5208/5231215951_c0e0036b17_b.jpg 4/66
  • 5. Taiwan R User Group http://www.meetup.com/Taiwan-R 5/66
  • 6. An Example of FeatureHashing 6/66
  • 7. Sentiment Analysis Provided by Lewis Crouch Show the pros of using FeatureHashing · · 7/66
  • 8. Dataset: IMDB Predict the rating of the movie according to the text in the review. Training Dataset: 25000 reviews Binary response: positive or negative. · · · ## [1] 1 Cleaned review· ## [1] "kurosawa is a proved humanitarian this movie is totally about people living in" ## [2] "poverty you will see nothing but angry in this movie it makes you feel bad but" ## [3] "still worth all those who s too comfortable with materialization should spend 2" ## [4] "5 hours with this movie" 8/66
  • 9. Word Segmentation ## [1] "" "kurosawa" "is" ## [4] "a" "proved" "humanitarian" ## [7] "" "this" "movie" ## [10] "is" "totally" "about" ## [13] "people" "living" "in" ## [16] "poverty" "" "you" ## [19] "will" "see" "nothing" ## [22] "but" "angry" "in" ## [25] "this" "movie" "" ## [28] "it" "makes" "you" ## [31] "feel" "bad" "but" ## [34] "still" "worth" "" ## [37] "all" "those" "who" ## [40] "s" "too" "comfortable" ## [43] "with" "materialization" "should" ## [46] "spend" "2" "5" ## [49] "hours" "with" "this" ## [52] "movie" "" 9/66
  • 10. Word Segmentation and Feature Extraction ID MESSAGE HATE BAD LIKE GRATUITOUS "5814_8" TRUE TRUE TRUE TRUE FALSE "2381_9" FALSE FALSE FALSE TRUE FALSE "7759_3" FALSE FALSE FALSE TRUE FALSE "3630_4" FALSE FALSE FALSE FALSE FALSE "9495_8" FALSE FALSE FALSE FALSE TRUE "8196_8" FALSE FALSE TRUE TRUE TRUE 10/66
  • 11. FeatureHashing FeatureHashing implements split in the formula interface. · hash_size <- 2^16 m.train <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"), data = imdb.train, hash.size = hash_size, signed.hash = FALSE) m.valid <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"), data = imdb.valid, hash.size = hash_size, signed.hash = FALSE) 11/66
  • 12. Type of split existence count tf-idf which is contributed by Michaël Benesty and will be announced in v0.9.1 · · · 12/66
  • 13. Gradient Boosted Decision Tree with xgboost dtrain <- xgb.DMatrix(m.train, label = imdb.train$sentiment) dvalid <- xgb.DMatrix(m.valid, label = imdb.valid$sentiment) watch <- list(train = dtrain, valid = dvalid) g <- xgb.train(booster = "gblinear", nrounds = 100, eta = 0.0001, max.depth = 2, data = dtrain, objective = "binary:logistic", watchlist = watch, eval_metric = "auc") [0] train-auc:0.969895 valid-auc:0.914488 [1] train-auc:0.969982 valid-auc:0.914621 [2] train-auc:0.970069 valid-auc:0.914766 ... [97] train-auc:0.975616 valid-auc:0.922895 [98] train-auc:0.975658 valid-auc:0.922952 [99] train-auc:0.975700 valid-auc:0.923014 13/66
  • 14. Performance The AUC of g in the public leaderboard is 0.90110. · It outperforms the benchmark in Kaggle. - 14/66
  • 15. The Purpose of FeatureHashing Make our life easier· 15/66
  • 17. Algorithm in Textbook Regression Real Data SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 51 7.0 3.2 4.7 1.4 versicolor 52 6.4 3.2 4.5 1.5 versicolor 101 6.3 3.3 6.0 2.5 virginica 102 5.8 2.7 5.1 1.9 virginica y = Xβ + ε How to convert the real data to X? · 17/66
  • 18. Feature Vectorization and Formula Interface (INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA 1 1 5.1 3.5 1.4 0.2 0 0 2 1 4.9 3.0 1.4 0.2 0 0 51 1 7.0 3.2 4.7 1.4 1 0 52 1 6.4 3.2 4.5 1.5 1 0 101 1 6.3 3.3 6.0 2.5 0 1 102 1 5.8 2.7 5.1 1.9 0 1 is usually constructed via model.matrix in R· X model.matrix(~ ., iris.demo) 18/66
  • 19. Formula Interface y ~ a + b· y is the response a and b are predictors - - 19/66
  • 20. Formula Interface: + (INTERCEPT) A B 1 1 5.1 3.5 2 1 4.9 3.0 51 1 7.0 3.2 52 1 6.4 3.2 101 1 6.3 3.3 102 1 5.8 2.7 + is the operator of combining linear predictors· model.matrix(~ a + b, data.demo) 20/66
  • 21. Formula Interface: : (INTERCEPT) A B A:B 1 1 5.1 3.5 17.85 2 1 4.9 3.0 14.70 51 1 7.0 3.2 22.40 52 1 6.4 3.2 20.48 101 1 6.3 3.3 20.79 102 1 5.8 2.7 15.66 : is the interaction operator· model.matrix(~ a + b + a:b, data.demo) 21/66
  • 22. Formula Interface: * (INTERCEPT) A B A:B 1 1 5.1 3.5 17.85 2 1 4.9 3.0 14.70 51 1 7.0 3.2 22.40 52 1 6.4 3.2 20.48 101 1 6.3 3.3 20.79 102 1 5.8 2.7 15.66 * is the operator of cross product· # a + b + a:b model.matrix(~ a * b, data.demo) 22/66
  • 23. Formula Interface: ( (INTERCEPT) A:C B:C 1 1 7.14 4.90 2 1 6.86 4.20 51 1 32.90 15.04 52 1 28.80 14.40 101 1 37.80 19.80 102 1 29.58 13.77 : and * are distributive over +· # a:c + b:c model.matrix(~ (a + b):c, data.demo) 23/66
  • 24. Formula Interface: . (INTERCEPT) SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIESVERSICOLOR SPECIESVIRGINICA 1 1 5.1 3.5 1.4 0.2 0 0 2 1 4.9 3.0 1.4 0.2 0 0 51 1 7.0 3.2 4.7 1.4 1 0 52 1 6.4 3.2 4.5 1.5 1 0 101 1 6.3 3.3 6.0 2.5 0 1 102 1 5.8 2.7 5.1 1.9 0 1 . means all columns of the data.· # ~ Sepal.Length + Sepal.Width + Petal.Length + # Petal.Width + Species model.matrix(~ ., iris.demo) 24/66
  • 27. Categorical Feature in R Dummy Coding A categorical variable of K categories is transformed to a (K − 1)-dimensional vector. There are many coding systems and the most commonly used is Dummy Coding. · · The first category is transformed to 0⃗. - ## versicolor virginica ## setosa 0 0 ## versicolor 1 0 ## virginica 0 1 27/66
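The dummy coding above can be sketched outside of R. The helper below is hypothetical (not part of FeatureHashing or R's `contr.treatment`); it only illustrates why K categories need K − 1 indicator columns, with the first category mapping to the zero vector:

```python
def dummy_code(value, levels):
    """Drop the first level: each remaining level gets one indicator
    column, so the first category maps to the zero vector."""
    return [1 if value == level else 0 for level in levels[1:]]

levels = ["setosa", "versicolor", "virginica"]
print(dummy_code("setosa", levels))      # [0, 0]
print(dummy_code("versicolor", levels))  # [1, 0]
print(dummy_code("virginica", levels))   # [0, 1]
```

This mirrors the `contr.treatment` table on the slide: setosa is the reference level.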
  • 28. Categorical Feature in Machine Learning Predictive analysis Regularization The categorical variables of K categories are transformed to a K-dimensional vector. · · · The missing data are transformed to 0⃗. - contr.treatment(levels(iris.demo$Species), contrasts = FALSE) ## setosa versicolor virginica ## setosa 1 0 0 ## versicolor 0 1 0 ## virginica 0 0 1 28/66
  • 29. Motivation of the FeatureHashing 29/66
  • 30. Kaggle: Display Advertising Challenge Given a user and the page he is visiting, what is the probability that he will click on a given ad? · 30/66
  • 31. The Size of the Dataset 13 integer features and 26 categorical features · 7 × 10^7 instances · Download the dataset via this link · 31/66
  • 32. Vectorize These Features in R After binning the integer features, I got 3 × 10^7 categories. · Sparse matrix was required · A dense matrix required 14PB memory... A sparse matrix required 40GB memory... - - 32/66
  • 33. Estimating Computing Resources Dense matrix: nrow (7 × 10^7) × ncol (3 × 10^7) × 8 bytes · Sparse matrix: nrow (7 × 10^7) × (13 + 26) × (4 + 4 + 8) bytes · 33/66
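The slide's estimates can be checked with a few lines of arithmetic (a sketch; the per-entry byte sizes are the slide's own assumptions: 8-byte doubles for the dense matrix, and a 4-byte row index, 4-byte column index, and 8-byte value per non-zero entry for the sparse matrix):

```python
n_rows = 7 * 10**7            # instances
n_cols = 3 * 10**7            # categories after binning

# Dense: one 8-byte double per cell.
dense_bytes = n_rows * n_cols * 8

# Sparse: 13 integer + 26 categorical features per row, each stored
# as (row index, column index, value) = (4 + 4 + 8) bytes.
sparse_bytes = n_rows * (13 + 26) * (4 + 4 + 8)

print(dense_bytes / 2**50)    # ~14.9, i.e. about 14PB as the slide says
print(sparse_bytes / 2**30)   # ~40.7, i.e. about 40GB
```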
  • 34. Vectorize These Features in R sparse.model.matrix is similar to model.matrix but returns a sparse matrix. · 34/66
  • 35. Fit Model with Limited Memory A post Beat the benchmark with less than 200MB of memory describes how to fit a model with limited resources. · online logistic regression feature hash trick adaptive learning rate - - - 35/66
  • 36. Online Machine Learning In online learning, the model parameter is updated after the arrival of every new datapoint. In batch learning, the model parameter is updated after access to the entire dataset. · Only the aggregated information and the incoming datapoint are used. - · 36/66
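A minimal sketch of the online idea (plain SGD for logistic regression; not the deck's actual algorithm, which also uses the hash trick and an adaptive learning rate): the weight vector is the only state kept, and each datapoint is used once and discarded.

```python
import math

def online_update(w, x, y, eta=0.1):
    """One SGD step of logistic regression on a single datapoint.
    x is a sparse feature dict {index: value}, y is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-sum(w[i] * v for i, v in x.items())))
    for i, v in x.items():
        w[i] -= eta * (p - y) * v

w = [0.0] * 8                                   # toy hash_size = 8
stream = [({0: 1.0, 3: 1.0}, 1), ({0: 1.0, 5: 1.0}, 0)]
for x, y in stream:                             # one pass over the stream
    online_update(w, x, y)
```

Memory usage is proportional to `len(w)`, not to the number of instances, which is why this style pairs well with large datasets.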
  • 37. Memory Requirement of Online Learning Related to model parameters· Require little memory for a large number of instances.- 37/66
  • 38. Limitation of model.matrix The vectorization requires all categories.· contr.treatment(levels(iris$Species)) ## versicolor virginica ## setosa 0 0 ## versicolor 1 0 ## virginica 0 1 38/66
  • 39. My Workaround Scan all data to retrieve all categories Vectorize features in an online fashion The overhead of exploring features increases · · · 39/66
  • 40. Observations Mapping features to {0, 1, 2, . . . , K} is one method of vectorizing features. · setosa => e⃗1 - versicolor => e⃗2 - virginica => e⃗3 - ## setosa versicolor virginica ## setosa 1 0 0 ## versicolor 0 1 0 ## virginica 0 0 1 40/66
  • 41. Observations contr.treatment ranks all categories to map the feature to an integer. · What if we do not know all categories? · Digital features are integers. - We use a function that maps ℤ to {0, 1, 2, . . . , K}. - charToRaw("setosa") ## [1] 73 65 74 6f 73 61 41/66
  • 42. What is Feature Hashing? A method to do feature vectorization with a hash function · For example, Mod %% is a family of hash functions. · 42/66
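For instance, varying the modulus picks a different member of the family: each h_m(x) = x mod m maps any integer onto the fixed range {0, ..., m − 1}. A Python sketch of the idea (R's `%%` is `%` here):

```python
def make_mod_hash(m):
    """Return h_m(x) = x % m, one member of the mod family."""
    return lambda x: x % m

h16 = make_mod_hash(16)
# Any integer, however large, lands in {0, ..., 15}.
print(h16(0x68fd1e64))   # 4
print(h16(123456789))    # 5
```

This is why no global information about the categories is needed: the output range is fixed by m alone.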
  • 43. Feature Hashing Choose a hash function and use it to hash all the categorical features. The hash function does not require global information. · · 43/66
  • 44. An Example of Feature Hashing of Criteo's Data V15 V16 V17 68fd1e64 80e26c9b fb936136 68fd1e64 f0cf0024 6f67f7e5 287e684f 0a519c5c 02cf9876 68fd1e64 2c16a946 a9a87e68 8cf07265 ae46a29d c81688bb 05db9164 6c9c9cf3 2730ec9c The categorical variables have been hashed onto 32 bits for anonymization purposes. Let us use the last 4 bits as the hash result, i.e. the hash function is function(x) x %% 16 The size of the vector is 2^4, which is called the hash size · · · 44/66
  • 45. An Example of Feature Hashing of Criteo's Data V15 V16 V17 68fd1e64 80e26c9b fb936136 68fd1e64 f0cf0024 6f67f7e5 287e684f 0a519c5c 02cf9876 68fd1e64 2c16a946 a9a87e68 8cf07265 ae46a29d c81688bb 05db9164 6c9c9cf3 2730ec9c 68fd1e64, 80e26c9b, fb936136 => 4, b, 6 68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5 · - (0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0) · - (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) 45/66
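The two rows on this slide can be reproduced by hand (a sketch; note Python indices are 0-based while the R vectors on the slide are 1-based, so hash value 4 is `vec[4]` here but position 4 counting from 1 on the slide):

```python
def hash_row(row, hash_size=16):
    """Hash each 32-bit hex feature to its last 4 bits and count hits."""
    vec = [0] * hash_size
    for feature in row:
        vec[int(feature, 16) % hash_size] += 1
    return vec

print(hash_row(["68fd1e64", "80e26c9b", "fb936136"]))  # 1s at slots 4, 6, 11
print(hash_row(["68fd1e64", "f0cf0024", "6f67f7e5"]))  # 2 at slot 4, 1 at slot 5
```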
  • 46. Hash Collision 68fd1e64, f0cf0024, 6f67f7e5 => 4, 4, 5 A hash function is a many-to-one mapping, so different features might be mapped to the same index. This is called collision. From the perspective of statistics, the collision in the hash function makes the effect of these features confounding. · - (0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) · · 46/66
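The collision on this slide is easy to verify: the two distinct features end in the same hex digit, so with 16 buckets they share a slot, and a linear model can only learn the sum of their effects (sketch):

```python
a = int("68fd1e64", 16)   # first feature of the slide's second row
b = int("f0cf0024", 16)   # second feature of the same row
assert a != b             # different features...
assert a % 16 == b % 16   # ...same bucket: a collision
print(a % 16)             # 4
```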
  • 47. Choosing Hash Function Low collision rate · Real features, such as text features, are not uniformly distributed in ℤ. - High throughput · FeatureHashing uses the MurmurHash3 algorithm implemented by digest · 47/66
  • 48. Pros of FeatureHashing A Good Companion of Online Algorithm library(FeatureHashing) hash_size <- 2^16 w <- numeric(hash_size) for(i in 1:1000) { data <- fread(paste0("criteo", i)) X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size) y <- data$V1 update_w(w, X, y) } 48/66
  • 49. Pros of FeatureHashing A Good Companion of Distributed Algorithm library(pbdMPI) library(FeatureHashing) hash_size <- 2^16 w <- numeric(hash_size) i <- comm.rank() data <- fread(paste0("criteo", i)) X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size) y <- data$V1 # ... 49/66
  • 50. Pros of FeatureHashing Simple Training and Testing library(FeatureHashing) model <- is_click ~ ad * (url + ip) m_train <- hashed.model.matrix(model, data_train, hash_size) m_test <- hashed.model.matrix(model, data_test, hash_size) 50/66
  • 52. Cons of FeatureHashing Lose Interpretation Collision makes the interpretation harder. It is inconvenient to reverse the indices to features. · · m <- hashed.model.matrix(~ Species, iris, hash.size = 2^4, create.mapping = TRUE) hash.mapping(m) %% 2^4 ## Speciessetosa Speciesvirginica Speciesversicolor ## 7 13 8 52/66
  • 53. The Result of the Competition... No.1: Field-aware Factorization Machine No.3, no.4 and no.9 (we): Neural Network · The hashing trick does not contribute any improvement on the leaderboard. They apply the hashing trick only because it makes their life easier to generate features. - · 53/66
  • 55. Formula is the most R-style interface· 55/66
  • 56. Tips terms.formula and its argument specials· tf <- terms.formula(~ Plant * Type + conc * split(Treatment), specials = "split", data = CO2) attr(tf, "factors") ## Plant Type conc split(Treatment) Plant:Type ## Plant 1 0 0 0 1 ## Type 0 1 0 0 1 ## conc 0 0 1 0 0 ## split(Treatment) 0 0 0 1 0 ## conc:split(Treatment) ## Plant 0 ## Type 0 ## conc 1 ## split(Treatment) 1 56/66
  • 57. Specials attr(tf, "specials") tells which rows of attr(tf, "factors") need to be parsed further · rownames(attr(tf, "factors")) ## [1] "Plant" "Type" "conc" ## [4] "split(Treatment)" attr(tf, "specials") ## $split ## [1] 4 57/66
  • 58. Parse parse extracts the information from the specials· options(keep.source = TRUE) p <- parse(text = rownames(attr(tf, "factors"))[4]) getParseData(p) ## line1 col1 line2 col2 id parent token terminal text ## 9 1 1 1 16 9 0 expr FALSE ## 1 1 1 1 5 1 3 SYMBOL_FUNCTION_CALL TRUE split ## 3 1 1 1 5 3 9 expr FALSE ## 2 1 6 1 6 2 9 '(' TRUE ( ## 4 1 7 1 15 4 6 SYMBOL TRUE Treatment ## 6 1 7 1 15 6 9 expr FALSE ## 5 1 16 1 16 5 9 ')' TRUE ) 58/66
  • 59. Efficiency Rcpp Invoking external C functions · The core functions are implemented in C++- · digest exposes the C function: FeatureHashing imports the C function: - Register the C function Add the helper header file - - - Import the C function- 59/66
  • 60. Let's Discuss the Implementation Later source: http://www.commpraxis.com/wp-content/uploads/2015/01/commpraxis- audience-confusion.png 60/66
  • 62. FeatureHashing + jiebaR library(FeatureHashing) df <- data.frame(title = c( " 33 ", " 4 ", " …// " )) init_jiebaR_callback() m <- hashed.model.matrix(~ jiebaR(title), df, create.mapping = TRUE) Require jiebaR-devel and FeatureHashing-devel· devtools::install_github('qinwf/jiebaR') devtools::install_github('wush978/FeatureHashing', ref = 'dev/0.10') - - 62/66
  • 63. Customized Formula Special Function Define the special function in R or C++ (through Rcpp) Register the function to FeatureHashing Call these functions before FeatureHashing via the Formula Interface Might be useful for n-gram · · · · 63/66
  • 64. Summary When should I use FeatureHashing? Short Answer: Predictive Analysis with a large number of categories. Pros of FeatureHashing Cons of FeatureHashing · Make it easier to do feature vectorization for predictive analysis. Make it easier to tokenize the text data. - - · Decrease the prediction accuracy if the hash size is too small. Interpretation becomes harder. - - 64/66