6. Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs with SQL abstraction
This SQL query automatically runs in parallel on Hadoop
2015/10/20 Hivemall meetup #2 6
7. List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using kNN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3
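The two feature scaling options named in the list (normalization and z-score) can be illustrated with a small Python sketch. It shows the arithmetic only and is not Hivemall's implementation; the function names min_max_scale and z_score are hypothetical, and the sample column is made up.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

column = [2.0, 4.0, 6.0, 8.0]   # made-up feature values
scaled = min_max_scale(column)
standardized = z_score(column)
print(scaled)
print(standardized)
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values, while z-scores are centered on zero, which tends to suit gradient-based learners.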
8. Industry use cases of Hivemall
➢ CTR prediction of ad click logs
• Algorithm: Logistic regression
• Freakout Inc. and more
➢ Gender prediction of ad click logs
• Algorithm: Classification
• Scaleout Inc.
➢ Churn detection
• Algorithm: Regression
• OISIX and more
➢ Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Adtech companies, ISP portal, and more
➢ Value prediction of real estate
• Algorithm: Regression
• Livesense
10. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
rowid int,
label float,
features ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
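To make the declared layout concrete, the following Python sketch parses one text line the way the table definition describes it: fields separated by tabs, array items separated by commas. The sample row is made up, the helper names are hypothetical, and the "name:value" form of each feature string is an assumption taken from the sample rows shown on the anomaly detection slide of this deck.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features) per the table layout."""
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")

def parse_feature(item):
    """Split a 'name:value' feature string into (name, value)."""
    name, value = item.rsplit(":", 1)
    return name, float(value)

# Made-up row for illustration, not taken from the E2006-tfidf dataset
sample = "1\t-3.5\t101:0.25,205:0.5,307:0.125\n"
rowid, label, features = parse_row(sample)
pairs = [parse_feature(f) for f in features]
print(rowid, label, pairs)
```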
14. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
Map-only task to learn a prediction model
Shuffle map outputs to reducers by feature
Reducers perform model averaging in parallel
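The shuffle and averaging steps of the query can be mimicked in a few lines of Python. This is a sketch of the data flow only: the per-mapper (feature, weight) rows are made up, average_models is a hypothetical name, and the learning step performed by logress inside each map task is not reproduced.

```python
from collections import defaultdict

def average_models(mapper_outputs):
    """Average the weight of each feature over all mappers that emitted it."""
    grouped = defaultdict(list)
    for rows in mapper_outputs:
        for feature, weight in rows:
            grouped[feature].append(weight)   # shuffle: group weights by feature
    # reduce: one avg(weight) per feature, as in GROUP BY feature
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}

# Made-up (feature, weight) rows emitted by three map tasks
mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],
    [("f1", 1.0), ("f3", 0.25)],
    [("f1", 1.5), ("f2", -2.0)],
]
model = average_models(mapper_outputs)
print(model)
```

Because each feature is averaged independently, the reduce step parallelizes naturally across features.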
15. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of a Confidence Weighted (CW) classifier
voted_avg votes on whether to average the positive or the negative weights
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
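One reading of the voting rule, applied to the five example weights on this slide, can be sketched in Python as follows. The function below is an illustration written for this note, not Hivemall's voted_avg source; in particular, sending ties to the negative side and returning 0.0 for all-zero input are assumptions, since the slide does not cover those cases.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0                      # assumption: nothing to vote on
    # Assumption: a tie goes to the negative side
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)

weights = [0.7, 0.3, 0.2, -0.1, 0.7]    # the example weights from the slide
result = voted_avg(weights)
plain_mean = sum(weights) / len(weights)
print(result, plain_mean)
```

With four positive votes against one negative, only the positive weights are averaged (about 0.475), whereas a plain mean would be pulled down to about 0.36 by the dissenting weight.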
16. Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
label,
cast(feature as int) as feature,
cast(voted_avg(weight) as float) as weight
from
(select
train_multiclass_cw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_arow(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_scw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
) t
group by label,feature;
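The stacking idea can be sketched in Python: rows from the three learners are concatenated (the union all) and then combined per (label, feature) key (the group by). All weights below are made up, stack_and_vote is a hypothetical name, and the local voted_avg is a simplified stand-in for the Hivemall function whose tie handling is an assumption.

```python
from collections import defaultdict

def voted_avg(weights):
    """Simplified stand-in: average the weights on the majority-sign side."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    # Assumption: a tie goes to the negative side
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)

def stack_and_vote(*models):
    """Concatenate model rows, then combine weights per (label, feature)."""
    grouped = defaultdict(list)
    for rows in models:                                # UNION ALL
        for label, feature, weight in rows:
            grouped[(label, feature)].append(weight)   # GROUP BY label, feature
    return {key: voted_avg(ws) for key, ws in grouped.items()}

# Made-up (label, feature, weight) rows from three learners
cw = [(1, 10, 0.5), (2, 10, -0.5)]
arow = [(1, 10, 1.0), (2, 10, -1.5)]
scw = [(1, 10, -0.25), (2, 10, 0.25)]
ensemble = stack_and_vote(cw, arow, scw)
print(ensemble)
```

In this toy input the third learner disagrees with the other two on the sign of both weights, and the vote discards its contribution.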
22. Features to be supported in Hivemall v0.4
1. Random Forest
• classification, regression
• Based on Smile: github.com/haifengl/smile
2. Factorization Machine
• classification, regression (factorization)
v0.4 is planned for release in October.
Factorization Machines are often used by winners of data science competitions (Criteo/Avazu CTR prediction)
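For readers new to the model, a second-order factorization machine scores an input x as w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where each feature i has a latent vector v_i. The Python sketch below evaluates that textbook formula in two ways; it says nothing about how Hivemall v0.4 implements or exposes it, and all parameter values are made up.

```python
def fm_predict(x, w0, w, V):
    """Factorization machine score, using the linear-time pairwise identity."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):                       # loop over latent factors
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same score, summing over every feature pair i < j explicitly."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score

# Made-up parameters: 3 features, 2 latent factors
x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.5, -1.0, 0.25]
V = [[1.0, 0.5], [0.0, 1.0], [0.5, 1.0]]
fast = fm_predict(x, w0, w, V)
naive = fm_predict_naive(x, w0, w, V)
print(fast, naive)
```

The first function avoids the quadratic loop over feature pairs, which is what makes the model practical on sparse, high-dimensional data such as CTR logs.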
40. Conclusion and Takeaway
New features in v0.4
• Random Forest
• Factorization Machine
More will follow in v0.4.1
Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
The latest version of Hivemall is available on Treasure Data and used by several companies, including OISIX, Livesense, Scaleout, and Freakout.
49. Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Anomaly detection runs as a series of SQL queries
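The feature list earlier in the deck names Local Outlier Factor (LOF) as the anomaly detection method. The textbook LOF computation can be sketched in Python as below. This is not the series of SQL queries the slide refers to; the points are made-up 2-D data rather than the sensor rows above, distance ties are broken by index, and the sketch assumes no duplicate points.

```python
import math

def lof_scores(points, k):
    """Textbook Local Outlier Factor for a small list of numeric points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding the point itself
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]

    def local_reachability_density(i):
        reach = [max(k_dist[o], dist[i][o]) for o in neighbours[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbour density relative to the point's own density
    return [sum(lrd[o] for o in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Made-up data: four clustered points and one far-away point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
print(scores)
```

Scores near 1 indicate a point whose neighbourhood is about as dense as its neighbours' neighbourhoods; the isolated point receives a score well above 1.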