Introduction to Hivemall
and its new features in v0.4
Research	Engineer
Makoto	YUI	@myui
2015/10/20	Hivemall	meetup	#2 1
Tweet	w/	#hivemallmtup
Who am I?
Ø 2015.04 Joined Treasure Data, Inc.
1st Research Engineer at Treasure Data
My mission at TD is developing ML-as-a-Service
Ø 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan.
Worked on a large-scale machine learning project and parallel databases
Ø 2009.03 Ph.D. in Computer Science from NAIST
Ø Super programmer award from the MITOU
1. What	is	Hivemall
2. How	to	use	Hivemall
3. New	Features	in	Hivemall	v0.4
1. Random	Forest
2. Factorization	Machine
4. Development	Roadmap	of	Hivemall
What	is	Hivemall
Scalable	machine	learning	library	built	as	a	collection	of	
Hive	UDFs,	licensed	under	the	Apache	License	v2
What	is	Hivemall
[Figure: software stack, top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / PIG
Parallel Data Processing Framework: MR v1, MR v2, Apache Tez (DAG processing)
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS
Hivemall’s Vision:	ML	on	SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓Machine	Learning	made	easy	for	SQL	
developers	(ML	for	the	rest	of	us)
✓Interactive	and	Stable	APIs	w/ SQL	abstraction
This	SQL	query	automatically	runs	in	
parallel	on	Hadoop	
List	of	Features	in	Hivemall	v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure	Data	supports	Hivemall	v0.3.2-3
Industry use cases of Hivemall
Ø CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc. and more
Ø Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
Ø Churn Detection
• Algorithm: Regression
• OISIX and more
Ø Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Adtech Companies, ISP portal, and more
Ø Value prediction of real estate
• Algorithm: Regression
• Livesense
How to use Hivemall
[Figure: workflow diagram. Labels: Feature Vector; Data preparation]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
[Figure: workflow diagram. Labels: Feature Vector; Feature Engineering]
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select ... as label, -- transforming a label value to a value between 0.0 and 1.0
  ...
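Min-max normalization maps a value v to (v - min) / (max - min), so the smallest label becomes 0.0 and the largest becomes 1.0. A minimal Python sketch of that arithmetic follows; the function name `rescale_min_max` and the sample labels are illustrative assumptions, not Hivemall's API or the E2006-tfidf data.

```python
def rescale_min_max(value, min_value, max_value):
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range: all values are identical.
        # Returning 0.5 is an arbitrary choice made for this sketch.
        return 0.5
    return (value - min_value) / (max_value - min_value)


labels = [-7.9, -3.2, -0.5]  # made-up label values
lo, hi = min(labels), max(labels)
scaled = [rescale_min_max(v, lo, hi) for v in labels]
```

In a SQL workflow, min and max would typically be computed once over the training table and then passed into the scaling expression as constants.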
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature; -- shuffle map-outputs to reducers by feature
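The query's data flow can be imitated in a few lines of Python: each mapper emits (feature, weight) pairs from the model it learned on its own split, and the reducer side averages the weights per feature. This is only a sketch of the averaging step; `average_models` and the sample pairs are hypothetical, and the per-mapper learning itself is not shown.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average weights per feature over the (feature, weight) pairs of all mappers."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in mapper_outputs:          # one list of pairs per mapper
        for feature, weight in pairs:     # corresponds to GROUP BY feature
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}  # corresponds to avg(weight)


mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],  # model learned by mapper 1 (made-up values)
    [("f1", 1.5), ("f3", 2.0)],   # model learned by mapper 2 (made-up values)
]
model = average_models(mapper_outputs)
```

A feature seen by only one mapper keeps that mapper's weight, because the average runs over the mappers that actually emitted it.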
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM (
  SELECT ... as (feature,weight) -- training for the CW classifier
  FROM ...
) t
GROUP BY feature;
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
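The voting idea can be sketched in Python, assuming the semantics the slide describes: count the positive and negative weights, let the majority sign win, and average only the weights of the winning sign. How zeros and ties are handled below is an assumption of this sketch, not a statement about Hivemall's `voted_avg` implementation.

```python
def voted_avg(weights):
    """Average only the weights whose sign is in the majority."""
    if not weights:
        raise ValueError("voted_avg needs at least one weight")
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w <= 0]  # assumption: zero is counted with the negatives
    # Assumption: on a tie the negative side wins.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


slide_weights = [0.7, 0.3, 0.2, -0.1, 0.7]  # the example weights from the slide
voted = voted_avg(slide_weights)            # four positives outvote one negative
plain = sum(slide_weights) / len(slide_weights)
```

For the slide's weights, the four positive values win the vote, so the result is their mean (1.9 / 4) rather than the plain mean of all five values (1.8 / 5).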
Ensemble learning for stable prediction performance
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select ... as (label,feature,weight) from ...
  union all
  select ... as (label,feature,weight) from ...
  union all
  select ... as (label,feature,weight) from ...
) t -- just stack prediction models by union all
group by label,feature;
How to use Hivemall - Prediction
CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
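The join-then-aggregate pattern can be mimicked in Python to show what the query computes per test row. The sketch below is illustrative only: the function name `predict`, the in-memory dict standing in for `lr_model`, and the sample data are assumptions. One difference from SQL is noted in the comments: a row whose features are all missing from the model gets a sum of 0.0 here, whereas SQL's sum over only NULLs yields NULL.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, lr_model):
    """testing_exploded: iterable of (rowid, feature); lr_model: dict feature -> weight."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature absent from the model contributes nothing.
        # (In SQL a row with no matching feature at all would sum to NULL, not 0.0.)
        sums[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


lr_model = {"a": 2.0, "b": -0.5}                     # made-up weights
testing_exploded = [(1, "a"), (1, "b"), (2, "zz")]   # one (rowid, feature) per line
probs = predict(testing_exploded, lr_model)
```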
How to use Hivemall
[Figure: workflow diagram. Labels: Batch Training on Hadoop; Online Prediction on RDBMS; Feature Vector; prediction model]
Online	Prediction	on	MySQL	(RDBMS)
Quick (msec) response on an RDBMS
by adding an index to the feature column
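A small, runnable sketch of this idea follows. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it needs no server; the index name `idx_lr_model_feature`, the function `predict_online`, and the sample weights are hypothetical. The weight sum is computed in SQL through the indexed `feature` column, and the sigmoid is applied in the application.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.execute("CREATE TABLE lr_model (feature TEXT, weight REAL)")
# The index lets the lookup below avoid a full scan of the model table.
conn.execute("CREATE INDEX idx_lr_model_feature ON lr_model (feature)")
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("a", 2.0), ("b", -0.5), ("c", 1.0)],  # made-up model exported from batch training
)


def predict_online(conn, features):
    """Sum the weights of the given features and squash the sum with a sigmoid."""
    if not features:
        return 0.5  # sigmoid(0): no evidence either way
    placeholders = ",".join("?" for _ in features)
    row = conn.execute(
        f"SELECT COALESCE(SUM(weight), 0.0) FROM lr_model WHERE feature IN ({placeholders})",
        list(features),
    ).fetchone()
    return 1.0 / (1.0 + math.exp(-row[0]))


prob = predict_online(conn, ["a", "b", "unseen"])
```

Only the rows for the requested features are touched, which is why the response time depends on the number of features in the request rather than on the size of the model.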
Features to be supported in Hivemall v0.4
1. RandomForest
• classification, regression
• Based on Smile
2. Factorization Machine
• classification, regression (factorization)
Planned to release v0.4 in Oct.
Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction)
RandomForest	in	Hivemall	v0.4
Ensemble	of	Decision	Trees
Already available on a development (smile) branch
and its usage is explained in the project wiki
Training	of	RandomForest
Out-of-bag	tests	and	Variable	Importance	
Prediction	of	RandomForest
Factorization	Machine
Matrix	Factorization
Factorization	Machine
Context	information	(e.g.,	time)	
can	be	considered
Factorization	Machine
Factorization Model with degree=2 (2-way interaction)
Global Bias
Regression coefficient of j-th variable
Pairwise Interaction
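The three labels above match the three terms of the standard degree-2 Factorization Machine model from the literature. The equation image itself did not survive on this slide, so the following is the textbook formulation rather than a transcription; the symbols (n features, weights w, latent vectors v_j) are the usual ones and are assumed here.

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficient of the $j$-th variable}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interaction}}
```

Each pairwise coefficient is the inner product of two k-dimensional latent vectors, which is the "factorization" part of the model.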
Factorization	Machine
Factorization Machine ≈ Polynomial Regression + Factorization
For	a	feature	[a,	b],	the	degree-2	polynomial	features	are	[1,	a,	b,	a^2,	ab,	b^2].
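The expansion in this example can be reproduced with a short Python sketch; `poly2_features` is a hypothetical helper written for illustration, not a Hivemall function.

```python
from itertools import combinations_with_replacement


def poly2_features(x):
    """Degree-2 polynomial expansion: bias, linear terms, then all degree-2 products."""
    feats = [1.0]                 # bias term
    feats.extend(x)               # a, b, ...
    feats.extend(p * q for p, q in combinations_with_replacement(x, 2))  # a^2, ab, b^2, ...
    return feats


expanded = poly2_features([2.0, 3.0])  # a = 2, b = 3
```

Polynomial regression learns one independent coefficient per product term. In the standard Factorization Machine formulation, the cross-term coefficient (for `ab`) is instead the inner product of two latent vectors, and the squared terms (`a^2`, `b^2`) are not part of the model.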
Factorization	Machine	
Features to	be	supported	in	Hivemall	v0.4.1
1. Gradient Tree Boosting
• classification, regression
2. Field-aware Factorization Machine
• classification, regression (factorization)
• The existing implementation, LibFFM, can only be applied to classification
Planned	to	release	v0.4.1	in	Nov/Dec.
Gradient	Tree	Boosting	(or	Gradient	Boosting	Trees)	
RF	≈	Bagging	+	Decision	Trees	
parallel execution of	decision trees
GBT	≈	Boosting	+	Decision	Trees
Sequential execution of	decision trees
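The contrast between the two ensembles can be seen in a toy Python sketch using one-split regression stumps on 1-D data. It is a generic illustration of bagging versus boosting for squared loss, not Hivemall's or Smile's implementation; all function names, the learning rate, and the data are assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        if not right:
            continue  # a split must leave points on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # all inputs identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm


def bagging(xs, ys, n_trees, rng):
    """Trees are trained independently on bootstrap samples, so they could run in parallel."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(t, x) for t in trees) / len(trees)


def boosting(xs, ys, n_trees, lr=0.5):
    """Each tree fits the residuals left by the previous trees, so training is sequential."""
    trees, residuals = [], list(ys)
    for _ in range(n_trees):
        t = fit_stump(xs, residuals)
        trees.append(t)
        residuals = [r - lr * predict_stump(t, x) for x, r in zip(xs, residuals)]
    return lambda x: sum(lr * predict_stump(t, x) for t in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]  # toy regression target
bagged = bagging(xs, ys, 25, random.Random(0))
boosted_1 = boosting(xs, ys, 1)
boosted_20 = boosting(xs, ys, 20)
```

The loop in `bagging` has no dependency between iterations, while every iteration of `boosting` needs the residuals produced by the one before it. That dependency is what makes gradient boosting harder to parallelize across trees.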
Gradient	Tree	Boosting
Features to	be	supported	in	Hivemall	v0.4.2
1. Online	LDA
• topic	modeling,	clustering
2. Mix	server	on	Apache	YARN
• Service	for	parameter	sharing	among	workers
• working	w/	@maropu
Planned	to	release	v0.4.2	in	Dec/Jan.
What’s Mix Server?
External service to share parameters among distributed
training processes in the middle of training
[Figure: MIX protocol diagram. Labels: Model updates; Async add; Piggy back if …; AVG/Argmin KLD accumulator; hash(feature) % N; Non-blocking Channel (single shared TCP connection w/ TCP keepalive); Mix serv.; is not being blocked]
Taking advantage of asynchronous non-blocking I/O
is the core idea behind Hivemall’s MIX protocol
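Two of the diagram's labels, `hash(feature) % N` and the AVG accumulator, can be sketched in Python. This is a simplified illustration only: the CRC32 hash is a stand-in (the hash function actually used is not stated on the slide), the class and function names are hypothetical, and the asynchronous network protocol is not modeled.

```python
import zlib


def mix_server_for(feature, servers):
    """Route a feature to one of N servers via hash(feature) % N."""
    # CRC32 is used because it is deterministic across processes (an assumption of this sketch).
    return servers[zlib.crc32(feature.encode("utf-8")) % len(servers)]


class AvgAccumulator:
    """Running average of the weights that workers report for one feature."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, weight):
        self.total += weight
        self.count += 1
        return self.total / self.count  # current mixed value


servers = ["host01", "host02", "host03"]
target = mix_server_for("feature#42", servers)

acc = AvgAccumulator()
first = acc.add(1.0)    # update from one worker
second = acc.add(3.0)   # update from another worker for the same feature
```

Because the routing depends only on the feature, every worker sends updates for a given feature to the same server, so that server can mix them without coordinating with the others.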
How to use Mix Server
create table kdd10a_pa1_model1 as
select
  feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
  from ...
) t
group by feature;
Conclusion	and	Takeaway
New	features	in	v0.4
• Random	Forest
• Factorization	Machine
More	will	follow	in	v0.4.1
Next	Actions
• Propose	Hivemall	to
Apache	Incubator
• New	Hivemall	Logo
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
The latest version of Hivemall is available on Treasure Data and is used by several companies, including OISIX, Livesense, Scaleout, and Freakout.
Beyond	Query-as-a-Service!
We							Open-source!	We	invented	..
We are hiring machine learning engineers!
Additional	slides
Rating prediction of a matrix
Can be applied to user/item recommendation
Matrix	Factorization
Factorize a matrix
into a product of matrices
having k latent factors
Matrix Factorization
Criteria of Biased MF
[Equation figure. Labels: Mean Rating; for each user/item]
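The equation on this slide did not survive extraction. For orientation, the commonly used biased matrix factorization objective from the literature is given below; the notation is assumed here (μ for the mean rating, b_u and b_i for the user and item biases, p_u and q_i for the k-dimensional latent factors, λ for the regularization strength, K for the set of observed ratings) and may differ from the slide's.

```latex
\min_{P,\,Q,\,b} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - \mu - b_u - b_i - \mathbf{p}_u^{\top} \mathbf{q}_i \right)^2
  + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 + b_u^2 + b_i^2 \right)
```

The predicted rating is the mean rating plus the two biases plus the inner product of the user and item factors; the λ term penalizes large factors and biases.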
Training	of	Matrix	Factorization
Supports iterative training using a local disk cache
Prediction	of	Matrix	Factorization
Comparison to Spark MLlib
Ø Algorithm is different
Spark: ALS-WR (considers regularization)
Hivemall: Biased-MF (considers regularization and biases)
Spark: 100+ lines of Scala code
Hivemall: SQL (would be easier to use)
Ø Prediction Accuracy
Almost the same for the MovieLens 10M dataset
Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid features
1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.
2 ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.
3 ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.
Anomaly detection runs on a series of SQL queries
Anomalies in Sensor Data
Image	Source:
Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a
point with the densities of its neighbors
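That idea can be made concrete with a brute-force Python sketch of the LOF computation. It is a simplified, single-machine illustration and not the SQL-based procedure used in Hivemall: it takes exactly k neighbors per point (ignoring distance ties), assumes no duplicate points, and the sample points are made up.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point, excluding the point itself
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # distance from each point to its k-th nearest neighbor
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # reachability distance of point i with respect to neighbor j
        return max(k_distance[j], dist[i][j])

    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)  # assumes no duplicate points (mean_reach > 0)

    # LOF: average density of the neighbors relative to the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]  # a tight cluster plus one far point
scores = lof_scores(points, k=2)
```

A score close to 1 means the point is about as dense as its neighbors; a score well above 1 means its neighborhood is much sparser than theirs, which marks it as a candidate outlier. Here the far point `(5, 5)` receives the largest score.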
DEMO:	Local	Outlier	Factor
rowid features
1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.
2 ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.
3 ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.

2nd Hivemall meetup 20151020