A brief survey of great tools for dealing with big datasets. Given as an invited lecture for students taking the Cloud Computing module at Birkbeck and UCL.
Hadoop and beyond: power tools for data mining
1. Hadoop and beyond:
power tools for data mining
Mark Levy, 13 March 2013
Cloud Computing Module
Birkbeck/UCL
2. Hadoop and beyond
Outline:
• the data I work with
• Hadoop without Java
• Map-Reduce unfriendly algorithms
• Hadoop without Map-Reduce
• alternatives in the cloud
• alternatives on your laptop
10. Last.fm datasets
Core datasets:
• 45M users, many active
• 60M artists
• 100M audio fingerprints
• 600M tracks (hmm...)
• 19M physical recordings
• 3M distinct tags
• 2.5M <user,item,tag> taggings per month
• 1B <user,time,track> scrobbles per month
• full user-track graph has ~50B edges
(more often work with ~500M edges)
11. Problem Scenario 1
Need Hadoop, don't want Java:
• need to build prototypes, fast
• need to do interactive data analysis
• want terse, highly readable code
• improve maintainability
• improve correctness
13. Hadoop without Java
Some options:
• Hive (Yahoo!)
• Pig (Yahoo!)
• Cascading (ok it's still Java...)
• Scalding (Twitter)
• Hadoop streaming (various)
not to mention 11 more listed here:
http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html
14. Apache Hive
SQL access to data on Hadoop
pros:
• minimal learning curve
• interactive shell
• easy to check correctness of code
cons:
• can be inefficient
• hard to fix when it is
15. Word count in Hive
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH '/input' OVERWRITE INTO TABLE input;
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(line, ' ')) wTable AS word
GROUP BY word;
[but would you use SQL to count words?]
16. Apache Pig
High level scripting language for Hadoop
pros:
• more primitive operations than Hive (and UDFs)
• more flexible than Hive
• interactive shell
cons:
• harder learning curve than Hive
• tempting to write longer programs, but no code modularity beyond functions
17. Word count in Pig
A = load '/input';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into '/output/wordcount';
[ apply operations to "relations" (bags of tuples) ]
18. Cascading
Java data pipelining for Hadoop
pros:
• as flexible as Pig
• uses a real programming language
• ideal for longer workflows
cons:
• new concepts to learn ("tap", "scheme", "pipe", "flow", ...)
• still verbose (full wordcount ex. code > 150 lines)
19. Word count in Cascading
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, "/input");
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, "/output/wordcount", SinkMode.REPLACE);
Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
20. Scalding
Scala data pipelining for Hadoop
pros:
• as flexible as Pig
• uses a real programming language
• much terser than Java
cons:
• community still small (but in use at Twitter)
• ???
21. Word count in Scalding
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
[and a one-liner to run it]
22. Hadoop streaming
Map-reduce in any language
e.g. Dumbo wrapper for Python
pros:
• use your favourite language for map-reduce
• easy to mix local and cloud processing
cons:
• limited community
• limited functionality beyond map-reduce
23. Word count in Dumbo
import dumbo

def map(key, text):
    # ignore key
    for word in text.split():
        yield word, 1

def reduce(word, counts):
    yield word, sum(counts)

dumbo.run(map, reduce, combiner=reduce)
[and a one-liner to run it]
24. Problem Scenario 1b
Need Hadoop, don't want Java:
• drive native code in parallel
E.g. audio analysis for:
• beat locations, bpm
• key estimation
• chord sequence estimation
• energy
• music/speech?
• ...
25. Audio Analysis
Problem:
• millions of audio tracks on own dfs
• long-running C++ analysis code
• depends on numerous libraries
• verbose output
26. Audio Analysis
Solution:
• bash + Dumbo Hadoop streaming
Outline:
• build C++ code
• zip up binary and libs
• send zipfile and some track IDs to each machine
• extract and run the binary in a map task with subprocess.Popen() (see the sketch below)
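A minimal sketch of such a map task, assuming the zipfile is shipped to each machine as analysis.zip; the binary path and the fetch_audio() helper are hypothetical, not the original code:

import os
import subprocess
import zipfile

def map(key, track_id):
    # unpack the shipped binary and its libraries once per task
    if not os.path.exists("analysis"):
        with zipfile.ZipFile("analysis.zip") as z:
            z.extractall("analysis")
    env = dict(os.environ, LD_LIBRARY_PATH="analysis/lib")
    # fetch_audio() (hypothetical) copies one track from the dfs to local disk
    proc = subprocess.Popen(["analysis/bin/analyse", fetch_audio(track_id)],
                            stdout=subprocess.PIPE, env=env)
    out, _ = proc.communicate()
    # emit the (verbose) analysis output keyed by track
    yield track_id, out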
28. Problem Scenario 2
Map-reduce unfriendly computation:
• iterative algorithms on same data
• huge mapper output ("map-increase")
• curse of slowest reducer
30. Graph Recommendations
Many short routes from U to t ⇒ recommend!
[diagram: user-track graph in which many short paths lead from user U to track t]
31. Graph Recommendations
random walk is equivalent to:
• Label Propagation (Baluja et al., 2008)
• belongs to a family of algorithms that are easy to code in map-reduce
32. Label Propagation
User-track graph, edge weights = scrobbles:
[diagram: bipartite user-track graph, users U, V, W, X connected to tracks a-f, edge weights = scrobble counts]
33. Label Propagation
User nodes are labelled with scrobbled tracks:
[diagram: the same graph with a normalised label set at each user node, e.g. U: (a,0.2),(b,0.4),(c,0.4); V: (b,0.5),(d,0.5); W: (b,0.2),(d,0.3),(e,0.5); X: (a,0.3),(d,0.3),(e,0.4)]
34. Label Propagation
Propagate, accumulate, normalise:
[diagram: track node d accumulates the weighted label stripes sent by its user neighbours V and W and normalises them, giving (b,0.37),(d,0.47),(e,0.17); on the next iteration label e will propagate back to user V]
35. Label Propagation
After some iterations:
• labels at item nodes = similar items
• new labels at user nodes = recommendations
36. Map-Reduce Graph Algorithms
general approach assuming:
• no global state
• state at each node recomputed from scratch from incoming messages on each iteration
other examples:
• breadth-first search
• page rank
37. Map-Reduce Graph Algorithms
inputs:
• adjacency lists, state at each node
output:
• updated state at each node
[diagram: node U with edges of weight 2, 4, 4 to tracks a, b, c, stored as the adjacency list record U,[(a,2),(b,4),(c,4)]]
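In Dumbo terms a plausible encoding of these inputs combines each node's current labels with its adjacency list; this exact tuple layout is my assumption, not from the talk:

# one input record per node: (nodeID, (labels, adjacency list))
# labels: the label-weight pairs currently at the node
# adjacency list: (neighbour, edge weight) pairs
record = ("U", ([("a", 0.2), ("b", 0.4), ("c", 0.4)],  # labels
                [("a", 2), ("b", 4), ("c", 4)]))       # adjacency list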
38. Label Propagation
class PropagatingMapper:
    def map(self, nodeID, value):
        # value holds label-weight pairs
        # and adjacency list for node
        labels, adj_list = value
        for node, weight in adj_list:
            # send a "stripe" of label-weight
            # pairs to each neighbouring node
            msg = [(label, prob * weight)
                   for label, prob in labels]
            yield node, msg
39. Label Propagation
from collections import defaultdict

class Reducer:
    def reduce(self, nodeID, msgs):
        # accumulate incoming weights per label
        labels = defaultdict(lambda: 0)
        for msg in msgs:
            for label, w in msg:
                labels[label] += w
        # normalise, prune
        normalise(labels, MAX_LABELS_PER_NODE)
        yield nodeID, labels
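normalise() is not shown in the talk; a minimal sketch of what it plausibly does, pruning to the top max_labels labels and rescaling the weights to sum to 1:

def normalise(labels, max_labels):
    # keep only the heaviest labels
    top = sorted(labels.items(), key=lambda kv: kv[1], reverse=True)[:max_labels]
    total = sum(w for _, w in top)
    labels.clear()
    for label, w in top:
        labels[label] = w / total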
40. Label Propagation
Not map-reduce friendly:
• send graph over network on every iteration
• huge mapper output:
• mappers soon send MAX_LABELS_PER_NODE updates along every edge
• some reducers receive huge input:
• too slow if the reducer streams the data, OOM otherwise
• NB you can't partition real graphs to avoid this:
• many natural graphs are scale-free, e.g. in the AltaVista web graph the top 1% of nodes are adjacent to 53% of the edges
41. Problem Scenario 2b
Map-reduce unfriendly computation:
• shared memory
Examples:
• almost all machine learning:
• split training examples between machines
• all machines need to read/write many shared parameter values
43. Alternatives in the cloud
Graph Processing:
• GraphLab (CMU)
Task-specific:
• Yahoo! LDA
General:
• HPCC
• Spark (Berkeley)
44. Spark and Shark
In-memory cluster computing
pros:
• fast (Shark claims up to 100x faster than Hive)
• code in Scala or Java or Python
• can run on Hadoop YARN or Apache Mesos
• ideal for iterative algorithms, nearline analytics
• includes a Pregel clone & stream processing
cons:
• hardware requirements???
45. GraphLab
Distributed graph processing
pros:
• vertex-centric programming model
• handles true web-scale graphs
• many toolkits already:
• collaborative filtering, topic modelling, graphical models, machine vision, graph analysis
cons:
• new applications require non-trivial C++ coding
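To give a flavour of the vertex-centric model, here is a toy synchronous gather-apply-scatter loop in Python computing PageRank on a three-node graph; purely illustrative, since GraphLab itself is distributed C++ and this is not its API:

# toy graph: adjacency lists of out-edges
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = dict((v, 1.0) for v in edges)

for _ in range(20):
    # gather: each vertex sums incoming rank / out-degree
    gathered = dict((v, 0.0) for v in edges)
    for src, dsts in edges.items():
        for dst in dsts:
            gathered[dst] += rank[src] / len(edges[src])
    # apply: recompute each vertex's state from the gathered sum
    rank = dict((v, 0.15 + 0.85 * gathered[v]) for v in edges)
    # scatter: in the real model, changed vertices signal their
    # neighbours to run again; this toy version just iterates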
46. Word count in Spark
val file = spark.textFile("hdfs://input")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://output/wordcount")
47. Logistic regression in Spark
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
[ points remain in memory for all iterations ]
48. Alternatives on your laptop
Graph processing
• GraphChi (CMU)
Machine learning
• sofia-ml (Google)
• vowpal wabbit (Yahoo!, Microsoft)
49. GraphChi
Graph processing on your laptop
pros:
• still handles graphs with billions of edges
• graph structure can be modified at runtime
• Java/Scala ports under active development
• some toolkits available:
• collaborative filtering, graph analysis
cons:
• existing C++ toolkit code is hard to extend
50. vowpal wabbit
classification, regression, LDA, bandits, ...
pros:
• handles huge ("terafeature") training datasets
• very fast
• state of the art algorithms
• can run in distributed mode on Hadoop streaming
cons:
• hard-core documentation
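vw is driven from the command line; a hedged example of local training and prediction via Python's subprocess (the flags are standard vw options, the file names are made up):

import subprocess

# train a logistic model on data in vw's native text format
subprocess.check_call(["vw", "train.vw",
                       "--loss_function", "logistic", "-f", "model.vw"])
# score a test set with the saved model
subprocess.check_call(["vw", "test.vw", "-t",
                       "-i", "model.vw", "-p", "predictions.txt"])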
51. Take homes
Think before you use Hadoop
• use your laptop for most problems
• use a graph framework for graph data
Keep your Hadoop code simple
• if you're just querying data, use Hive
• if not, use a workflow framework
Check out the competition
• Spark and HPCC look impressive