Hadoop and beyond:
power tools for data mining
    Mark Levy, 13 March 2013
    Cloud Computing Module
          Birkbeck/UCL
Hadoop and beyond
Outline:
 • the data I work with
 • Hadoop without Java
 • Map-Reduce unfriendly algorithms
 • Hadoop without Map-Reduce
 • alternatives in the cloud
 • alternatives on your laptop
NB
• all software mentioned is Open Source
• won't cover key-value stores
• I don't use all of these tools
Last.fm: scrobbling
Last.fm: tagging
Last.fm: personalised radio
Last.fm: recommendations
Last.fm datasets
Core datasets:
 • 45M users, many active
 • 60M artists
 • 100M audio fingerprints
 • 600M tracks (hmm...)
 • 19M physical recordings
 • 3M distinct tags
 •  2.5M <user,item,tag> taggings per month
 •  1B <user,time,track> scrobbles per month
 • full user-track graph has ~50B edges
    (more often work with ~500M edges)
Problem Scenario 1
Need Hadoop, don't want Java:
  • need to build prototypes, fast
  • need to do interactive data analysis
  • want terse, highly readable code
   • improve maintainability
   • improve correctness
Hadoop without Java
 Some options:
  • Hive (Facebook)
  • Pig (Yahoo!)
  • Cascading (ok it's still Java...)
  • Scalding (Twitter)
  • Hadoop streaming (various)
not to mention 11 more listed here:
http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html
Apache Hive
SQL access to data on Hadoop
pros:
 • minimal learning curve
 • interactive shell
 • easy to check correctness of code
cons:
 • can be inefficient
 • hard to fix when it is
Word count in Hive
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH '/input' OVERWRITE INTO TABLE input;
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(line, ' ')) wTable as word
GROUP BY word;


[but would you use SQL to count words?]
Apache Pig
High level scripting language for Hadoop
pros:
 • more primitive operations than Hive (and UDFs)
 • more flexible than Hive
 • interactive shell
cons:
 • steeper learning curve than Hive
 • tempting to write longer programs but no code
  modularity beyond functions
Word count in Pig
A = load '/input';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into '/output/wordcount';


[ apply operations to "relations" (tuples) ]
Cascading
Java data pipelining for Hadoop
pros:
 • as flexible as Pig
 • uses a real programming language
 • ideal for longer workflows
cons:
 • new concepts to learn ("pipe","tap","sink",...)
 • still verbose (full wordcount ex. code > 150 lines)
Word count in Cascading
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, "/input");

Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, "/output/wordcount", SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);

Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
Scalding
Scala data pipelining for Hadoop
pros:
 • as flexible as Pig
 • uses a real programming language
 • much terser than Java
cons:
 • community still small (but in use at Twitter)
 • ???
Word count in Scalding
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
   .flatMap('line -> 'word){ line: String => line.split("""\s+""") }
   .groupBy('word){ _.size }
   .write(Tsv(args("output")))
}



[and a one-liner to run it]
Hadoop streaming
Map-reduce in any language
e.g. Dumbo wrapper for Python
pros:
 • use your favourite language for map-reduce
 • easy to mix local and cloud processing
cons:
 • limited community
 • limited functionality beyond map-reduce
Word count in Dumbo
import dumbo

def map(key, text):
    # ignore key; emit each word with count 1
    for word in text.split():
        yield word, 1

def reduce(word, counts):
    yield word, sum(counts)

if __name__ == "__main__":
    dumbo.run(map, reduce, combiner=reduce)


[and a one-liner to run it]
Problem Scenario 1b
Need Hadoop, don't want Java:
  • drive native code in parallel
E.g. audio analysis for:
  • beat locations, bpm
  • key estimation
  • chord sequence estimation
  • energy
  • music/speech?
  • ...
Audio Analysis
Problem:
 • millions of audio tracks on own dfs
 • long-running C++ analysis code
 • depends on numerous libraries
 • verbose output
Audio Analysis
Solution:
 • bash + Dumbo Hadoop streaming
Outline:
 • build C++ code
 • zip up binary and libs
 • send zipfile and some track IDs to each machine
 • extract and run binary in map task with
   subprocess.Popen()
Audio Analysis
import subprocess
import tarfile

class AnalysisMapper:
    def __init__(self):
        # unpack the shipped tarball (the slide's extract()) into ./bin
        tarfile.open("analyzer.tar.bz2").extractall("bin")

    def __call__(self, key, trackID):
        path = fetch_audio_file(trackID)  # defined elsewhere
        proc = subprocess.Popen(["bin/analyzer", path],
                                stdout=subprocess.PIPE)
        out, err = proc.communicate()
        yield trackID, out
Problem Scenario 2
Map-reduce unfriendly computation:
  • iterative algorithms on same data
  • huge mapper output ("map-increase")
  • curse of slowest reducer
Graph Recommendations
Random walk on user-item graph

                     4 
                      4


                    4 
                   4        4   4 
                              4      4              4 
                                                   t
                                                    4
           4 
            4
                     4 
                    4
    4                        4 
                             4 U
   4                                       4 
                                           4
                     4 
                    4
                                  4
                                 4       
                                      
Graph Recommendations
Many short routes from U to t ⇒ recommend!


                     4 
                      4


                    4 
                   4        4   4 
                              4      4              4 
                                                   t
                                                    4
           4 
            4
                     4 
                    4
    4                        4 
                             4 U
   4                                       4 
                                           4
                     4 
                    4
                                  4
                                 4       
                                      
Graph Recommendations
random walk is equivalent to
 • Label Propagation (Baluja et al., 2008)
 • belongs to a family of algorithms that
   are easy to code in map-reduce
Label Propagation
User-track graph, edge weights = scrobbles:

[diagram: bipartite user-track graph; user nodes U, V, W, X, track nodes
 a-f, edge weights = scrobble counts; e.g. U's edges are (a,2),(b,4),(c,4)]
Label Propagation
 User nodes are labelled with scrobbled tracks:

[diagram: the same graph with label distributions at the user nodes:
  U: (a,0.2),(b,0.4),(c,0.4)
  V: (b,0.5),(d,0.5)
  W: (b,0.2),(d,0.3),(e,0.5)
  X: (a,0.3),(d,0.3),(e,0.4)]
Label Propagation
 Propagate, accumulate, normalise:
[diagram: one update, at track node d:
  d receives 1 × {(b,0.5),(d,0.5)} from V
  and 3 × {(b,0.2),(d,0.3),(e,0.5)} from W
  ⇒ after accumulating and normalising: (b,0.37),(d,0.47),(e,0.17)
  on the next iteration e will propagate to user V]
Label Propagation
After some iterations:
 •  labels at item nodes = similar items
 •  new labels at user nodes = recommendations
    (one propagation step is sketched below)
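To make the update concrete, here is a minimal single-machine Python sketch of one propagation step. U's adjacency list (a,2),(b,4),(c,4) is taken from the adjacency-list example a few slides on; V's edges and all label values are illustrative, not the deck's exact numbers.

from collections import defaultdict

# toy bipartite graph: user -> [(track, scrobble_count), ...]
# U's edges match the adjacency-list example below; V's are made up
edges = {"U": [("a", 2), ("b", 4), ("c", 4)],
         "V": [("b", 1), ("c", 1), ("d", 2)]}

# current label distribution at each user node (illustrative)
labels = {"U": {"a": 0.2, "b": 0.4, "c": 0.4},
          "V": {"b": 0.5, "d": 0.5}}

# propagate: each user sends its labels along every edge, scaled by
# the edge weight; accumulate the incoming mass at each track node
incoming = defaultdict(lambda: defaultdict(float))
for user, adj in edges.items():
    for track, weight in adj:
        for label, prob in labels[user].items():
            incoming[track][label] += weight * prob

# normalise each track's accumulated labels so they sum to 1
track_labels = {t: {l: w / sum(acc.values()) for l, w in acc.items()}
                for t, acc in incoming.items()}
print(track_labels)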
Map-Reduce Graph Algorithms
general approach assuming:
 • no global state
 • state at node recomputed from scratch
   from incoming messages on each iteration

other examples:
 • breadth-first search
 • page rank
Map-Reduce Graph Algorithms
inputs:
 • adjacency lists, state at each node
output:
 • updated state at each node

[diagram: node U with weighted edges to tracks a, b and c;
 its adjacency list is U,[(a,2),(b,4),(c,4)]]
Label Propagation
class PropagatingMapper:
    def __call__(self, nodeID, value):
        # value holds the label-weight pairs
        # and the adjacency list for this node
        labels, adj_list = value
        for node, weight in adj_list:
            # send a "stripe" of label-weight
            # pairs to each neighbouring node
            msg = [(label, prob * weight)
                   for label, prob in labels]
            yield node, msg
Label Propagation
from collections import defaultdict

class Reducer:
    def __call__(self, nodeID, msgs):
        # accumulate
        labels = defaultdict(float)
        for msg in msgs:
            for label, w in msg:
                labels[label] += w
        # normalise, prune (helper sketched below)
        normalise(labels, MAX_LABELS_PER_NODE)
        yield nodeID, labels
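The normalise helper called above isn't shown in the deck; a minimal sketch of what it's assumed to do (prune to the top-k labels, then rescale the kept weights to sum to 1):

def normalise(labels, max_labels):
    # keep only the highest-weighted labels (prune)
    top = sorted(labels.items(), key=lambda kv: kv[1],
                 reverse=True)[:max_labels]
    total = sum(w for _, w in top)
    # rewrite the dict in place with rescaled weights
    labels.clear()
    for label, w in top:
        labels[label] = w / total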
Label Propagation
Not map-reduce friendly:
 •  send graph over network on every iteration
 •  huge mapper output:
    • mappers soon send MAX_LABELS_PER_NODE
      updates along every edge
 •  some reducers receive huge input:
    • too slow if reducer streams the data,
      OOM otherwise
 •  NB can't partition real graphs to avoid this
    • many natural graphs are scale-free, e.g. in the
      AltaVista web graph the top 1% of nodes are
      adjacent to 53% of the edges
Problem Scenario 2b
Map-reduce unfriendly computation:
  • shared memory

Examples:
 • almost all machine learning:
  • split training examples between machines
  • all machines need to read/write many shared
   parameter values (toy sketch below)
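A toy single-machine sketch of that access pattern (data and names are illustrative): each worker trains on its own shard of examples, but every update reads and writes the same weight vector w. Threads share memory here; a plain map-reduce job has no equivalent shared mutable state.

import threading

D = 4                  # number of features (toy size)
w = [0.0] * D          # shared model parameters
lock = threading.Lock()

def train_shard(shard, lr=0.1):
    # one SGD pass over this worker's shard of (features, target)
    for x, y in shard:
        with lock:     # every worker reads and writes the shared weights
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            for i in range(D):
                w[i] -= lr * err * x[i]

shards = [[([1, 0, 0, 0], 1.0), ([0, 0, 1, 0], 2.0)],
          [([0, 1, 0, 0], -1.0), ([0, 0, 0, 1], 0.5)]]
threads = [threading.Thread(target=train_shard, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(w)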
Hadoop without map-reduce
Graph processing
 • Apache Giraph (Facebook)

Hadoop YARN
 • Knitting Boar, Iterative Reduce
http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html

 • ???
Alternatives in the cloud
Graph Processing:
 • GraphLab (CMU)
Task-specific:
 • Yahoo! LDA
General:
 • HPCC
 • Spark (Berkeley)
Spark and Shark
In-memory cluster computing
pros:
 •  fast!! (Shark claims up to 100x speedups over Hive)
 •  code in Scala or Java or Python
 •  can run on Hadoop YARN or Apache Mesos
 •  ideal for iterative algorithms, nearline analytics
 •  includes a Pregel clone & stream processing

cons:
 •  hardware requirements???
GraphLab
Distributed graph processing
pros:
 •  vertex-centric programming model
 •  handles true web-scale graphs
 •  many toolkits already:
  • collaborative filtering, topic modelling, graphical models,
    machine vision, graph analysis

cons:
 •  new applications require non-trivial C++ coding
Word count in Spark
val file = spark.textFile("hdfs://input")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://output/wordcount")
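The deck notes that Spark code can also be written in Python; a hedged PySpark sketch of the same word count (paths illustrative, mirroring the Scala version above):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs://input")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://output/wordcount")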
Logistic regression in Spark
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)


[ points remain in memory for all iterations ]
Alternatives on your laptop
Graph processing
 • GraphChi (CMU)
Machine learning
 • sofia-ml (Google)
 • vowpal wabbit (Yahoo!, Microsoft)
GraphChi
Graph processing on your laptop
pros:
 •  still handles graphs with billions of edges
 •  graph structure can be modified at runtime
 •  Java/Scala ports under active development
 •  some toolkits available:
  • collaborative filtering, graph analysis

cons:
 •  existing C++ toolkit code is hard to extend
vowpal wabbit
classification, regression, LDA, bandits, ...
pros:
 •  handles huge ("terafeature") training datasets
 •  very fast
 •  state of the art algorithms
 •  can run in distributed mode on Hadoop streaming
cons:
 •  hard-core documentation
Take homes
Think before you use Hadoop
 •  use your laptop for most problems
 •  use a graph framework for graph data

Keep your Hadoop code simple
 •  if you're just querying data use Hive
 •  if not use a workflow framework

Check out the competition
 •  Spark and HPCC look impressive
Thanks for listening!
Goodbye         Hello



gamboviol@gmail.com
@gamboviol
