SHOULD I USE
SCALDING OR SCOOBI
OR SCRUNCH?
CHRIS SEVERS @ccsevers
eBay SEARCH SCIENCE
Hadoop Summit
June 26th, 2013
Obligatory Big Data Stuff
•Some fun facts about eBay from Hugh Williams’ blog:
–We have over 50 petabytes of data stored in
our Hadoop and Teradata clusters
–We have over 400 million items for sale
–We process more than 250 million user queries per day
–We serve over 100,000 pages per second
–Our users spend over 180 years in total every day looking
at items
–We have over 112 million active users
–We sold over US$75 billion in merchandise in 2012
•http://hughewilliams.com/2013/06/24/the-size-and-scale-of-ebay-2013-edition/
THE ANSWER
YES.
THANK YOU AND GOOD NIGHT
•Questions/comments?
•Thanks to Avi Bryant (@avibryant) for settling this issue.
NO REALLY, WHICH ONE SHOULD I USE?!
•All three (Scalding, Scoobi, Scrunch) are amazing projects
•They seem to be converging to a common API
•There are small differences, but if you can use one you will
likely be productive with the others very quickly
•They are all much better than the alternatives.
•Scalding: https://github.com/twitter/scalding, @scalding
–Authors: @posco @argyris @avibryant
•Scoobi: https://github.com/NICTA/scoobi
–Authors: @bmlever @etorreborre
•Scrunch: http://crunch.apache.org/scrunch.html
–Author: @josh_wills
THE AGENDA
1. Quick survey of the current landscape outside
Scalding, Scoobi, and Scrunch
2. A light comparison of Scalding, Scoobi, and Scrunch.
3. Some code samples
4. The future
THE ALTERNATIVES
I promise this part will be quick
VANILLA MAPREDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every whitespace-separated token in a line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    // Tells Hadoop which jar to ship to the cluster.
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
PIG
•Apache Pig is a really great tool for quick, ad-hoc data
analysis
•While we can do amazing things with it, I’m not sure we
should
•Anything complicated requires User Defined Functions
(UDFs)
•UDFs require a separate code base
•Now you have to maintain two separate languages for
no good reason
APACHE HIVE
•On previous slide: s/Pig/Hive/g
STREAMING
•Can be concise
•Easy to test
–cat myfile.txt | ./mymapper.sh | sort | ./myreducer.sh
•Same problems as vanilla MR when it comes to
multistage flows
CASCADING/CRUNCH
•Higher level abstractions are good
•Tell the API what you want to do, let it sort out the actual
series of MR jobs
•Very easy to do cogroup, join, multiple passes
•Still a bit too verbose
•Feels like shoehorning a fundamentally functional notion
into an imperative context
•If you can’t move away from Java, this is your best bet
LET’S COMPARE
This slide is bright yellow!
FEATURE COMPARISON
                      | Scalding                        | Scoobi                 | Scrunch
Data model            | Tuple or distributed collection | Distributed collection | Distributed collection
Has Algebird baked in | Yes                             | No                     | No
Is Java-free          | No                              | Yes                    | No
Backed by Cloudera    | No                              | No                     | Yes
Free                  | Yes                             | Yes                    | Yes
SOME SCALA CODE
val myLines = getStuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)
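The snippet above leaves `getStuff` and `write` undefined. Here is a minimal runnable sketch of the same idea using only the Scala standard library. The object name, the sample lines, and the tab-separated printing are assumptions made for illustration, not part of the original slide.

```scala
object LocalWordCount {
  // Hypothetical stand-in for the slide's `getStuff`: a few in-memory lines.
  val sampleLines: List[String] = List("hello hadoop summit", "hello scala")

  // Split each line on whitespace, group identical words, count each group.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Hypothetical stand-in for the slide's `write`: print word<TAB>count.
  def main(args: Array[String]): Unit =
    countWords(sampleLines).toSeq.sortBy(_._1).foreach {
      case (word, n) => println(s"$word\t$n")
    }
}
```

The shape (flatMap, groupBy, size) is the same one the Scalding, Scoobi, and Scrunch versions follow.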
SOME SCALDING CODE
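// Runs inside a Scalding Job; `path` and `output` are supplied by the job.
val myLines = TypedPipe.from(TextLine(path))
val myWords = myLines.flatMap(w => w.split(" "))
  .groupBy(identity)
  .size
myWords.write(TypedTsv[(String, Long)](output))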
SOME SCOOBI CODE
val lines = fromTextFile("hdfs://in/...")
val counts = lines.flatMap(_.split(" "))
.map(word => (word, 1))
.groupByKey
.combine(_+_)
persist(counts.toTextFile("hdfs://out/...", overwrite=true))
SOME SCRUNCH CODE
val pipeline = new Pipeline[WordCountExample]
def wordCount(fileName: String) = {
  pipeline.read(from.textFile(fileName))
    .flatMap(_.toLowerCase.split("\\W+"))
    .count
}
ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or
even worse, 5 hours after job submission time)
•Single language
–Scala is a fully functional programming language
•Productivity
–Since the code you write looks like collections code you can
use the Scala REPL to prototype
•Clarity
–Write code as a series of operations and let the job planner
smash it all together
BREAD AND BUTTER
•You can be effective within hours by just learning a few
simple ideas
–map
–flatMap
–filter
–groupBy
–reduce
–foldLeft
•Everything above takes a function as an argument.
map
•Does what you think it does
scala> val mylist = List(1,2,3)
mylist: List[Int] = List(1, 2, 3)
scala> mylist.map(x => x + 5)
res0: List[Int] = List(6, 7, 8)
flatMap
•Kind of like map
•Does a map then a flatten
scala> val mystrings = List("hello there", "hadoop summit")
mystrings: List[String] = List(hello there, hadoop summit)
scala> mystrings.map(x => x.split(" "))
res5: List[Array[String]] =
List(Array(hello, there), Array(hadoop, summit))
scala> mystrings.map(x => x.split(" ")).flatten
res6: List[String] = List(hello, there, hadoop, summit)
scala> mystrings.flatMap(x => x.split(" "))
res7: List[String] = List(hello, there, hadoop, summit)
filter
•Pretty obvious
scala> mystrings.filter(x => x.contains("hadoop"))
res8: List[String] = List(hadoop summit)
•Takes a predicate function
•Use this a lot
groupBy
•Puts items together using an arbitrary function
scala> mylist.groupBy(x => x % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] =
Map(false -> List(1, 3), true -> List(2))
scala> mylist.groupBy(x => x % 2)
res10: scala.collection.immutable.Map[Int,List[Int]] =
Map(1 -> List(1, 3), 0 -> List(2))
scala> mystrings.groupBy(x => x.length)
res11: scala.collection.immutable.Map[Int,List[String]] =
Map(11 -> List(hello there), 13 -> List(hadoop summit))
reduce
•Not necessarily what you think
•Signature: (T,T) => T
scala> mylist.reduce( (l,r) => l + r )
res12: Int = 6
scala> mystrings.reduce( (l,r) => l + r )
res13: String = hello therehadoop summit
scala> mystrings.reduce( (l,r) => l + " " + r )
res14: String = hello there hadoop summit
•In the case of Scalding/Scoobi/Scrunch, this happens on
the values after a group operation.
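To make "reduce on the values after a group operation" concrete, here is a small sketch on plain Scala collections. `reduceValuesByKey` is a hypothetical local helper written for this example, not a method from Scalding, Scoobi, or Scrunch.

```scala
object ReduceByKeySketch {
  // Group pairs by key, then collapse each key's values with a (V, V) => V function.
  def reduceValuesByKey[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs
      .groupBy { case (k, _) => k }
      .map { case (k, kvs) => (k, kvs.map(_._2).reduce(op)) }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: (category, items sold).
    val sales = List(("books", 3), ("toys", 5), ("books", 4))
    println(reduceValuesByKey(sales)(_ + _))                   // total per category
    println(reduceValuesByKey(sales)((a, b) => math.max(a, b))) // largest sale per category
  }
}
```

Every group here has at least one value, so calling `reduce` on it is safe; on an empty collection `reduce` throws.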
foldLeft
•This is a fancy reduce
•Signature: foldLeft[B](z: B)(op: (B, T) => B): B
•The input z is the initial value of the accumulator
scala> mylist.foldLeft(Set[Int]())((s,x) => s + x)
res15: scala.collection.immutable.Set[Int] =
Set(1, 2, 3)
scala> mylist.foldLeft(List[Int]())((xs, x) => x :: xs)
res16: List[Int] = List(3, 2, 1)
•Like reduce, this happens on the values after a groupBy
•Called slightly different things in Scoobi/Scrunch
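A sketch of the same fold applied to the values of each group, again on plain Scala collections. `foldValuesByKey` is a hypothetical helper for illustration; the three libraries expose this idea under their own names. Unlike reduce, the accumulator type B can differ from the value type.

```scala
object FoldByKeySketch {
  // Group pairs by key, then fold each key's values into an accumulator of type B.
  def foldValuesByKey[K, V, B](pairs: Seq[(K, V)])(zero: B)(op: (B, V) => B): Map[K, B] =
    pairs
      .groupBy { case (k, _) => k }
      .map { case (k, kvs) => (k, kvs.map(_._2).foldLeft(zero)(op)) }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: (user, page visited).
    val visits = List(("alice", "home"), ("bob", "search"), ("alice", "home"), ("alice", "item"))
    println(foldValuesByKey(visits)(Set.empty[String])(_ + _)) // distinct pages per user
    println(foldValuesByKey(visits)(0)((n, _) => n + 1))       // visit count per user
  }
}
```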
MONOIDS: WHY YOU SHOULD CARE ABOUT
MATH
•From Wikipedia:
–a monoid is an algebraic structure with a single associative
binary operation and an identity element.
•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filter, count-min sketch, HyperLogLog
–stochastic gradient descent
–moments of distributions
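The definition can be sketched in a few lines of Scala. The `Monoid` trait below is hand-rolled for illustration and is not Algebird's actual trait; `sumInChunks` is a hypothetical helper showing why associativity matters: partial results can be computed separately (as mappers or combiners would) and merged afterwards without changing the answer.

```scala
object MonoidSketch {
  // Hand-rolled for illustration: an identity element plus an associative operation.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  val stringConcat: Monoid[String] = new Monoid[String] {
    def zero: String = ""
    def plus(l: String, r: String): String = l + r
  }

  // Aggregate a whole collection in one pass.
  def sum[T](items: Seq[T])(m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Aggregate fixed-size chunks separately, then combine the partial results.
  def sumInChunks[T](items: Seq[T], chunkSize: Int)(m: Monoid[T]): T =
    sum(items.grouped(chunkSize).map(chunk => sum(chunk)(m)).toList)(m)

  def main(args: Array[String]): Unit = {
    println(sum(1 to 10)(intAddition))
    println(sumInChunks(1 to 10, 3)(intAddition))
    println(sumInChunks(List("a", "b", "c", "d", "e"), 2)(stringConcat))
  }
}
```

String concatenation is associative but not commutative, so the chunks must be merged in their original order, which `grouped` preserves here.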
MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi and Scrunch can use Algebird (or any other
monoid library) with almost no work
–combine { case (l,r) => monoid.plus(l,r) }
•Algebird handles tuples with ease
•Very easy to define monoids for your own types
•Algebird: https://github.com/twitter/algebird @algebird
–Authors: Lots!
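As a rough illustration of "define monoids for your own types" and of tuple handling, here is a self-contained sketch. The `Monoid` trait, the `Averaged` type, and `pairMonoid` are all written for this example and only mimic the idea; Algebird ships its own implementations.

```scala
object CustomMonoidSketch {
  // Hand-rolled trait for illustration, not Algebird's.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // A custom type: a running average kept as (count, total) so it can be merged.
  final case class Averaged(count: Long, total: Double) {
    def mean: Double = if (count == 0) 0.0 else total / count
  }

  val averagedMonoid: Monoid[Averaged] = new Monoid[Averaged] {
    def zero: Averaged = Averaged(0L, 0.0)
    def plus(l: Averaged, r: Averaged): Averaged =
      Averaged(l.count + r.count, l.total + r.total)
  }

  val longAddition: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // A monoid for pairs, built component-wise from two existing monoids.
  def pairMonoid[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }

  def main(args: Array[String]): Unit = {
    val avg = List(2.0, 4.0, 9.0).map(x => Averaged(1L, x)).reduce(averagedMonoid.plus)
    println(avg.mean)
  }
}
```

Averaging directly is not associative, but carrying (count, total) is, which is the usual trick for turning an aggregate into a monoid.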
THE FUTURE
It’s here now!
SPARK
•Spark is a system for executing general computation
graphs, not just MR
•The syntax looks very much like Scalding, Scoobi,
and Scrunch
–It inspired the API on a couple of them
•Spark runs on YARN as of the latest release
•Can cache intermediate data
–Iterative problems become much easier
•Developed by the AMPLab at UC Berkeley
GREAT, NOW I HAVE TO LEARN 4 THINGS
INSTEAD OF 3
•Scalding, Scoobi, and Scrunch seem to have all sprung
into being around the same time and independently of
each other
•Spark was around a little before that
•Do we really need 3 (or 4) very similar solutions?
•Wouldn’t it be nice if we could just pick one and all get
behind it?
•I was prepared to make a definitive statement about the
best one, but then I learned something new
HAVE CAKE AND EAT IT
•There is currently work being done on a common API
that spans Scalding, Scoobi, Scrunch and Spark
•Not much is implemented yet, but all 4 groups are talking
and working things out
•The main use case is already done
–After word count everything else is just academic
•Check it out here: https://github.com/jwills/driskill
•In the future* you’ll be able to write against this common
API and then decide which system you want to execute
the job on
–Think of choosing a list, a buffer or a vector
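The list/buffer/vector analogy can be shown with the standard library alone. This sketch says nothing about how the common API itself is designed; it only illustrates writing logic once against a shared interface (`scala.collection.Seq`) and picking the concrete implementation afterwards. The function name is made up for the example.

```scala
import scala.collection.mutable.ArrayBuffer

object CommonApiAnalogy {
  // Written once against the shared Seq interface; it does not care what backs it.
  def wordLengths(words: scala.collection.Seq[String]): scala.collection.Seq[Int] =
    words.map(_.length)

  def main(args: Array[String]): Unit = {
    // Same logic, three different "execution engines".
    println(wordLengths(List("hadoop", "summit")))
    println(wordLengths(Vector("hadoop", "summit")))
    println(wordLengths(ArrayBuffer("hadoop", "summit")))
  }
}
```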
HOW CAN WE HELP?
•Get involved
•If something bothers you, fix it
•If you want a new feature, try and build it
•Everyone involved is actually quite friendly
•You can build jars to run on your cluster and no one will
know there is Scala Inside™
CONCLUSION
We’re almost done!
THINGS TO TAKE AWAY
•MapReduce is a functional problem; we should use
functional tools
•You can increase productivity, safety, and maintainability
all at once with no downside
•Thinking of data flows in a functional way opens up
many new possibilities
•The community is awesome
THANKS! (FOR REAL THIS TIME)
•Questions/comments?
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Should I Use Scalding or Scoobi or Scrunch?

  • 1. SHOULD I USE SCALDING OR SCOOBI OR SCRUNCH? CHRIS SEVERS @ccsevers eBay SEARCH SCIENCE Hadoop Summit June 26th, 2013
  • 2. Obligatory Big Data Stuff
      • Some fun facts about eBay from Hugh Williams’ blog:
          – We have over 50 petabytes of data stored in our Hadoop and Teradata clusters
          – We have over 400 million items for sale
          – We process more than 250 million user queries per day
          – We serve over 100,000 pages per second
          – Our users spend over 180 years in total every day looking at items
          – We have over 112 million active users
          – We sold over US$75 billion in merchandize in 2012
      • http://hughewilliams.com/2013/06/24/the-size-and-scale-of-ebay-2013-edition/
  • 4. SHOULD I USE SCALDING OR SCOOBI OR SCRUNCH? YES.
  • 5. THANK YOU AND GOOD NIGHT
      • Questions/comments?
      • Thanks to Avi Bryant, @avibryant, for settling this issue.
  • 6. NO REALLY, WHICH ONE SHOULD I USE?!
      • All three (Scalding, Scoobi, Scrunch) are amazing projects
      • They seem to be converging to a common API
      • There are small differences, but if you can use one you will likely be productive with the others very quickly
      • They are all much better than the alternatives.
      • Scalding: https://github.com/twitter/scalding, @scalding
          – Authors: @posco @argyris @avibryant
      • Scoobi: https://github.com/NICTA/scoobi
          – Authors: @bmlever @etorreborre
      • Scrunch: http://crunch.apache.org/scrunch.html
          – Author: @josh_wills
  • 7. THE AGENDA
      1. Quick survey of the current landscape outside Scalding, Scoobi, and Scrunch
      2. A light comparison of Scalding, Scoobi, and Scrunch
      3. Some code samples
      4. The future
  • 8. THE ALTERNATIVES I promise this part will be quick
  • 9. VANILLA MAPREDUCE
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Reducer: sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  • 10. PIG
      • Apache Pig is a really great tool for quick, ad-hoc data analysis
      • While we can do amazing things with it, I’m not sure we should
      • Anything complicated requires User Defined Functions (UDFs)
      • UDFs require a separate code base
      • Now you have to maintain two separate languages for no good reason
  • 11. APACHE HIVE
      • On previous slide: s/Pig/Hive/g
  • 12. STREAMING
      • Can be concise
      • Easy to test
          – cat myfile.txt | ./mymapper.sh | sort | ./myreducer.sh
      • Same problems as vanilla MR when it comes to multistage flows
  • 13. CASCADING/CRUNCH
      • Higher level abstractions are good
      • Tell the API what you want to do, let it sort out the actual series of MR jobs
      • Very easy to do cogroup, join, multiple passes
      • Still a bit too verbose
      • Feels like shoehorning a fundamentally functional notion into an imperative context
      • If you can’t move away from Java, this is your best bet
  • 14. LET’S COMPARE This slide is bright yellow!
  • 15. FEATURE COMPARISON
                              Scalding                          Scoobi                   Scrunch
      Data model              Tuple or distributed collection   Distributed collection   Distributed collection
      Has Algebird baked in   Yes                               No                       No
      Is Java-free            No                                Yes                      No
      Backed by Cloudera      No                                No                       Yes
      Free                    Yes                               Yes                      Yes
  • 16. SOME SCALA CODE
      val myLines = getStuff
      val myWords = myLines.flatMap(w => w.split("\\s+"))
      val myWordsGrouped = myWords.groupBy(identity)
      val countedWords = myWordsGrouped.mapValues(x => x.size)
      write(countedWords)
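
The slide's sketch leaves getStuff and write undefined. A minimal runnable version on plain Scala collections might look like the following, assuming in-memory stand-ins for both (the object name, the sample lines, and the tab-separated output format are hypothetical, not from the talk). It uses map over the grouped entries rather than mapValues, which is deprecated in newer Scala versions.

```scala
object LocalWordCount {
  // Hypothetical stand-in for the slide's getStuff: a few in-memory lines.
  def getStuff: List[String] =
    List("hello there", "hadoop summit", "hello hadoop")

  // Hypothetical stand-in for the slide's write: print "word<TAB>count" lines.
  def write(counts: Map[String, Int]): Unit =
    counts.toList.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }

  def countWords(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)                  // drop empty tokens from leading whitespace
      .groupBy(identity)                   // word -> all occurrences of that word
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = write(countWords(getStuff))
}
```

The same chain of flatMap, groupBy and a size count is what the Scalding, Scoobi and Scrunch versions express against distributed collections.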
  • 17. SOME SCALDING CODE
      val myLines = TextLine(path)
      val myWords = myLines.flatMap(w => w.split(" "))
        .groupBy(identity)
        .size
      myWords.write(TypedTSV(output))
  • 18. SOME SCOOBI CODE
      val lines = fromTextFile("hdfs://in/...")
      val counts = lines.flatMap(_.split(" "))
        .map(word => (word, 1))
        .groupByKey
        .combine(_+_)
      persist(counts.toTextFile("hdfs://out/...", overwrite=true))
  • 19. SOME SCRUNCH CODE
      val pipeline = new Pipeline[WordCountExample]
      def wordCount(fileName: String) = {
        pipeline.read(from.textFile(fileName))
          .flatMap(_.toLowerCase.split("\\W+"))
          .count
      }
  • 20. ADVANTAGES
      • Type checking
          – Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)
      • Single language
          – Scala is a fully functional programming language
      • Productivity
          – Since the code you write looks like collections code you can use the Scala REPL to prototype
      • Clarity
          – Write code as a series of operations and let the job planner smash it all together
  • 21. BREAD AND BUTTER
      • You can be effective within hours by just learning a few simple ideas
          – map
          – flatMap
          – filter
          – groupBy
          – reduce
          – foldLeft
      • Everything above takes a function as an argument.
  • 22. map
      • Does what you think it does
      scala> val mylist = List(1,2,3)
      mylist: List[Int] = List(1, 2, 3)
      scala> mylist.map(x => x + 5)
      res0: List[Int] = List(6, 7, 8)
  • 23. flatMap
      • Kind of like map
      • Does a map then a flatten
      scala> val mystrings = List("hello there", "hadoop summit")
      mystrings: List[String] = List(hello there, hadoop summit)
      scala> mystrings.map(x => x.split(" "))
      res5: List[Array[String]] = List(Array(hello, there), Array(hadoop, summit))
      scala> mystrings.map(x => x.split(" ")).flatten
      res6: List[String] = List(hello, there, hadoop, summit)
      scala> mystrings.flatMap(x => x.split(" "))
      res7: List[String] = List(hello, there, hadoop, summit)
  • 24. filter
      • Pretty obvious
      scala> mystrings.filter(x => x.contains("hadoop"))
      res8: List[String] = List(hadoop summit)
      • Takes a predicate function
      • Use this a lot
  • 25. groupBy
      • Puts items together using an arbitrary function
      scala> mylist.groupBy(x => x % 2 == 0)
      res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3), true -> List(2))
      scala> mylist.groupBy(x => x % 2)
      res10: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1, 3), 0 -> List(2))
      scala> mystrings.groupBy(x => x.length)
      res11: scala.collection.immutable.Map[Int,List[String]] = Map(11 -> List(hello there), 13 -> List(hadoop summit))
  • 26. reduce
      • Not necessarily what you think
      • Signature: (T,T) => T
      scala> mylist.reduce( (l,r) => l + r )
      res12: Int = 6
      scala> mystrings.reduce( (l,r) => l + r )
      res13: String = hello therehadoop summit
      scala> mystrings.reduce( (l,r) => l + " " + r )
      res14: String = hello there hadoop summit
      • In the case of Scalding/Scoobi/Scrunch, this happens on the values after a group operation.
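
To make "reduce on the values after a group operation" concrete, here is a sketch on plain Scala collections. The helper name reduceByKey and the sample data are hypothetical and only illustrate the shape of the computation. They are not the API of Scalding, Scoobi or Scrunch.

```scala
object ReducePerKey {
  // Group (key, value) pairs by key, then reduce each group's values with op.
  // Groups produced by groupBy are never empty, so reduce is safe here.
  def reduceByKey[K, V](pairs: List[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs
      .groupBy { case (key, _) => key }
      .map { case (key, group) => key -> group.map(_._2).reduce(op) }

  def main(args: Array[String]): Unit = {
    val sales = List(("books", 3), ("toys", 5), ("books", 4))
    println(reduceByKey(sales)(_ + _)) // per-key totals
  }
}
```

Because the function has type (T,T) => T, the result for each key has the same type as the values being combined.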
  • 27. foldLeft
      • This is a fancy reduce
      • Signature: (z: B)((B,T) => B)
      • The input z is called the accumulator
      scala> mylist.foldLeft(Set[Int]())((s,x) => s + x)
      res15: scala.collection.immutable.Set[Int] = Set(1, 2, 3)
      scala> mylist.foldLeft(List[Int]())((xs, x) => x :: xs)
      res16: List[Int] = List(3, 2, 1)
      • Like reduce, this happens on the values after a groupBy
      • Called slightly different things in Scoobi/Scrunch
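
Unlike reduce, foldLeft lets the accumulator have a different type from the values. A sketch of a per-key fold on plain collections follows, assuming a hypothetical helper named foldByKey and made-up sample data. The frameworks expose this idea under their own names.

```scala
object FoldPerKey {
  // Fold each key's values into an accumulator of type B, starting from z.
  def foldByKey[K, V, B](pairs: List[(K, V)])(z: B)(op: (B, V) => B): Map[K, B] =
    pairs
      .groupBy { case (key, _) => key }
      .map { case (key, group) => key -> group.map(_._2).foldLeft(z)(op) }

  def main(args: Array[String]): Unit = {
    val visits = List(("alice", "home"), ("bob", "search"), ("alice", "item"), ("alice", "home"))
    // Values are Strings, but the accumulator is a Set[String] of distinct pages.
    println(foldByKey(visits)(Set.empty[String])((pages, page) => pages + page))
  }
}
```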
  • 28. MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
      • From Wikipedia:
          – a monoid is an algebraic structure with a single associative binary operation and an identity element.
      • Almost everything you want to do is a monoid
          – Standard addition of numeric types is the most common
          – List/map/set/string concatenation
          – Top k elements
          – Bloom filter, count-min sketch, hyperloglog
          – Stochastic gradient descent
          – Moments of distributions
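
The definition above can be written down directly. The following is a hand-rolled sketch, not Algebird's implementation. The trait and the instance names are assumptions for illustration, with the method names zero and plus chosen to match the monoid.plus call that appears later in the deck.

```scala
object MonoidSketch {
  // A monoid: an identity element (zero) plus an associative operation (plus).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Standard addition of Ints.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  // String concatenation.
  val stringConcat: Monoid[String] = new Monoid[String] {
    def zero: String = ""
    def plus(l: String, r: String): String = l + r
  }

  // Aggregate any list with any monoid; an empty list yields zero.
  def sum[T](xs: List[T])(m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    println(sum(List(1, 2, 3))(intAddition))
    println(sum(List("hello", " ", "hadoop"))(stringConcat))
  }
}
```

Associativity is what makes this useful for MapReduce-style jobs: partial results can be combined in any grouping, for example on the map side before the shuffle, without changing the final answer.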
  • 29. MORE MONOID STUFF
      • If you are aggregating, you are probably using a monoid
      • Scalding has Algebird and monoid support baked in
      • Scoobi and Scrunch can use Algebird (or any other monoid library) with almost no work
          – combine { case (l,r) => monoid.plus(l,r) }
      • Algebird handles tuples with ease
      • Very easy to define monoids for your own types
      • Algebird: https://github.com/twitter/algebird @algebird
          – Authors: Lots!
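
As an illustration of defining a monoid for your own type and of combining tuples component-wise, here is a self-contained sketch. It uses a hand-rolled Monoid trait rather than Algebird itself, and the Avg type, avgMonoid and tuple2 are hypothetical names introduced only for this example.

```scala
object CustomMonoid {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // A running average kept as (count, sum) so partial results can be merged.
  final case class Avg(count: Long, sum: Double) {
    def mean: Double = if (count == 0L) 0.0 else sum / count
  }

  val avgMonoid: Monoid[Avg] = new Monoid[Avg] {
    def zero: Avg = Avg(0L, 0.0)
    def plus(l: Avg, r: Avg): Avg = Avg(l.count + r.count, l.sum + r.sum)
  }

  // Build a monoid for pairs from a monoid for each component.
  def tuple2[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }

  def main(args: Array[String]): Unit = {
    val ratings = List(4.0, 5.0, 3.0).map(r => Avg(1L, r))
    // Same shape as the slide's combine { case (l,r) => monoid.plus(l,r) }.
    val total = ratings.foldLeft(avgMonoid.zero) { case (l, r) => avgMonoid.plus(l, r) }
    println(total.mean)
  }
}
```

Keeping the count and the sum separately, instead of the mean itself, is the design choice that makes the operation associative and therefore safe to apply to partial aggregates.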
  • 31. SPARK
      • Spark is a system for executing general computation graphs, not just MR
      • The syntax looks very much like Scalding, Scoobi, and Scrunch
          – It inspired the API on a couple of them
      • Spark runs on YARN as of the latest release
      • Can cache intermediate data
          – Iterative problems become much easier
      • Developed by the AMPLab at UC Berkeley
  • 32. GREAT, NOW I HAVE TO LEARN 4 THINGS INSTEAD OF 3
      • Scalding, Scoobi, and Scrunch seem to have all sprung into being around the same time and independently of each other
      • Spark was around a little before that
      • Do we really need 3 (or 4) very similar solutions?
      • Wouldn’t it be nice if we could just pick one and all get behind it?
      • I was prepared to make a definitive statement about the best one, but then I learned something new
  • 33. HAVE CAKE AND EAT IT
      • There is currently work being done on a common API that spans Scalding, Scoobi, Scrunch and Spark
      • Not much is implemented yet, but all 4 groups are talking and working things out
      • The main use case is already done
          – After word count everything else is just academic
      • Check it out here: https://github.com/jwills/driskill
      • In the future* you’ll be able to write against this common API and then decide which system you want to execute the job
          – Think of choosing a list, a buffer or a vector
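
The list/buffer/vector analogy can be shown with ordinary Scala collections: the logic is written once against a general interface and the caller picks the concrete implementation. This sketch only illustrates the analogy. It says nothing about how the common API itself is designed, and the object and method names are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

object ChooseBackend {
  // Written once against the general Seq interface, not a concrete collection.
  def wordCount(lines: scala.collection.Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val text = List("hello there", "hello hadoop")
    // The caller decides which implementation carries the data.
    println(wordCount(text))                   // a list
    println(wordCount(text.toVector))          // a vector
    println(wordCount(ArrayBuffer(text: _*)))  // a buffer
  }
}
```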
  • 34. HOW CAN WE HELP?
      • Get involved
      • If something bothers you, fix it
      • If you want a new feature, try and build it
      • Everyone involved is actually quite friendly
      • You can build jars to run on your cluster and no one will know there is Scala Inside™
  • 36. THINGS TO TAKE AWAY
      • MapReduce is a functional problem; we should use functional tools
      • You can increase productivity, safety, and maintainability all at once with no downside
      • Thinking of data flows in a functional way opens up many new possibilities
      • The community is awesome
  • 37. THANKS! (FOR REAL THIS TIME)
      • Questions/comments?