Scalding
the not-so-basics
Konrad 'ktoso' Malawski
Scala Days 2014 @ Berlin
Konrad `@ktosopl` Malawski
typesafe.com
geecon.org
Java.pl / KrakowScala.pl
sckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl
meetup.com/Lambda-Lounge-Krakow
hAkker @
http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
How old is this guy?
Google MapReduce, paper: 2004
Hadoop (Yahoo impl): 2005
the Big Landscape
Hadoop
https://github.com/twitter/scalding

Scalding is “on top of” Cascading,
which is “on top of” Hadoop.

Summingbird is “on top of” Scalding or Storm,
which is “on top of” Cascading,
which is “on top of” Hadoop.

Spark (-streams) is a bit “separate” currently:
HDFS yes, MapReduce no. Possibly soon?!

http://www.cascading.org/
https://github.com/twitter/summingbird
http://storm.incubator.apache.org/
http://spark.apache.org/
this talk
Why?
Stuff > Memory

Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."

text
  .split(" ")
  .map(a => (a, 1))
  .groupBy(_._1)
  .map(a => (a._1, a._2.map(_._2).sum))

every single intermediate step lives in Memory
Why Scalding?
Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
“Field API”
map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }

// Int => Int

Scalding:

IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 }

// Int => Int
// must choose the type!
// 'number stays in the Pipe, 'doubled becomes available in the Pipe
mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }

data = null // “release reference”

Scalding:

IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 }

// 'doubled stays in the Pipe, 'number is removed
flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

val numbers = data flatMap { line => // String
  line.split(",")                    // Array[String]
} map { _.toInt }                    // List[Int]

Scalding:

TextLine(data)                              // like List[String]
  .flatMap('line -> 'word) { _.split(",") } // like List[String]
  .map('word -> 'number) { _.toInt }        // like List[Int]

Or fused into a single step:

val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt)       // Array[Int]
}

TextLine(data)
  .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int]
groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]

val groups = data groupBy { _ < 10 }

groups // Map[Boolean, List[Int]]

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) }

// groups all rows with an equal 'lessThanTen value,
// counting each group's size into 'lessThanTenCounts
groupBy

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }

// 'total = [3, 72]
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  ToolRunner.run(new Configuration, new scalding.Tool, args)

}

// args comes from the App trait
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}
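The slides leave tokenize as "implemented"; a minimal version (my sketch, not shown in the talk) could be:

def tokenize(text: String): Array[String] =
  text.toLowerCase
    .replaceAll("[^a-z0-9\\s]", "") // drop punctuation
    .split("\\s+")                  // split on whitespace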
(the core of the job: just 4 lines)
1 day in the life of a guy implementing Scalding jobs
“How much are my shops selling?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …
“Which are the top selling shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSold -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

x
List((5,146), (2,142), (3,32))

WAT!? Emits a scala.collection.List[_]
“What’s the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortWithTake(('shopId, 'totalSold) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 < r._2
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSold)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

Provide the Ordering explicitly, because an implicit Ordering is not enough for Tuple2 here.

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job = Happier me.
Reduce, these Monoids

interface:

trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:

Closure: (T, T) => T
∀ a, b ∈ T : a · b ∈ T

Associativity:
∀ a, b, c ∈ T : (a · b) · c = a · (b · c)
(a + b) + c == a + (b + c)

Identity element:
∃ z ∈ T : ∀ a ∈ T : z · a = a · z = a
z + a == a + z == a

Summing:

object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
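A quick sanity check of the three laws against IntSum (my sketch, not from the slides):

val (a, b, c) = (1, 2, 3)
// Closure: the result is again an Int
val ab: Int = IntSum.+(a, b)
// Associativity
assert(IntSum.+(IntSum.+(a, b), c) == IntSum.+(a, IntSum.+(b, c)))
// Identity element
assert(IntSum.+(IntSum.zero, a) == a && IntSum.+(a, IntSum.zero) == a)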
Monoid ops can start “Map-side”

Monoid ops can already start being computed map-side!

Examples: sum(), average(), sortWithTake(), histogram()

(map-side partial results, e.g.: bear, 2; car, 3; deer, 2; river, 2)
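Why associativity buys you map-side combining, as a plain-Scala sketch (mine, not from the slides): each mapper folds its share of words into partial counts, and the reduce side only merges the already-combined partials.

def combine(acc: Map[String, Int], word: String): Map[String, Int] =
  acc.updated(word, acc.getOrElse(word, 0) + 1)

// two "mappers", each combining locally:
val mapper1 = List("bear", "bear", "car", "car", "deer").foldLeft(Map.empty[String, Int])(combine)
val mapper2 = List("car", "deer", "river", "river").foldLeft(Map.empty[String, Int])(combine)

// the "reducer" merges partial results; associativity guarantees
// the grouping of additions does not change the answer:
val merged = (mapper1.keySet ++ mapper2.keySet).map { w =>
  w -> (mapper1.getOrElse(w, 0) + mapper2.getOrElse(w, 0))
}.toMap
// Map(bear -> 2, car -> 3, deer -> 2, river -> 2)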
Obligatory: “Go check out Algebird, NOW!” slide
https://github.com/twitter/algebird
ALGE-birds
BloomFilterMonoid

val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF = …

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)

val res = approxBool.isTrue
// res: Boolean = true

https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))!
.groupBy('shopId) {!
_.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { !
(bf: BF, itemId: String) => bf + itemId !
}!
}!
.map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }!
.map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }!
.discard('itemBloom)!
.write(Tsv(output, writeHeader = true))
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))!
.groupBy('shopId) {!
_.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { !
(bf: BF, itemId: String) => bf + itemId !
}!
}!
.map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }!
.map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }!
.discard('itemBloom)!
.write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst!
1!! ! ! false!! ! ! true!
2!! ! ! false!! ! ! true!
3!! ! ! false!! ! ! true!
4!! ! ! true! ! ! ! false!
5!! ! ! true! ! ! ! false!
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))!
.groupBy('shopId) {!
_.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { !
(bf: BF, itemId: String) => bf + itemId !
}!
}!
.map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }!
.map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }!
.discard('itemBloom)!
.write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst!
1!! ! ! false!! ! ! true!
2!! ! ! false!! ! ! true!
3!! ! ! false!! ! ! true!
4!! ! ! true! ! ! ! false!
5!! ! ! true! ! ! ! false!
Why not Set[String]? It would OutOfMemory.
BloomFilterMonoid
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))!
.groupBy('shopId) {!
_.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { !
(bf: BF, itemId: String) => bf + itemId !
}!
}!
.map(‘itemBloom -> 'hasSoldBeer) { b: BF => b.contains(“beer").isTrue }!
.map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }!
.discard('itemBloom)!
.write(Tsv(output, writeHeader = true))
shopId! hasSoldBeer!hasSoldWurst!
1!! ! ! false!! ! ! true!
2!! ! ! false!! ! ! true!
3!! ! ! false!! ! ! true!
4!! ! ! true! ! ! ! false!
5!! ! ! true! ! ! ! false!
ApproximateBoolean(true,0.9999580954658956)
Why not Set[String]? It would OutOfMemory.
Joins

that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)

that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that the # of rows in the bigger pipe > mappers * # of rows in the smaller pipe, where mappers is the number of mappers in the job.

The “usual”
Joins

val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
“map-side” join

that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:
Left > max(mappers, reducers) * Right
or: when the Left side is 3 orders of magnitude larger.
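To put rough numbers on that rule of thumb (my arithmetic, not the slide's): with 100 mappers, 50 reducers and a 10 000-row right side, joinWithTiny starts to pay off once the left side exceeds max(100, 50) * 10 000 = 1 000 000 rows.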
Skew Joins

val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys, according to the given replication strategy.
3. Join the replicated pipes together.
Where did my type-safety go?!

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34)
  at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)

“oh, right… We changed that file to be user names, not ids…”
Trap it!

Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

solves “dirty data”, no help for maintenance
Typed API
Typed APIs

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))

import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Must give a Type to each Field.

But mind the arity; with

TypedCsv[(String, String)](in, ('user1, 'user2, 'rel)) // Tuple arity: 2, Fields arity: 3

you get a “planning-time” exception:

Caused by: java.lang.IllegalArgumentException:
num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)
Typed APIs

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

}

Easier to reuse schemas now: not coupled by Field names, but still too magic for reuse… “_1”?

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { p: Person => p.name == "bob" }
  .write(TypedTsv(out))

}

TypedPipe[Person]
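For reference, a sketch of what such a reusable typed source could look like (the Person fields, path and helper name are my assumptions, not shown in the talk; assumes import com.twitter.scalding._):

case class Person(id: Long, name: String, rel: String)

trait Relationships {
  // hypothetical helper, mixed into the Job via "… with Relationships"
  def userRelationships(date: String): TypedPipe[Person] =
    TypedPipe.from(TypedTsv[(Long, String, String)](s"/data/relationships/$date"))
      .map { case (id, name, rel) => Person(id, name, rel) }
}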
Typed Joins

case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)

3-way-merge in 1 MR step
Do the DOT

> run pl.project13.oculus.job.WordCountJob \
    --local --tool.graph --input in --output out

writing DOT:
pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
pl.project13.oculus.job.WordCountJob0_steps.dot

> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot

(the rendered graph shows which parts of the flow run in the MAP phase and which in the REDUCE phase)
<3 Testing

class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }

}

Swap .run for .runHadoop to run the very same test through Hadoop:
run || runHadoop
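(run executes the job in Cascading's local mode; runHadoop pushes it through a local Hadoop runner, which also exercises serialization. To my knowledge both are worth running, since some bugs only surface in Hadoop mode.)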
“Parallelize all the batches!”
Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)
Scalding Re-Cap

TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))
(the whole job: just 4 lines)
Try it!

$ activator new activator-scalding
http://typesafe.com/activator/template/activator-scalding
Template by Dean Wampler
Loads Of Links
1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Danke!
Dzięki!
Thanks!
Gracias!
ありがとう!
ktoso @ typesafe.com
t: ktosopl / g: ktoso
blog: project13.pl

Open soucerers - jak zacząć swoją przygodę z open source
 
HBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceHBase RowKey design for Akka Persistence
HBase RowKey design for Akka Persistence
 
Need for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applicationsNeed for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applications
 
Ebay legacy-code-retreat
Ebay legacy-code-retreatEbay legacy-code-retreat
Ebay legacy-code-retreat
 
Android at-xsolve
Android at-xsolveAndroid at-xsolve
Android at-xsolve
 
Git tak po prostu (SFI version)
Git tak po prostu (SFI version)Git tak po prostu (SFI version)
Git tak po prostu (SFI version)
 
Scala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogueScala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogue
 
TDD drogą do oświecenia w Scali
TDD drogą do oświecenia w ScaliTDD drogą do oświecenia w Scali
TDD drogą do oświecenia w Scali
 
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
 
Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013
 
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
 
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
 
Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013
 
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and Akka
 
KrakDroid: Scala on Android
KrakDroid: Scala on AndroidKrakDroid: Scala on Android
KrakDroid: Scala on Android
 

Similar to Scalding - the not-so-basics @ ScalaDays 2014

Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?osfameron
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik ErlandsonDatabricks
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?
Lukasz Byczynski
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Workhorse Computing
 
学生向けScalaハンズオンテキスト
学生向けScalaハンズオンテキスト学生向けScalaハンズオンテキスト
学生向けScalaハンズオンテキストOpt Technologies
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science Chucheng Hsieh
 
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Michał Oniszczuk
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghStuart Roebuck
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...DataStax Academy
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 

Similar to Scalding - the not-so-basics @ ScalaDays 2014 (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
学生向けScalaハンズオンテキスト
学生向けScalaハンズオンテキスト学生向けScalaハンズオンテキスト
学生向けScalaハンズオンテキスト
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
 
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API
 

More from Konrad Malawski

Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018Konrad Malawski
 
Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018Konrad Malawski
 
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in'tScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in'tKonrad Malawski
 
State of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to comeState of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to comeKonrad Malawski
 
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCBuilding a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCKonrad Malawski
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldKonrad Malawski
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka StreamsKonrad Malawski
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsKonrad Malawski
 
Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!Konrad Malawski
 
End to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketEnd to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketKonrad Malawski
 
The Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOneThe Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOneKonrad Malawski
 
Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016Konrad Malawski
 
Krakow communities @ 2016
Krakow communities @ 2016Krakow communities @ 2016
Krakow communities @ 2016Konrad Malawski
 
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...Konrad Malawski
 
How Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemHow Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemKonrad Malawski
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldKonrad Malawski
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
 
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Konrad Malawski
 
2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japanese2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japaneseKonrad Malawski
 

More from Konrad Malawski (20)

Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018
 
Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018
 
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in'tScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
 
State of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to comeState of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to come
 
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCBuilding a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
 
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming World
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
 
Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!
 
End to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketEnd to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to Socket
 
The Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOneThe Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOne
 
Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016
 
Krakow communities @ 2016
Krakow communities @ 2016Krakow communities @ 2016
Krakow communities @ 2016
 
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
 
Zen of Akka
Zen of AkkaZen of Akka
Zen of Akka
 
How Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemHow Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM Ecosystem
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorld
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
 
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014
 
2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japanese2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japanese
 

Recently uploaded

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Scalding - the not-so-basics @ ScalaDays 2014

  • 1. Scalding the not-so-basics Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin
  • 2. Konrad `@ktosopl` Malawski typesafe.com geecon.org Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow hAkker @
  • 8. https://github.com/twitter/scalding Scalding is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/
  • 9. https://github.com/twitter/scalding Summingbird is “on top of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird
  • 10. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/
  • 11. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 12. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 13. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  • 14. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  • 15. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no Possibly soon?!
  • 16. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark has nothing to do with all this. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ -streams
  • 17. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ this talk
  • 18. Why?
  • 19. Stuff > Memory. Scala collections: fun, but memory bound!
    val text = "so many words... waaah! ..."

    text
      .split(" ")
      .map(a => (a, 1))
      .groupBy(_._1)
      .map(a => (a._1, a._2.map(_._2).sum))
  • 20.–24. (build-up of slide 19: each step, split, map, groupBy, map, gets stamped “in Memory”, because every intermediate collection is fully materialized on the heap; a pasteable version follows below)
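For reference, the build-up above can be pasted straight into a Scala REPL; a minimal sketch with made-up input:

    val text = "so many words so many so"
    val counts = text
      .split(" ")
      .map(a => (a, 1))
      .groupBy(_._1)
      .map(a => (a._1, a._2.map(_._2).sum))
    println(counts) // e.g. Map(words -> 1, many -> 2, so -> 3)

Every intermediate value here is a fully materialized collection, which is exactly why this style stops working once the input outgrows the heap.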
  • 25./26. Why Scalding? Word Count in Hadoop MR:
    package org.myorg;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    public class WordCount {

      public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }
  • 27.–34. (image-only slides, no text content)
  • 36. map
  • 37. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 38. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  • 39. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala:
  • 40. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe
  • 41. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe; stays in Pipe
  • 42. map. Scala: val data = 1 :: 2 :: 3 :: Nil; val doubled = data map { _ * 2 } // Int => Int. Scalding: IterableSource(data).map('number -> 'doubled) { n: Int => n * 2 } // Int => Int (here you must choose the type explicitly!)
  • 43. mapTo
  • 44. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala:
  • 45. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 46. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  • 47. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: “release reference”
  • 48. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe “release reference”
  • 49. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe; number is removed “release reference”
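A minimal runnable sketch contrasting the two operations (the job and output names are illustrative, not from the deck): map keeps the input field next to the new one, mapTo keeps only the result:

    import com.twitter.scalding._

    class MapVsMapToJob(args: Args) extends Job(args) {
      val data = List(1, 2, 3)

      IterableSource(data, 'number)
        .map('number -> 'doubled) { n: Int => n * 2 }   // pipe now carries 'number and 'doubled
        .write(Tsv(args("mapped")))

      IterableSource(data, 'number)
        .mapTo('number -> 'doubled) { n: Int => n * 2 } // pipe now carries only 'doubled
        .write(Tsv(args("mappedTo")))
    }

Dropping fields early with mapTo means less data gets shuffled between the map and reduce phases; it is the cluster-side analogue of releasing the reference.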
  • 51. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 52. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  • 53. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",") } // like List[String] .map('word -> 'number) { _.toInt } // like List[Int] Scala:
  • 55. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap Scala:
  • 56. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int] Scala:
  • 58. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  • 59. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  • 60. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala:
  • 61. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value
  • 62. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value 'lessThanTenCounts
  • 65. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 }
  • 66. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum[Long]('num -> 'total) }
  • 67. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum[Long]('num -> 'total) } 'total = [3, 72] (a complete job follows below)
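Put together as a complete job, the groupBy/sum example might look like this (a sketch; the class and output argument names are assumptions):

    import com.twitter.scalding._

    class LessThanTenJob(args: Args) extends Job(args) {
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.sum[Long]('num -> 'total) }
        .write(Tsv(args("output"))) // expected rows: (false, 72) and (true, 3); order may vary
    }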
  • 68./69. Main Class, the “Runner” (args comes from App):
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner
    import com.twitter.scalding

    object ScaldingJobRunner extends App {
      ToolRunner.run(new Configuration, new scalding.Tool, args)
    }
  • 70.–77. Word Count in Scalding, built up line by line; along the way the deck shortens groupBy('word) { group => group.size('count) } down to groupBy('word) { _.size }:
  • 78.
    class WordCountJob(args: Args) extends Job(args) {

      val inputFile = args("input")
      val outputFile = args("output")

      TextLine(inputFile)
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .groupBy('word) { _.size }
        .write(Tsv(outputFile))

      def tokenize(text: String): Array[String] = implemented
    }
    (a complete, compilable version with tokenize filled in follows below)
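Putting slides 68 through 78 together into one compilable file, with a hedged guess at tokenize (the deck only says “implemented”) and an illustrative invocation:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner
    import com.twitter.scalding._

    object ScaldingJobRunner extends App {
      ToolRunner.run(new Configuration, new Tool, args)
    }

    class WordCountJob(args: Args) extends Job(args) {
      val inputFile = args("input")
      val outputFile = args("output")

      TextLine(inputFile)
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .groupBy('word) { _.size }
        .write(Tsv(outputFile))

      // one plausible tokenizer; the slides leave this as "implemented"
      def tokenize(text: String): Array[String] =
        text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
    }

Run it with something along the lines of: hadoop jar my-assembly.jar WordCountJob --hdfs --input books.txt --output counts.tsv (the exact command depends on how you package the job; --local runs it without a cluster).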
  • 79. 1 day in the life of a guy implementing Scalding jobs
  • 80./81. “How much are my shops selling?”
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv(output))

    1   107
    2   144
    3   16
    …   …
  • 82./83. The same, with a header row:
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv(output, writeHeader = true))

    shopId  totalSoldItems
    1       107
    2       144
    3       16
    …       …
  • 84./85. “Which are the top selling shops?”
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .groupAll { _.sortBy('totalSoldItems).reverse }
      .write(Tsv(output, writeHeader = true))

    shopId  totalSoldItems
    2       144
    1       107
    3       16
    …       …
  • 86.–88. “What’s the top 3 shops?”
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
      .write(Tsv(output, writeHeader = true))

    shopId  totalSoldItems
    2       144
    1       107
    3       16

    SLOW! Instead do sortWithTake!
  • 89.–92. “What’s the top 3 shops?”
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .groupAll {
        _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
      }
      .write(Tsv(output, writeHeader = true))

    x
    List((5,146), (2,142), (3,32))

    WAT!? Emits scala.collection.List[_]
  • 93.–98. “What’s the top 3 shops?” Provide the Ordering explicitly, because the implicit Ordering is not enough for a Tuple2 here:
    Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .groupAll {
        _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
          (l: (Long, Long), r: (Long, Long)) => l._2 < r._2
        }
      }
      .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
        x: List[(Long, Long)] => x
      }
      .write(Tsv(output, writeHeader = true))

    shopId  totalSoldItems
    2       144
    1       107
    3       16

    MUCH faster Job = Happier me.
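Why is sortWithTake so much faster than groupAll plus sortBy plus take? Because “top k” forms a monoid: each mapper can keep its own partial top-k list, and those partial lists merge associatively into the global answer, so most of the work happens map-side instead of on a single reducer doing a total sort. A plain-Scala sketch of the idea (illustrative only, not the Scalding/Algebird implementation):

    // merging two top-k lists is associative, with the empty list as zero
    def mergeTopK(k: Int)(a: List[Long], b: List[Long]): List[Long] =
      (a ++ b).sorted(Ordering[Long].reverse).take(k)

    val zero    = List.empty[Long]
    val mapper1 = mergeTopK(3)(zero, List(146L, 32L, 7L))
    val mapper2 = mergeTopK(3)(zero, List(142L, 99L, 1L))
    val global  = mergeTopK(3)(mapper1, mapper2) // List(146, 142, 99)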
  • 101.–109. Reduce, these Monoids. The interface:
    trait Monoid[T] {
      def zero: T
      def +(a: T, b: T): T
    }
    plus 3 laws:
    Closure: (T, T) => T, that is ∀ a,b ∈ T: a·b ∈ T
    Associativity: ∀ a,b,c ∈ T: (a·b)·c = a·(b·c), i.e. (a + b) + c == a + (b + c)
    Identity element: ∃ z ∈ T: ∀ a ∈ T: z·a = a·z = a, i.e. z + a == a + z == a
  • 110. Summing:
    object IntSum extends Monoid[Int] {
      def zero = 0
      def +(a: Int, b: Int) = a + b
    }
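A quick REPL check that IntSum really obeys the three laws on sample values (a sketch; a real test would use property-based checking over many inputs):

    trait Monoid[T] {
      def zero: T
      def +(a: T, b: T): T
    }

    object IntSum extends Monoid[Int] {
      def zero = 0
      def +(a: Int, b: Int) = a + b
    }

    val (a, b, c) = (1, 2, 3)
    // closure holds by the signature: (Int, Int) => Int
    assert(IntSum.+(IntSum.+(a, b), c) == IntSum.+(a, IntSum.+(b, c))) // associativity
    assert(IntSum.+(IntSum.zero, a) == a)                              // left identity
    assert(IntSum.+(a, IntSum.zero) == a)                              // right identity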
  • 111. Monoid ops can start “Map-side”: with keys like (bear, 2), (car, 3), (deer, 2), (river, 2), monoid ops can already start being computed map-side!
  • 112. Monoid ops can start “Map-side”. Examples: average(), sum(), sortWithTake(), histogram()
  • 113. Obligatory: “Go check out Algebird, NOW!” slide https://github.com/twitter/algebird ALGE-birds
  • 114.–116. BloomFilterMonoid (https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL):
    val NUM_HASHES = 6
    val WIDTH = 32
    val SEED = 1
    val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

    val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
    val bf2 = bfMonoid.create("12", "45")
    val bf = bf1 ++ bf2
    // bf: com.twitter.algebird.BF

    val approxBool = bf.contains("1")
    // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)

    val res = approxBool.isTrue
    // res: Boolean = true
  • 117.–120. BloomFilterMonoid in a job. Why not Set[String]? It would OutOfMemory.
    Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
      .groupBy('shopId) {
        _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
          (bf: BF, itemName: String) => bf + itemName
        }
      }
      .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
      .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
      .discard('itemBloom)
      .write(Tsv(output, writeHeader = true))

    shopId  hasSoldBeer  hasSoldWurst
    1       false        true
    2       false        true
    3       false        true
    4       true         false
    5       true         false

    (each contains is approximate, e.g. ApproximateBoolean(true,0.9999580954658956))
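The foldLeft above is just the monoid pattern: start from bfMonoid.zero and + elements in. The same aggregation in plain Scala, runnable in a REPL with a 2014-era Algebird on the classpath (it matches the constructor the deck uses; the item names are made up):

    import com.twitter.algebird._

    val bfMonoid = new BloomFilterMonoid(6, 32, 1)
    val items = List("beer", "wurst", "brezel")

    val bloom = items.foldLeft(bfMonoid.zero) { (bf, item) => bf + item }

    bloom.contains("beer").isTrue  // true
    bloom.contains("vodka").isTrue // false with high probability; Bloom filters
                                   // can give false positives, never false negatives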
  • 121. Joins
  • 122.–124. Joins, the “usual” ones first:
    that.joinWithLarger('id1 -> 'id2, other)
    that.joinWithSmaller('id1 -> 'id2, other)

    that.joinWithTiny('id1 -> 'id2, other)
    joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
• 125-127. Joins
val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
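An inner join simply drops unmatched rows: the ownerless car (240, 11, "other") above never shows up. If unmatched rows should survive, the fields API also has left joins; a minimal sketch, assuming the same people and cars pipes (unmatched fields arrive as null):

people.leftJoinWithSmaller('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) { (name: String, car: String) =>
    // car is null for people without a matching row in cars
    if (car == null) s"Hello $name, you have no car"
    else s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)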
• 128. "map-side" join
that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:
Left > max(mappers, reducers) * Right
or when the Left side is 3 orders of magnitude larger.
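A quick sanity check (illustrative numbers, not from the talk): with 100 mappers/reducers and a right pipe of 10 000 rows, joinWithTiny only starts to make sense once the left pipe exceeds 100 * 10 000 = 1 000 000 rows.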
• 129-132. Skew Joins
val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys, according to the given replication strategy.
3. Join the replicated pipes together.
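The knobs trade sampling cost against estimate quality; a sketch with adjusted values (all numbers illustrative, parameter names assumed to match JoinAlgorithms), noting that scalding also ships an alternative replication strategy, SkewReplicationB:

followers
  .skewJoinWithSmaller('y1 -> 'y2, genders,
    sampleRate = 0.01,                 // sample 1% of rows for the key-frequency estimate
    reducers = 100,                    // spread replicated keys across more reducers
    replicator = SkewReplicationA(2))  // replicate hot keys more aggressively
  .write(Tsv("output"))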
• 133-136. Where did my type-safety go?!
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  ...
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  ...
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)

"oh, right… We changed that file to be user names, not ids…"
• 137-138. Trap it!
Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

solves "dirty data", no help for maintenance
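The offending tuples end up in the trap sink instead of killing the job, so they can be inspected after the run; a minimal sketch (same field layout assumed):

// read back whatever fell into the trap: the raw tuples that failed coercion
Tsv("errors", ('userId1, 'userId2, 'rel))
  .read
  .debug // prints each trapped tuple to the logs
  .write(Tsv("errors-readable"))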
• 140-142. Typed APIs
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Must give a Type to each Field
• 143-146. Typed APIs
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

// Tuple arity: 2, but the source declares 3 fields
TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

// Tuple arity: 3, matching the 3 declared fields
TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

The arity-2 version fails with a "planning-time" exception, before any data is read:

Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)
• 147-149. Typed APIs
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

// … with Relationships {
import TDsl._

userRelationships(date)
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))
// }

Easier to reuse schemas now.
Not coupled by Field names, but still too magic for reuse… "_1"?
• 150-151. Typed APIs
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

// … with Relationships {
import TDsl._

userRelationships(date) // TypedPipe[Person]
  .filter { p: Person => p.name == "bob" }
  .write(TypedTsv(out))
// }
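The slides never show Relationships itself; one plausible shape for such a reusable-schema trait is sketched below (hypothetical: Person's fields and the path layout are made up for illustration):

// hypothetical sketch, not from the talk
case class Person(name: String, friend: String, rel: Int)

trait Relationships {
  import TDsl._

  // one place that knows the schema; jobs just mix the trait in
  def userRelationships(date: String): TypedPipe[Person] =
    TypedPipe.from(TypedTsv[(String, String, Int)](s"relationships/$date"))
      .map { case (name, friend, rel) => Person(name, friend, rel) }
}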
• 152-154. Typed Joins
case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)

3-way-merge in 1 MR step
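If some users have no favs at all, an inner join silently drops them; the typed API also offers leftJoin, which keeps every key from the left group and yields an Option on the right. A minimal sketch along the same lines (the getOrElse default is an illustrative assumption):

users.groupBy(_.id)
  .leftJoin(favs.groupBy(_.byUser))
  .map { case (uid, (user, maybeFavs)) =>
    // users without favs survive, with an empty fav list as the default
    (user, maybeFavs.getOrElse(UserFavs(uid, Nil)))
  }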
• 155. Do the DOT
> run pl.project13.oculus.job.WordCountJob --local --tool.graph --input in --output out

writing DOT:
pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
pl.project13.oculus.job.WordCountJob0_steps.dot
• 157-159. Do the DOT
> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot

(the rendered graph shows the job's flow, with its MAP stage feeding the REDUCE stage)
• 162. <3 Testing
class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("kapi" -> 2) // "kapi" appears twice in the input
      }
      .run
      .finish
  }
}
• 169-170. <3 Testing
The same test runs through the Hadoop runner (local mode) instead of Cascading's in-memory local mode by swapping a single call:

      .runHadoop // instead of .run
      .finish

run || runHadoop
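Since each call returns the JobTest itself, both modes can be exercised in one spec; a sketch, assuming the test above:

JobTest(new WordCountJob(_))
  .arg("input", "inFile")
  .arg("output", "outFile")
  .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
  .sink[(String, Int)](Tsv("outFile")) { out =>
    out.toList should contain ("kapi" -> 2)
  }
  .run       // Cascading local mode
  .runHadoop // Hadoop local mode, exercises serialization
  .finish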
• 172-181. "Parallelize all the batches!"
- Feels much like Scala collections
- Local Mode thanks to Cascading
- Easy to add custom Taps
- Type Safe, when you want to
- Pure Scala
- Testing friendly
- Matrix API
- Efficient columnar storage (Parquet)
• 182-183. Scalding Re-Cap
TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))

(the whole job in 4 lines)
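tokenize is not defined on the recap slide; a minimal sketch of what such a helper might look like (the regex is an assumption, not from the talk):

// split a line into lowercase word tokens, dropping punctuation
def tokenize(text: String): Array[String] =
  text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")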
• 184. Try it!
$ activator new activator-scalding

http://typesafe.com/activator/template/activator-scalding
Template by Dean Wampler
• 185. Loads Of Links
1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/