the not-so-basics
Konrad 'ktoso' Malawski	

Scala Days 2014 @ Berlin
Konrad `@ktosopl` Malawski / / @ London
hAkker @
How old is this guy?
Google MapReduce, paper: 2004
Hadoop (Yahoo impl): 2005
the Big Landscape
Scalding is “on top of” Cascading,
which is “on top of” Hadoop.
Summingbird is “on top of” Scalding or Storm.
Spark is a bit “separate” currently: HDFS yes, MapReduce no. Possibly soon?!
Spark has nothing to do with all this.
this talk
Stuff > Memory
Scala collections... fun, but memory bound!
val text = "so many words... waaah! ..."
  .split(" ")
  .map(a => (a, 1))
  .map(a => (a._1, …
Every step above: in Memory
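The snippet on the slide is cut off, so here is a minimal sketch of what an in-memory word count with plain Scala collections could look like. The object and method names are made up for illustration. The point is that every intermediate collection is fully materialized on the heap, which is exactly what stops working once "Stuff > Memory".

```scala
object InMemoryWordCount {
  // Counts words entirely in memory: the split array, the pairs and the
  // groups all live on the heap at the same time.
  def wordCount(text: String): Map[String, Int] =
    text
      .split(" ")                 // Array[String]
      .filter(_.nonEmpty)         // drop empty tokens from repeated spaces
      .map(word => (word, 1))     // Array[(String, Int)]
      .groupBy(_._1)              // Map[String, Array[(String, Int)]]
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

Calling `InMemoryWordCount.wordCount("so many words so many")` gives one entry per distinct word; the same shape of computation is what the Hadoop and Scalding versions distribute.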
Why Scalding?
Word Count in Hadoop MR
package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken()); // emit (word, 1) for every token
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
“Field API”
val data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 }
// Int => Int

.map('number -> 'doubled) { n: Int => n * 2 }
// Int => Int, must choose type!
// 'number: available in Pipe; 'doubled: stays in Pipe
var data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 }
data = null // “release reference”

.mapTo('number -> 'doubled) { n: Int => n * 2 }
// 'doubled stays in Pipe, 'number is removed
val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]
val numbers = data flatMap { line => // String
  line.split(",") // Array[String]
} map { _.toInt } // List[Int]

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",") } // like List[String]
  .map('word -> 'number) { word: String => word.toInt } // like List[Int]
val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]
val numbers = data flatMap { line => // String
  line.split(",").map(_.toInt) // Array[Int]
} // List[Int]

TextLine(data) // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",").map(_.toInt) } // like List[Int]
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) } // groups all with == value
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }
'total = [3, 72]
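For comparison, the same grouping and summing can be checked with plain Scala collections. This is only a local sanity check of the numbers, not Scalding code; the object name is made up for illustration.

```scala
object GroupSum {
  val data: List[Int] = List(1, 2, 30, 42)

  // Group by the same predicate as the pipe, then sum each group.
  val totals: Map[Boolean, Int] =
    data.groupBy(_ < 10).map { case (lessThanTen, nums) => (lessThanTen, nums.sum) }
}
```

The group of numbers below ten sums to 3 (1 + 2), the other group to 72 (30 + 42).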
Main Class - "Runner"
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {
  ToolRunner.run(new Configuration, new scalding.Tool, args) // args: from App
}
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size } // or, naming the count field: { group => group.size('count) }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = ??? // implemented elsewhere
}
1 day in the life of	

a guy implementing Scalding jobs
“How much are my shops selling?”
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …
“Which are the top selling shops?”
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …
“What are the top 3 shops?”
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!
“What are the top 3 shops?”
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

List((5,146), (2,142), (3,32))
Emits scala.collection.List[_]
“What are the top 3 shops?”
Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    // Provide the ordering explicitly because the implicit Ordering
    // is not enough for Tuple2 here.
    _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 > r._2 // larger totals first
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job.
Happier me.
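One plausible reason a "take while sorting" operation beats "sort everything, then take" is that only k candidates need to be kept around at any time. The following is a plain-Scala sketch of that idea using a bounded heap. It is not Scalding's implementation, and the names `TopK` and `topK` are made up for illustration.

```scala
import scala.collection.mutable

object TopK {
  // Keeps at most k (shopId, total) pairs in memory. The queue is ordered so
  // that the smallest total is dequeued first, i.e. it acts as a min-heap.
  def topK(totals: Iterator[(Long, Long)], k: Int): List[(Long, Long)] = {
    val minHeap = mutable.PriorityQueue.empty[(Long, Long)](
      Ordering.by[(Long, Long), Long](_._2).reverse)
    totals.foreach { entry =>
      minHeap.enqueue(entry)
      if (minHeap.size > k) minHeap.dequeue() // evict the current smallest
    }
    minHeap.toList.sortBy(-_._2) // largest total first
  }
}
```

With the totals from the slides plus one extra small shop, `TopK.topK(Iterator((1L, 107L), (2L, 144L), (3L, 16L), (4L, 5L)), 3)` keeps shops 2, 1 and 3 without ever holding more than four entries.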
Reduce, these Monoids
trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:
Closure: (T, T) => T
Associativity: (a + b) + c == a + (b + c)
Identity element: ∃z∈T: ∀a∈T: z·a = a·z = a, i.e. z + a == a + z == a

object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
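To make the laws concrete, here is a small self-contained sketch that reuses the trait from the slide, adds a second, hypothetical instance (`StringConcat`) and checks associativity and identity on sample values. The helper names `sumAll`, `isAssociative` and `hasIdentity` are assumptions made for this example.

```scala
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def +(a: T, b: T): T
  }

  object IntSum extends Monoid[Int] {
    def zero = 0
    def +(a: Int, b: Int) = a + b
  }

  // Hypothetical second instance: string concatenation with "" as zero.
  object StringConcat extends Monoid[String] {
    def zero = ""
    def +(a: String, b: String) = a + b
  }

  // Reduces any sequence using only zero and +.
  def sumAll[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.zero)((a, b) => m.+(a, b))

  // Spot checks of the laws for given sample values.
  def isAssociative[T](a: T, b: T, c: T)(m: Monoid[T]): Boolean =
    m.+(m.+(a, b), c) == m.+(a, m.+(b, c))

  def hasIdentity[T](a: T)(m: Monoid[T]): Boolean =
    m.+(m.zero, a) == a && m.+(a, m.zero) == a
}
```

`MonoidSketch.sumAll(Seq(1, 2, 3))(MonoidSketch.IntSum)` folds to 6, and an empty sequence folds to `zero`, which is why the identity element matters for reducers that may see no input.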
Monoid ops can start “Map-side”
Monoid ops can already start being computed map-side!
bear, 2
car, 3
deer, 2
river, 2
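Because the merge operation is associative, partial results can be combined in any grouping. The sketch below simulates that with plain collections: each "mapper" pre-aggregates its own split, and the "reducer" only merges the small partial maps. The splits used in the usage note are an assumption chosen to reproduce the counts shown above; all names are made up for illustration.

```scala
object MapSideCombine {
  type Counts = Map[String, Int]

  // Merges two partial count maps; associative, with an empty map as zero.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // "Map side": one mapper pre-aggregates the words of its own split.
  def mapSide(split: Seq[String]): Counts =
    split.foldLeft(Map.empty[String, Int])((acc, w) => merge(acc, Map(w -> 1)))

  // "Reduce side": only the partial maps travel over the network.
  def reduceSide(partials: Seq[Counts]): Counts =
    partials.foldLeft(Map.empty[String, Int])(merge)
}
```

For three splits `Seq("deer", "bear", "river")`, `Seq("car", "car", "river")` and `Seq("deer", "car", "bear")`, mapping each with `mapSide` and passing the results to `reduceSide` yields bear 2, car 3, deer 2, river 2.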
Obligatory: “Go check out Algebird, NOW!” slide
val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF = …

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean = …

val res = approxBool.isTrue
// res: Boolean = true
Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
      (bf: BF, itemName: String) => bf + itemName
    }
  }
  .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
  .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
  .write(Tsv(output, writeHeader = true))

shopId  hasSoldBeer  hasSoldWurst
1       false        true
2       false        true
3       false        true
4       true         false
5       true         false

Why not Set[String]? It would OutOfMemory.
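To illustrate why a Bloom filter stays small where a Set[String] would not, here is a deliberately tiny, self-contained sketch. It is not Algebird's `BF`: the class name `Bloom`, the use of `MurmurHash3` with the hash index as seed, and the method `mightContain` are all assumptions for this example. Memory is bounded by `width` bits no matter how many items are added, and membership answers can be false positives but never false negatives.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  final case class Bloom(numHashes: Int, width: Int, bits: BitSet = BitSet.empty) {
    // One bit position per hash function, always within [0, width).
    private def positions(item: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        val h = MurmurHash3.stringHash(item, seed)
        ((h % width) + width) % width
      }

    // Adding an item sets its bits; the filter never grows beyond `width` bits.
    def +(item: String): Bloom = copy(bits = positions(item).foldLeft(bits)(_ + _))

    // Union of two filters: this is the monoid "plus", the empty filter is "zero".
    def ++(other: Bloom): Bloom = copy(bits = bits | other.bits)

    // true means "probably present", false means "definitely absent".
    def mightContain(item: String): Boolean = positions(item).forall(bits.contains)
  }
}
```

For example, `(BloomSketch.Bloom(6, 32) + "beer").mightContain("beer")` is always true, while an empty filter answers false for every item.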
that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)
that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
The “usual”
val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice
“map-side” join
that.joinWithTiny('id1 -> 'id2, tinyPipe)
Choose this when the Left side is 3 orders of magnitude larger:
Left > max(mappers, reducers) * Right
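The idea behind a map-side join can be sketched with plain collections: the tiny side is loaded into a hash map (as if it were replicated to every mapper), and the big side is streamed past it without any shuffle. This is an illustration of the technique only, not how Scalding or Cascading implement `joinWithTiny`; the generic helper below is hypothetical.

```scala
object MapSideJoin {
  // Inner join: `tiny` is held in memory, `big` is only iterated once.
  def joinWithTiny[K, A, B](big: Iterator[(K, A)], tiny: Seq[(K, B)]): Iterator[(K, A, B)] = {
    val lookup: Map[K, Seq[B]] =
      tiny.groupBy(_._1).map { case (k, rows) => (k, rows.map(_._2)) }
    big.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).map(b => (k, a, b))
    }
  }
}
```

Joining cars keyed by owner id, `Iterator(1 -> "bmw", 2 -> "mercedes", 11 -> "other")`, against the tiny side `Seq(1 -> "hans", 2 -> "bob")` keeps the two cars whose owner is known and drops the third.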
Skew Joins
val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

  .skewJoinWithSmaller('y1 -> 'y2, in1, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)

1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
2. Use these estimated counts to replicate the join keys, according to the given replication strategy.
3. Join the replicated pipes together.
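The replication step can be sketched in plain Scala under strong simplifications. Sampling (step 1) is skipped: a hypothetical `hotKeys` set stands in for the estimated counts, and a single `replication` number stands in for the replication strategy. Left rows of a hot key are spread over several "salted" keys, right rows of that key are copied once per salt, and then an ordinary join runs on the salted key. This is not Scalding's `skewJoinWithSmaller` implementation, only an illustration of why replication spreads a hot key over several reducers.

```scala
object SkewJoinSketch {
  def skewJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)],
                        hotKeys: Set[K], replication: Int): Seq[(K, A, B)] = {
    // Step 2a: spread left rows of a hot key over `replication` salted keys.
    val saltedLeft: Seq[((K, Int), A)] = left.zipWithIndex.map { case ((k, a), i) =>
      val salt = if (hotKeys(k)) i % replication else 0
      ((k, salt), a)
    }
    // Step 2b: replicate right rows of a hot key once per salt.
    val saltedRight: Seq[((K, Int), B)] = right.flatMap { case (k, b) =>
      val salts = if (hotKeys(k)) 0 until replication else 0 until 1
      salts.map(s => ((k, s), b))
    }
    // Step 3: ordinary join on the salted key, then drop the salt again.
    val rightByKey = saltedRight.groupBy(_._1)
    saltedLeft.flatMap { case (saltedKey, a) =>
      rightByKey.getOrElse(saltedKey, Seq.empty).map { case (_, b) => (saltedKey._1, a, b) }
    }
  }
}
```

The result is the same as a plain inner join; only the distribution of the work changes, because rows of the hot key no longer share a single join key.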
Where did my type-safety go?!
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(…)
  at cascading.flow.planner.FlowStepJob.start(…)
  …
  at java.util.concurrent.ThreadPoolExecutor.runWorker(…)
  at java.util.concurrent.ThreadPoolExecutor$…
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  …
  at java.util.concurrent.ThreadPoolExecutor.runWorker(…)
  at java.util.concurrent.ThreadPoolExecutor$…
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(…)
  at java.lang.Long.parseLong(…)
  at java.lang.Long.parseLong(…)
  at cascading.tuple.coerce.LongCoerce.coerce(…)
  at cascading.tuple.coerce.LongCoerce.coerce(…)

“oh, right… We changed that file to be user names, not ids…”
Trap it!
Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
solves “dirty data”, no help for maintenance
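What a trap does conceptually can be shown with plain collections: rows that fail to convert are diverted to a side output instead of failing the whole job. This sketch uses `scala.util.Try` and is not Cascading's trap mechanism; the name `filterWithTrap` and the row shape are assumptions for illustration.

```scala
import scala.util.{Failure, Success, Try}

object TrapSketch {
  type RawRow = (String, String, String) // (userId1, userId2, rel) as read from the file

  // Returns (rows that parsed and passed the filter, rows diverted to the "trap").
  def filterWithTrap(rows: Seq[RawRow]): (Seq[(Long, String, String)], Seq[RawRow]) = {
    val parsed = rows.map(row => (row, Try(row._1.toLong)))
    val kept = parsed.collect {
      case ((_, user2, rel), Success(uid1)) if uid1 == 1337L => (uid1, user2, rel)
    }
    val trapped = parsed.collect { case (row, Failure(_)) => row }
    (kept, trapped)
  }
}
```

A row like `("bob", "3", "friend")` ends up in the trapped output, while the job still finishes. As the slide says, this handles dirty data but does nothing to keep the schema and the code in sync.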
Typed API
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }

import TDsl._
TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel)) // Tuple arity: 3
  .filter { _._1 == "bob" }
Must give Type to each Field

TypedCsv[(String, String)](in, ('user1, 'user2, 'rel)) // Tuple arity: 2
  .filter { _._1 == "bob" }

Caused by: java.lang.IllegalArgumentException:
  num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(…)
“planning-time” exception
Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }

// … with Relationships {
import TDsl._
  .filter { _._1 == "bob" }
Easier to reuse schemas now.
Not coupled by Field names, but still too magic for reuse… “_1”?

// … with Relationships {
import TDsl._
userRelationships(date)
  .filter { p: Person => … == "bob" }
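The benefit of filtering on a case class rather than on `_1` can be sketched without Scalding at all. The schema below (`Relationship` with `user1`, `user2`, `rel`) is hypothetical, as are `parse` and `fromBob`; the slides do not show the real `Person` type. The point is that a renamed or retyped field becomes a compile error instead of a runtime surprise.

```scala
import scala.util.Try

object TypedRows {
  // Hypothetical schema; field names are assumptions for illustration.
  final case class Relationship(user1: String, user2: String, rel: Int)

  // Parses one comma-separated line; malformed lines yield None.
  def parse(line: String): Option[Relationship] = line.split(",") match {
    case Array(u1, u2, rel) => Try(Relationship(u1, u2, rel.trim.toInt)).toOption
    case _                  => None
  }

  // The filter refers to a named, typed field instead of a tuple position.
  def fromBob(lines: Seq[String]): Seq[Relationship] =
    lines.flatMap(line => parse(line).toList).filter(_.user1 == "bob")
}
```

For input lines `"bob,alice,1"`, `"carol,bob,2"` and `"bob,dave,x"`, only the first one survives: the second has a different `user1`, the third does not parse.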
Typed Joins
case class UserName(id: Long, handle: String)!
case class UserFavs(byUser: Long, favs: List[Long])!
case class UserTweets(byUser: Long, tweets: List[Long])!
def users: TypedSource[UserName]!
def favs: TypedSource[UserFavs]!
def tweets: TypedSource[UserTweets]!
def output: TypedSink[(UserName, UserFavs, UserTweets)]!
.map { case (uid, ((user, favs), tweets)) =>!
(user, favs, tweets)!
} !

in 1 MR step
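The nested shape `(uid, ((user, favs), tweets))` is what results from joining three datasets keyed by the same user id, one join after another. A minimal in-memory sketch of that shape, with plain `Map`s standing in for grouped pipes; the `joinByKey` helper and the sample rows are hypothetical, and nothing here is Scalding API or involves MapReduce steps:

```scala
// In-memory analogy only: Maps stand in for pipes grouped by user id.
object TypedJoinSketch {
  final case class UserName(id: Long, handle: String)
  final case class UserFavs(byUser: Long, favs: List[Long])
  final case class UserTweets(byUser: Long, tweets: List[Long])

  // Hypothetical helper: inner join of two keyed maps on their shared key.
  def joinByKey[K, A, B](left: Map[K, A], right: Map[K, B]): Map[K, (A, B)] =
    left.collect { case (k, a) if right.contains(k) => k -> (a, right(k)) }

  // Hypothetical sample rows; user 2 has no favs and drops out of the join.
  val userRows  = List(UserName(1L, "ktoso"), UserName(2L, "bob"))
  val favRows   = List(UserFavs(1L, List(10L, 11L)))
  val tweetRows = List(UserTweets(1L, List(100L)), UserTweets(2L, List(200L)))

  val usersById    = userRows.map(u => u.id -> u).toMap
  val favsByUser   = favRows.map(f => f.byUser -> f).toMap
  val tweetsByUser = tweetRows.map(t => t.byUser -> t).toMap

  // Two joins on the same key give the nested tuple from the slide,
  // which the final map flattens into a triple.
  val joined: List[(UserName, UserFavs, UserTweets)] =
    joinByKey(joinByKey(usersById, favsByUser), tweetsByUser)
      .toList
      .map { case (uid, ((user, favs), tweets)) => (user, favs, tweets) }
}
```

Because both joins use the same key, the whole thing is a single grouping by user id followed by a flattening `map`.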
> run pl.project13.oculus.job.WordCountJob
  --local --tool.graph --input in --output out
writing DOT:
writing Steps DOT:
Do the DOT
> dot -Tpng
<3 Testing
import com.twitter.scalding._
import org.scalatest.FlatSpec
import org.scalatest.matchers.ShouldMatchers
class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {
  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        // "kapi" occurs twice in the input line
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }
}
<3 Testing
run || runHadoop
“Parallelize all the batches!”
Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)
Scalding Re-Cap
.flatMap('line -> 'word) { line: String => tokenize(line) }
.groupBy('word) { _.size }
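The same two-step pipeline can be written against ordinary Scala collections, which is why the Scalding version feels familiar. A minimal sketch, assuming a hypothetical `tokenize` that lower-cases the line and splits on non-word characters (the definition used in the talk is not shown here):

```scala
// Plain-collections analogy of the flatMap + groupBy word count.
object WordCountSketch {
  // Hypothetical tokenizer; the talk's own definition is not shown.
  def tokenize(line: String): List[String] =
    line.toLowerCase.split("\\W+").toList.filter(_.nonEmpty)

  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)        // one element per word
      .groupBy(identity)        // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, `WordCountSketch.wordCount(List("kapi kapi pi pu po"))` maps "kapi" to 2 and each of the other words to 1. Unlike the Scalding job, this version needs all of its input in memory.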
$ activator new activator-scalding
Try it!
Template by Dean Wampler
Loads Of Links











ktoso @
t: ktosopl / g: ktoso

More Related Content

What's hot

Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? DataWorks Summit
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesRussell Spitzer
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
Spark Streaming, Machine Learning and streaming API.
Spark Streaming, Machine Learning and streaming API.Spark Streaming, Machine Learning and streaming API.
Spark Streaming, Machine Learning and streaming API.Sergey Zelvenskiy
Hive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDHive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDSATOSHI TAGOMORI
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandranickmbailey

What's hot (20)

Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch?
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Spark Streaming, Machine Learning and streaming API.
Spark Streaming, Machine Learning and streaming API.Spark Streaming, Machine Learning and streaming API.
Spark Streaming, Machine Learning and streaming API.
Hive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDHive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TD
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra

Viewers also liked

Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceLivePerson
Monoids monoids everywhere
Monoids monoids everywhereMonoids monoids everywhere
Monoids monoids everywhereKevin Faro
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsGilles Louppe
JavaOne 2013: Java 8 - The Good Parts
JavaOne 2013: Java 8 - The Good PartsJavaOne 2013: Java 8 - The Good Parts
JavaOne 2013: Java 8 - The Good PartsKonrad Malawski
Open soucerers - jak zacząć swoją przygodę z open source
Open soucerers - jak zacząć swoją przygodę z open sourceOpen soucerers - jak zacząć swoją przygodę z open source
Open soucerers - jak zacząć swoją przygodę z open sourceKonrad Malawski
HBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceHBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceKonrad Malawski
Need for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applicationsNeed for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applicationsKonrad Malawski
Ebay legacy-code-retreat
Ebay legacy-code-retreatEbay legacy-code-retreat
Ebay legacy-code-retreatKonrad Malawski
Git tak po prostu (SFI version)
Git tak po prostu (SFI version)Git tak po prostu (SFI version)
Git tak po prostu (SFI version)Konrad Malawski
Scala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogueScala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogueKonrad Malawski
TDD drogą do oświecenia w Scali
TDD drogą do oświecenia w ScaliTDD drogą do oświecenia w Scali
TDD drogą do oświecenia w ScaliKonrad Malawski
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)Konrad Malawski
Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013Konrad Malawski
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRKKonrad Malawski
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsKonrad Malawski
Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013Konrad Malawski
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaKonrad Malawski
KrakDroid: Scala on Android
KrakDroid: Scala on AndroidKrakDroid: Scala on Android
KrakDroid: Scala on AndroidKonrad Malawski

Viewers also liked (20)

Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduce
Monoids monoids everywhere
Monoids monoids everywhereMonoids monoids everywhere
Monoids monoids everywhere
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random Forests
JavaOne 2013: Java 8 - The Good Parts
JavaOne 2013: Java 8 - The Good PartsJavaOne 2013: Java 8 - The Good Parts
JavaOne 2013: Java 8 - The Good Parts
Open soucerers - jak zacząć swoją przygodę z open source
Open soucerers - jak zacząć swoją przygodę z open sourceOpen soucerers - jak zacząć swoją przygodę z open source
Open soucerers - jak zacząć swoją przygodę z open source
HBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceHBase RowKey design for Akka Persistence
HBase RowKey design for Akka Persistence
Need for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applicationsNeed for Async: Hot pursuit for scalable applications
Need for Async: Hot pursuit for scalable applications
Ebay legacy-code-retreat
Ebay legacy-code-retreatEbay legacy-code-retreat
Ebay legacy-code-retreat
Android at-xsolve
Android at-xsolveAndroid at-xsolve
Android at-xsolve
Git tak po prostu (SFI version)
Git tak po prostu (SFI version)Git tak po prostu (SFI version)
Git tak po prostu (SFI version)
Scala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogueScala dsls-dissecting-and-implementing-rogue
Scala dsls-dissecting-and-implementing-rogue
TDD drogą do oświecenia w Scali
TDD drogą do oświecenia w ScaliTDD drogą do oświecenia w Scali
TDD drogą do oświecenia w Scali
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
[Tokyo Scala User Group] Akka Streams & Reactive Streams (0.7)
Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013Android my Scala @ JFokus 2013
Android my Scala @ JFokus 2013
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
100th SCKRK Meeting - best software engineering papers of 5 years of SCKRK
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013Disrupt 2 Grow - Devoxx 2013
Disrupt 2 Grow - Devoxx 2013
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and Akka
KrakDroid: Scala on Android
KrakDroid: Scala on AndroidKrakDroid: Scala on Android
KrakDroid: Scala on Android

Similar to Scalding - the not-so-basics @ ScalaDays 2014

Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?osfameron
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik ErlandsonDatabricks
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?
Lukasz Byczynski
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Workhorse Computing
学生向けScalaハンズオンテキストOpt Technologies
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science Chucheng Hsieh
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Michał Oniszczuk
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghStuart Roebuck
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...DataStax Academy
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev

Similar to Scalding - the not-so-basics @ ScalaDays 2014 (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API

More from Konrad Malawski

Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018Networks and Types - the Future of Akka @ ScalaDays NYC 2018
Networks and Types - the Future of Akka @ ScalaDays NYC 2018Konrad Malawski
Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018Akka Typed (quick talk) - JFokus 2018
Akka Typed (quick talk) - JFokus 2018Konrad Malawski
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in'tScalaSwarm 2017 Keynote: Tough this be madness yet theres method in't
ScalaSwarm 2017 Keynote: Tough this be madness yet theres method in'tKonrad Malawski
State of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to comeState of Akka 2017 - The best is yet to come
State of Akka 2017 - The best is yet to comeKonrad Malawski
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCBuilding a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCKonrad Malawski
Akka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldAkka-chan's Survival Guide for the Streaming World
Akka-chan's Survival Guide for the Streaming WorldKonrad Malawski
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka StreamsKonrad Malawski
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsKonrad Malawski
Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!Reactive Streams, j.u.concurrent & Beyond!
Reactive Streams, j.u.concurrent & Beyond!Konrad Malawski
End to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketEnd to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketKonrad Malawski
The Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOneThe Cloud-natives are RESTless @ JavaOne
The Cloud-natives are RESTless @ JavaOneKonrad Malawski
Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016Akka Streams in Action @ ScalaDays Berlin 2016
Akka Streams in Action @ ScalaDays Berlin 2016Konrad Malawski
Krakow communities @ 2016
Krakow communities @ 2016Krakow communities @ 2016
Krakow communities @ 2016Konrad Malawski
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...Konrad Malawski
How Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemHow Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemKonrad Malawski
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldKonrad Malawski
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Konrad Malawski
2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japanese2014 akka-streams-tokyo-japanese
2014 akka-streams-tokyo-japaneseKonrad Malawski

More from Konrad Malawski (20)

Scalding - the not-so-basics @ ScalaDays 2014

  • 1. Scalding the not-so-basics Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin
  • 2. Konrad `@ktosopl` Malawski / / @ London hAkker @
  • 8. Scalding is “on top of” Cascading, which is “on top of” Hadoop
  • 9. Summingbird is “on top of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop
  • 10. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop
  • 11. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently.
  • 12. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. HDFS yes, MapReduce no
  • 13. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently.
  • 14. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. HDFS yes, MapReduce no
  • 15. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. HDFS yes, MapReduce no. Possibly soon?!
  • 16. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark has nothing to do with all this. -streams
  • 17. Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop. This talk: Scalding.
  • 18. Why?
  • 19-24. Stuff > Memory. Scala collections... fun, but memory bound!
      val text = "so many words... waaah! ..."

      text
        .split(" ")
        .map(a => (a, 1))
        .groupBy(_._1)
        .map(a => (a._1,
      Annotations (one added per slide): each of the five steps is marked “in Memory”.
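The last line of the slide's snippet is cut off in the transcript. A minimal sketch of a complete in-memory word count is given here, assuming the truncated step simply sums the ones per word; the object and method names are illustrative and not from the talk. Every intermediate collection lives on the heap of one JVM, which is the limitation the slide points at.

```scala
object InMemoryWordCount {
  // Counts words with plain Scala collections; everything stays in memory.
  def wordCount(text: String): Map[String, Int] =
    text
      .split(" ")                       // Array[String]
      .filter(_.nonEmpty)               // skip empty tokens from repeated spaces
      .map(word => (word, 1))           // Array[(String, Int)]
      .groupBy(_._1)                    // Map[String, Array[(String, Int)]]
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

For example, `InMemoryWordCount.wordCount("so many words so many")` counts "so" and "many" twice and "words" once.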
  • 25-26. Why Scalding? Word Count in Hadoop MR
      package org.myorg;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      import java.io.IOException;
      import java.util.Iterator;
      import java.util.StringTokenizer;

      public class WordCount {

        // Mapper: emits (word, 1) for every token of every input line.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }

        // Reducer (also used as combiner): sums the counts for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf);
        }
      }
  • 36. map
  • 37-42. map
      Scala:
      val data = 1 :: 2 :: 3 :: Nil

      val doubled = data map { _ * 2 }

      // Int => Int

      Scalding:
      IterableSource(data, 'number)
        .map('number -> 'doubled) { n: Int => n * 2 }

      // Int => Int
      Annotations: “available in Pipe”; “stays in Pipe”; “must choose type!”
  • 43. mapTo
  • 44-49. mapTo
      Scala:
      var data = 1 :: 2 :: 3 :: Nil
      val doubled = data map { _ * 2 }
      data = null // “release reference”

      Scalding:
      IterableSource(data, 'number)
        .mapTo('number -> 'doubled) { n: Int => n * 2 }
      Annotations: doubled stays in Pipe; number is removed
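The difference between map and mapTo on the slides above can be sketched without Scalding. The toy model below is an assumption made for illustration only: a row is a `Map[String, Int]` from field name to value, and `mapField` / `mapToField` are hypothetical helpers, not Scalding API. It only shows which fields survive each operation.

```scala
object FieldModel {
  // Toy stand-in for a tuple in a Pipe: field name -> value.
  type Row = Map[String, Int]

  // Like map('from -> 'to): the new field is added, existing fields stay.
  def mapField(rows: List[Row], from: String, to: String)(f: Int => Int): List[Row] =
    rows.map(row => row + (to -> f(row(from))))

  // Like mapTo('from -> 'to): only the new field is kept, the rest is dropped.
  def mapToField(rows: List[Row], from: String, to: String)(f: Int => Int): List[Row] =
    rows.map(row => Map(to -> f(row(from))))
}
```

With rows holding a single `number` field, `mapField` yields rows with both `number` and `doubled`, while `mapToField` yields rows with `doubled` only.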
  • 51-53. flatMap
      Scala:
      val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

      val numbers = data flatMap { line => // String
        line.split(",") // Array[String]
      } map { _.toInt } // List[Int]

      Scalding:
      TextLine(data) // like List[String]
        .flatMap('line -> 'word) { _.split(",") } // like List[String]
        .map('word -> 'number) { _.toInt } // like List[Int]
  • 55-56. flatMap
      Scala:
      val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String]

      val numbers = data flatMap { line => // String
        line.split(",").map(_.toInt) // Array[Int]
      }

      Scalding:
      TextLine(data) // like List[String]
        .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int]
  • 58-62. groupBy
      Scala:
      val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]

      val groups = data groupBy { _ < 10 }

      groups // Map[Boolean, List[Int]]

      Scalding:
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.size }
      Annotations: groups all with == value; 'lessThanTenCounts
  • 65-67. groupBy
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }
      'total = [3, 72]
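As a cross-check of the grouping slides, the same data can be grouped with plain Scala collections. This is a sketch only; the object and value names are illustrative. For List(1, 2, 30, 42) the group of numbers below ten sums to 1 + 2 = 3 and the other group to 30 + 42 = 72.

```scala
object GroupByExample {
  val data: List[Int] = List(1, 2, 30, 42)

  // Key is the predicate result, value is the list of matching elements.
  val groups: Map[Boolean, List[Int]] = data.groupBy(_ < 10)

  // Per-group element count, analogous to { _.size }.
  val sizes: Map[Boolean, Int] = groups.map { case (k, v) => k -> v.size }

  // Per-group sum, analogous to summing 'num into 'total.
  val totals: Map[Boolean, Int] = groups.map { case (k, v) => k -> v.sum }
}
```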
  • 68-69. Main Class - "Runner"
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.util.ToolRunner
      import com.twitter.scalding

      object ScaldingJobRunner extends App {

        ToolRunner.run(new Configuration, new scalding.Tool, args)

      }
      Annotation on args: from App
  • 70-78. Word Count in Scalding (built up step by step)
      class WordCountJob(args: Args) extends Job(args) {

        val inputFile = args("input")
        val outputFile = args("output")

        TextLine(inputFile)
          .flatMap('line -> 'word) { line: String => tokenize(line) }
          .groupBy('word) { _.size }
          .write(Tsv(outputFile))

        def tokenize(text: String): Array[String] = implemented
      }
      The grouping step is shortened across slides 74-76: { group => group.size('count) }, then { group => group.size }, then { _.size }.
      Annotation on slide 78: “4 {” (a brace spanning the four pipeline lines)
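The slides leave the body of tokenize out. One possible implementation is sketched below; it is an assumption, not the speaker's code: lower-case the line and split on runs of non-word characters. The enclosing object name is hypothetical.

```scala
object Tokenizer {
  // Hypothetical tokenizer: lower-cases the text and splits on
  // anything that is not a letter, digit or underscore.
  def tokenize(text: String): Array[String] =
    text.toLowerCase
      .split("\\W+")
      .filter(_.nonEmpty) // drop the empty token produced by leading punctuation or empty input
}
```

Whether punctuation and case should be normalised this way depends on the data; a real job might keep hyphenated words or apply stemming instead.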
  • 79. 1 day in the life of a guy implementing Scalding jobs
  • 80-81. “How much are my shops selling?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .write(Tsv(output))
      Output:
      1    107
      2    144
      3    16
      …    …
  • 82-83. “How much are my shops selling?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .write(Tsv(output, writeHeader = true))
      Output:
      shopId    totalSoldItems
      1         107
      2         144
      3         16
      …         …
  • 84-85. “Which are the top selling shops?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .groupAll { _.sortBy('totalSoldItems).reverse }
        .write(Tsv(output, writeHeader = true))
      Output:
      shopId    totalSoldItems
      2         144
      1         107
      3         16
      …         …
  • 86-88. “What’s the top 3 shops?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
        .write(Tsv(output, writeHeader = true))
      Output:
      shopId    totalSoldItems
      2         144
      1         107
      3         16
      SLOW! Instead do sortWithTake!
  • 89-92. “What’s the top 3 shops?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .groupAll {
          _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
        }
        .write(Tsv(output, writeHeader = true))
      Output:
      x
      List((5,146), (2,142), (3,32))
      WAT!? Emits scala.collection.List[_]
  • 93-98. “What’s the top 3 shops?”
      Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .groupAll {
          _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
            (l: (Long, Long), r: (Long, Long)) =>
              l._2 > r._2
          }
        }
        .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
          x: List[(Long, Long)] => x
        }
        .write(Tsv(output, writeHeader = true))
      Output:
      shopId    totalSoldItems
      2         144
      1         107
      3         16
      Provide Ordering explicitly because implicit Ordering is not enough for Tuple2 here.
      MUCH faster Job = Happier me.
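Why a "sort with take" can beat a full sort followed by take(3) can be sketched in plain Scala: only the k best rows seen so far need to be kept, so memory stays bounded by k and partial results from different workers can be merged. The code below is an illustration under that assumption, using a bounded priority queue; it is not how Scalding implements sortWithTake, and the names are made up.

```scala
import scala.collection.mutable

object TopK {
  // Returns the k rows with the largest total, largest first.
  // Rows are (shopId, totalSoldItems) pairs.
  def topByTotal(rows: Iterator[(Long, Long)], k: Int): List[(Long, Long)] = {
    // Reversed ordering turns the queue into a min-heap on the total:
    // dequeue() removes the smallest of the rows currently kept.
    val heap = mutable.PriorityQueue.empty[(Long, Long)](
      Ordering.by[(Long, Long), Long](_._2).reverse)
    rows.foreach { row =>
      heap.enqueue(row)
      if (heap.size > k) heap.dequeue() // never hold more than k rows
    }
    heap.toList.sortBy(row => -row._2)
  }
}
```

Called with the totals from the earlier slides, `TopK.topByTotal(Iterator((1L, 107L), (2L, 144L), (3L, 16L)), 3)` returns the shops ordered 2, 1, 3.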
  • 101-109. Reduce, these Monoids
      interface:
      trait Monoid[T] {
        def zero: T
        def +(a: T, b: T): T
      }
      + 3 laws:
      Closure: (T, T) => T
        ∀a,b∈T: a·b∈T
      Associativity: (a + b) + c == a + (b + c)
        ∀a,b,c∈T: (a·b)·c = a·(b·c)
      Identity element: z + a == a + z == a
        ∃z∈T: ∀a∈T: z·a = a·z = a
  • 110. Reduce, these Monoids
      Summing:
      object IntSum extends Monoid[Int] {
        def zero = 0
        def +(a: Int, b: Int) = a + b
      }
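The interface and laws can be exercised in a few lines of plain Scala. This is a self-contained sketch, not Algebird's Monoid: the combining method is named `plus` here instead of the slide's `+` to keep call sites readable, and `StringConcat`, `sumAll` and `lawsHold` are illustrative additions.

```scala
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object IntSum extends Monoid[Int] {
    def zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
  }

  object StringConcat extends Monoid[String] {
    def zero: String = ""
    def plus(a: String, b: String): String = a + b
  }

  // Reduces any sequence using only zero and plus.
  def sumAll[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)

  // Spot-checks associativity and the identity law for three sample values.
  def lawsHold[T](a: T, b: T, c: T)(m: Monoid[T]): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c)) &&
      m.plus(m.zero, a) == a &&
      m.plus(a, m.zero) == a
}
```

Closure is enforced by the type signature `(T, T) => T`; the other two laws are not checked by the compiler, which is why a spot check like `lawsHold` (or a property-based test) is useful.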
  • 111-112. Monoid ops can start “Map-side”
      Monoid ops can already start being computed map-side!
      Example counts on the slide: bear, 2; car, 3; deer, 2; river, 2
      Examples: average(), sum(), sortWithTake(), histogram()
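Map-side computation works because an associative operation gives the same answer whether the data is combined in one pass or per partition first. The sketch below assumes three made-up partitions of words and merges per-partition counts; the names and data are illustrative, chosen so the final counts match the slide's bear/car/deer/river figures.

```scala
object MapSideCombine {
  type Counts = Map[String, Int]

  // Counts words within a single partition (the "map side").
  def countWords(words: Seq[String]): Counts =
    words.foldLeft(Map.empty[String, Int]) { (acc, w) =>
      acc.updated(w, acc.getOrElse(w, 0) + 1)
    }

  // Associative merge of two partial results (the "reduce side").
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  val partitions: Seq[Seq[String]] = Seq(
    Seq("deer", "bear", "river"),
    Seq("car", "car", "river"),
    Seq("deer", "car", "bear"))

  // Partial counts per partition, then merged.
  val combined: Counts = partitions.map(countWords).reduce(merge)
}
```

Only the small partial maps need to cross the network, not every individual (word, 1) pair.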
  • 113. Obligatory: “Go check out Algebird, NOW!” slide ALGE-birds
  • 114-116. BloomFilterMonoid
      val NUM_HASHES = 6
      val WIDTH = 32
      val SEED = 1
      val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

      val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
      val bf2 = bfMonoid.create("12", "45")
      val bf = bf1 ++ bf2
      // bf: com.twitter.algebird.BF =

      val approxBool = bf.contains("1")
      // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)

      val res = approxBool.isTrue
      // res: Boolean = true
  • 120. BloomFilterMonoid
      Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
        .groupBy('shopId) {
          // initial value assumed: the empty filter of the bfMonoid defined earlier
          _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
            (bf: BF, itemName: String) => bf + itemName
          }
        }
        .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
        .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
        .discard('itemBloom)
        .write(Tsv(output, writeHeader = true))

      shopId  hasSoldBeer  hasSoldWurst
      1       false        true
      2       false        true
      3       false        true
      4       true         false
      5       true         false

      ApproximateBoolean(true,0.9999580954658956)
      Why not Set[String]? It would OutOfMemory.
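To see why a Bloom filter fits in memory where a Set[String] would not, here is a deliberately tiny one in plain Scala. `TinyBloom` is a hypothetical class written for this sketch; it is not Algebird's `BF`, and it returns a plain Boolean rather than an `ApproximateBoolean`.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Fixed-size bit set plus `numHashes` hash functions (seeded MurmurHash3).
final case class TinyBloom(numHashes: Int, width: Int, bits: BitSet = BitSet.empty) {
  // Bit positions for an item, one per hash seed, each in [0, width).
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(item, seed)
      ((h % width) + width) % width
    }

  // Adding an item only sets bits; memory use never exceeds `width` bits.
  def +(item: String): TinyBloom =
    copy(bits = positions(item).foldLeft(bits)(_ + _))

  // Monoid-style merge: bitwise OR of two filters with the same parameters.
  def ++(other: TinyBloom): TinyBloom = {
    require(numHashes == other.numHashes && width == other.width, "parameters must match")
    copy(bits = bits | other.bits)
  }

  // false: definitely never added. true: probably added (false positives possible).
  def mightContain(item: String): Boolean =
    positions(item).forall(bits.contains)
}
```

The merge is associative and the empty filter is its identity, which is what lets the per-shop filters be built up map-side.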
  • 121. Joins
  • 124. Joins
      The “usual”:
        that.joinWithLarger('id1 -> 'id2, other)
        that.joinWithSmaller('id1 -> 'id2, other)

      that.joinWithTiny('id1 -> 'id2, other)
      joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
  • 127. Joins
      val people = IterableSource(
        (1, "hans") ::
        (2, "bob") ::
        (3, "hermut") ::
        (4, "heinz") ::
        (5, "klemens") :: … :: Nil,
        ('id, 'name))

      val cars = IterableSource(
        (99, 1, "bmw") ::
        (123, 2, "mercedes") ::
        (240, 11, "other") :: Nil,
        ('carId, 'ownerId, 'carName))

      import com.twitter.scalding.FunctionImplicits._

      people.joinWithLarger('id -> 'ownerId, cars)
        .map(('name, 'carName) -> 'sentence) {
          (name: String, car: String) =>
            s"Hello $name, your $car is really nice"
        }
        .project('sentence)
        .write(output)

      Hello hans, your bmw is really nice
      Hello bob, your mercedes is really nice
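The same inner join can be mimicked on plain Scala collections, which makes the expected output easy to check. `JoinSketch` is a hypothetical name, and only the first three people from the slide are used.

```scala
object JoinSketch {
  val people = Seq(1 -> "hans", 2 -> "bob", 3 -> "hermut")
  val cars = Seq((99, 1, "bmw"), (123, 2, "mercedes"), (240, 11, "other"))

  // Inner join on people.id == cars.ownerId; owners without a car
  // (and cars without a known owner) drop out.
  def sentences: Seq[String] = {
    val carsByOwner = cars.groupBy(_._2)
    for {
      (id, name)      <- people
      (_, _, carName) <- carsByOwner.getOrElse(id, Seq.empty)
    } yield s"Hello $name, your $carName is really nice"
  }
}
```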
  • 128. “map-side” join
      that.joinWithTiny('id1 -> 'id2, tinyPipe)
      Choose this when: Left > max(mappers, reducers) * Right
      or: when the Left side is 3 orders of magnitude larger.
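A rough sketch of what "map-side" means here, assuming the tiny side fits in memory: it is turned into a lookup table and the large side is streamed past it, so no shuffle of the large side is needed. The names `TinyJoinSketch`, `looksTinyEnough` and `hashJoin` are made up for this example and are not Scalding API.

```scala
object TinyJoinSketch {
  // The slide's rule of thumb: Left > max(mappers, reducers) * Right
  def looksTinyEnough(leftRows: Long, rightRows: Long, mappers: Int, reducers: Int): Boolean =
    leftRows > math.max(mappers, reducers).toLong * rightRows

  // Hash join: build an in-memory index of the tiny side, stream the large side.
  def hashJoin[K, A, B](large: Iterator[(K, A)], tiny: Seq[(K, B)]): Iterator[(K, A, B)] = {
    val lookup: Map[K, Seq[B]] =
      tiny.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty).map(b => (k, a, b))
    }
  }
}
```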
  • 132. Skew Joins
      val sampleRate = 0.001
      val reducers = 10
      val replicationFactor = 1
      val replicator = SkewReplicationA(replicationFactor)

      val genders: RichPipe = …
      val followers: RichPipe = …

      followers
        .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)
        .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
        .write(Tsv("output"))

      1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.
      2. Use these estimated counts to replicate the join keys, according to the given replication strategy.
      3. Join the replicated pipes together.
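The three steps can be sketched on plain collections with a simple "salting" scheme. This is an illustration under stated assumptions, not Scalding's implementation: sampling is made deterministic (every n-th row) so the example is reproducible, only the left side is sampled, and `hotThreshold` and `replicas` are hypothetical knobs standing in for the replication strategy.

```scala
object SkewJoinSketch {
  // Step 1: estimate how often each key appears, from a sample of the keys.
  def estimateCounts[K](keys: Seq[K], every: Int): Map[K, Int] =
    keys.zipWithIndex
      .collect { case (k, i) if i % every == 0 => k }
      .groupBy(identity)
      .map { case (k, sampled) => k -> sampled.size * every }

  def skewJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)],
                        every: Int, hotThreshold: Int, replicas: Int): Seq[(K, A, B)] = {
    val counts = estimateCounts(left.map(_._1), every)
    // Step 2: hot keys are spread over `replicas` sub-keys, cold keys keep one.
    def fanOut(k: K): Int = if (counts.getOrElse(k, 0) >= hotThreshold) replicas else 1

    // Left rows of a hot key are dealt round-robin onto its sub-keys ...
    val saltedLeft = left.zipWithIndex.map { case ((k, a), i) => ((k, i % fanOut(k)), a) }
    // ... and right rows of a hot key are copied once per sub-key.
    val replicatedRight = for {
      (k, b) <- right
      salt   <- 0 until fanOut(k)
    } yield ((k, salt), b)

    // Step 3: an ordinary join on the (key, salt) pair.
    val rightIndex = replicatedRight.groupBy(_._1)
    for {
      (key, a) <- saltedLeft
      (_, b)   <- rightIndex.getOrElse(key, Seq.empty)
    } yield (key._1, a, b)
  }
}
```

Each left row still meets every matching right row exactly once, so the result equals a plain join; only the distribution of work per key changes.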
  • 136. Where did my type-safety go?!
      Tsv(in, ('userId1, 'userId2, 'rel))
        .filter('userId1) { uid1: Long => uid1 == 1337 }
        .write(Tsv(out))

      Caused by: cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(…)
        at cascading.flow.planner.FlowStepJob.start(…)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(…)
        at java.util.concurrent.ThreadPoolExecutor$…
      Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
        at java.util.concurrent.ThreadPoolExecutor.runWorker(…)
        at java.util.concurrent.ThreadPoolExecutor$…
      Caused by: java.lang.NumberFormatException: For input string: "bob"
        at java.lang.NumberFormatException.forInputString(…)
        at java.lang.Long.parseLong(…)
        at java.lang.Long.parseLong(…)
        at cascading.tuple.coerce.LongCoerce.coerce(…)
        at cascading.tuple.coerce.LongCoerce.coerce(…)

      “oh, right… We changed that file to be user names, not ids…”
  • 138. Trap it!
      Tsv(in, ('userId1, 'userId2, 'rel))
        .addTrap(Tsv("errors")) // add a trap
        .filter('userId1) { uid1: Long => uid1 == 1337 }
        .write(Tsv(out))
      Solves “dirty data”, no help for maintenance.
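The idea behind a trap can be shown without Cascading: rows that fail to parse are diverted to a side output instead of failing the whole job. `TrapSketch` and `parseWithTrap` are hypothetical names for this sketch.

```scala
import scala.util.Try

object TrapSketch {
  // Returns (parsed ids, trapped raw rows).
  def parseWithTrap(rows: Seq[String]): (Seq[Long], Seq[String]) = {
    val attempts = rows.map(r => r -> Try(r.trim.toLong).toOption)
    val (good, bad) = attempts.partition(_._2.isDefined)
    (good.flatMap(_._2), bad.map(_._1))
  }
}
```

The job keeps running on "bob", but nothing tells you the file's schema changed, which is the maintenance problem the slide points at.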
  • 142. TypedAPI’s
      Fields API:
        Tsv(in, ('userId1, 'userId2, 'rel))
          .filter('userId1) { rel: Long => rel == 1337 }
          .write(Tsv(out))

      Typed API:
        import TDsl._

        TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
          .filter { _._1 == "bob" }
          .write(TypedTsv(out))

      Must give Type to each Field
  • 146. TypedAPI’s
      Tuple arity: 2
        TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
          .filter { _._1 == "bob" }
          .write(TypedTsv(out))

        Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
          at cascading.scheme.util.DelimitedParser.reset(…)

        “planning-time” exception

      Tuple arity: 3
        import TDsl._

        TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
          .filter { _._1 == "bob" }
          .write(TypedTsv(out))
  • 149. TypedAPI’s
      // … with Relationships {
        import TDsl._

        userRelationships(date)
          .filter { _._1 == "bob" }
          .write(TypedTsv(out))
      }

      Easier to reuse schemas now.
      Not coupled by Field names, but still too magic for reuse… “_1”?
  • 151. TypedAPI’s
      // … with Relationships {
        import TDsl._

        userRelationships(date)   // TypedPipe[Person]
          .filter { p: Person => … == "bob" }   // the compared field is missing in the source
          .write(TypedTsv(out))
      }
  • 154. Typed Joins
      case class UserName(id: Long, handle: String)
      case class UserFavs(byUser: Long, favs: List[Long])
      case class UserTweets(byUser: Long, tweets: List[Long])

      def users: TypedSource[UserName]
      def favs: TypedSource[UserFavs]
      def tweets: TypedSource[UserTweets]

      def output: TypedSink[(UserName, UserFavs, UserTweets)]

      users.groupBy(_.id)   // key reconstructed: the only Long field of UserName
        .join(favs.groupBy(_.byUser))
        .join(tweets.groupBy(_.byUser))
        .map { case (uid, ((user, favs), tweets)) =>
          (user, favs, tweets)
        }
        .write(output)

      3-way-merge in 1 MR step
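The shape of the typed three-way join can be tried on ordinary collections. This is only an analogy in plain Scala: `TypedJoinSketch.joinAll` is a hypothetical helper, and nothing here says how Scalding plans the join.

```scala
final case class UserName(id: Long, handle: String)
final case class UserFavs(byUser: Long, favs: List[Long])
final case class UserTweets(byUser: Long, tweets: List[Long])

object TypedJoinSketch {
  // Inner join of all three inputs on the user id. A wrong key type or a
  // misspelled field is a compile error here, not a runtime stack trace.
  def joinAll(users: Seq[UserName],
              favs: Seq[UserFavs],
              tweets: Seq[UserTweets]): Seq[(UserName, UserFavs, UserTweets)] = {
    val favsByUser = favs.groupBy(_.byUser)
    val tweetsByUser = tweets.groupBy(_.byUser)
    for {
      u <- users
      f <- favsByUser.getOrElse(u.id, Seq.empty)
      t <- tweetsByUser.getOrElse(u.id, Seq.empty)
    } yield (u, f, t)
  }
}
```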
  • 155. Do the DOT
      > run pl.project13.oculus.job.WordCountJob --local --tool.graph --input in --output out

      writing DOT:
      writing Steps DOT:
  • 159. Do the DOT
      > dot -Tpng
      (Figure: the rendered flow graph, with its steps labelled MAP and RED.)
  • 162. <3 Testing
      class WordCountJobTest extends FlatSpec
        with ShouldMatchers with TupleConversions {

        "WordCountJob" should "count words" in {
          JobTest(new WordCountJob(_))
            .arg("input", "inFile")
            .arg("output", "outFile")
            .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
            .sink[(String, Int)](Tsv("outFile")) { out =>
              // "kapi" occurs twice in the input line above
              out.toList should contain ("kapi" -> 2)
            }
            .run
            .finish
        }

      }
  • 170. <3 Testing: run || runHadoop
      The same WordCountJobTest can end in .run or in .runHadoop before .finish.
  • 181. “Parallelize all the batches!”
      Feels much like Scala collections
      Local Mode thanks to Cascading
      Easy to add custom Taps
      Type Safe, when you want to
      Pure Scala
      Testing friendly
      Matrix API
      Efficient columnar storage (Parquet)
  • 183. Scalding Re-Cap
      TextLine(inputFile)
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .groupBy('word) { _.size }
        .write(Tsv(outputFile))
      (4 lines)
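For comparison, the same word count on in-memory Scala collections, which is what the Scalding job is meant to feel like. The `tokenize` below is an assumption (lower-case, split on whitespace); the slides do not show the one used in the job.

```scala
object WordCountSketch {
  // Hypothetical tokenizer: lower-case the line and split on whitespace.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // flatMap to words, group identical words, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => w -> ws.size }
}
```

The difference is only where the data lives: this version is bound by memory, the Scalding version is not.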
  • 184. Try it!
      $ activator new activator-scalding
      Template by Dean Wampler