Apache Spark
when things go wrong
@rabbitonweb
Apache Spark - when things go wrong
val sc = new SparkContext(conf)
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
If the above seems odd to you, this talk is probably not for you.
@rabbitonweb tweet compliments and complaints
Target for this talk
4 things for take-away
1. Knowledge of how Apache Spark works
internally
2. Courage to look at Spark’s implementation
(code)
3. Notion of how to write efficient Spark
programs
4. Basic ability to monitor Spark when things
are starting to act weird
Two words about examples used
(Diagram: a "Super Cool App" writes events to a Journal stored on a Spark cluster of three worker nodes (node 1, node 2, node 3) and two masters (master A, master B).)
Two words about examples used
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
What is a RDD?
Resilient Distributed Dataset
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
(Diagram: the same event journal, distributed across node 1, node 2 and node 3.)
What is a RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
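To make those three chunks concrete, here is a deliberately minimal sketch of a custom RDD (not from the talk and not from Spark's sources; the class and partition names are made up for illustration). It marks where each of the three pieces lives:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// hypothetical partition type: each partition covers a range of integers
case class RangePartition(index: Int, from: Int, until: Int) extends Partition

// a toy RDD producing the numbers 0 until n, split into `slices` partitions
class RangeishRDD(sc: SparkContext, n: Int, slices: Int)
  extends RDD[Int](sc, Nil) {                        // (1) Nil = no parent dependencies, this is a root RDD

  override def getPartitions: Array[Partition] =     // (2) how its data is partitioned
    (0 until slices).map { i =>
      RangePartition(i, (i * n) / slices, ((i + 1) * n) / slices)
    }.toArray

  override def compute(split: Partition, ctx: TaskContext): Iterator[Int] = {  // (3) how to evaluate one partition
    val p = split.asInstanceOf[RangePartition]
    (p.from until p.until).iterator
  }
}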
What is a partition?
A partition represents a subset of data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD.
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In a HadoopRDD, a partition corresponds exactly to a file block in HDFS.
example: HadoopRDD
(Diagram: the journal's HDFS blocks are spread over node 1, node 2 and node 3; each block becomes one partition of the HadoopRDD.)
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) }
    array
  }
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter { e =>
  LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1)
}
How is a MapPartitionsRDD partitioned?
A MapPartitionsRDD inherits partition information from its parent RDD.
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
  ...
  override def getPartitions: Array[Partition] = firstParent[T].partitions
What is a RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
RDD parent
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
RDD parent
sc.textFile()
.groupBy()
.map { }
.filter { }
.take()
.foreach()
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
HadoopRDD -> ShuffledRDD -> MapPartRDD -> MapPartRDD
Two types of parent dependencies:
1. narrow dependency
2. wide dependency
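Both kinds of dependency can be inspected straight from the shell; rdd.dependencies is a standard Spark API, although the exact classes printed may differ between versions. A small sketch, reusing the journal example:

val journal = sc.textFile("hdfs://journal/*")
val grouped = journal.groupBy(extractDate _)                  // wide: requires a shuffle
val sized   = grouped.map { case (d, es) => (d, es.size) }    // narrow: stays within a partition

// narrow dependency: each child partition depends on exactly one parent partition
sized.dependencies.foreach(println)     // prints a OneToOneDependency
// wide (shuffle) dependency: a child partition may depend on many parent partitions
grouped.dependencies.foreach(println)   // prints a ShuffleDependency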
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Tasks: each partition of an RDD is evaluated by a separate task.
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Stage 1
Stage 2
The graph is cut into stages at the shuffle boundary (here: the groupBy).
Two important concepts:
1. shuffle write
2. shuffle read
toDebugString
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
res5: String =
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
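Reading the output: the number in parentheses is the partition count of that RDD (4 after the groupBy, 6 for the HadoopRDD, one per HDFS block), and the indented "+-" marks a shuffle boundary, i.e. the border between two stages.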
What is a RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Stage 1
Stage 2
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
action
Actions are implemented using the sc.runJob method.
Running Job aka materializing DAG
/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: Iterator[T] => U,
): Array[U]
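For illustration only, runJob can also be called directly; the exact overloads and their parameter order differ between Spark versions (newer versions take the function before the partition list), so treat this as a sketch rather than a canonical call:

// evaluate only partitions 0 and 1, returning the first line of each (if any)
val rdd = sc.textFile("hdfs://journal/*")
val firstLines: Array[Option[String]] =
  sc.runJob(
    rdd,
    (it: Iterator[String]) => if (it.hasNext) Some(it.next()) else None,  // func: Iterator[T] => U
    Seq(0, 1)                                                             // which partitions to run on
  )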
Running Job aka materializing DAG
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Multiple jobs for single action
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that
 * partition to estimate the number of additional partitions needed to satisfy the limit.
 */
def take(num: Int): Array[T] = {
  (...)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (...)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (...)
  buf.toArray
}
Let's test what we've learned
Towards efficiency
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
events.count
Stage 1
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
(Animation: Stage 1 tasks on node 1, node 2 and node 3 read their local partitions of the journal, extract the grouping key and write the shuffle data to local disk: the shuffle write.)
Everyday I'm Shuffling
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
(Animation: the shuffle itself; each node fetches the shuffle blocks it needs for its keys from all the other nodes.)
Stage 2
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
(Animation: Stage 2 tasks on node 1, node 2 and node 3 perform the shuffle read, build the groups, then run the map and filter on them.)
Let's refactor
The original pipeline filters only after the shuffle:
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
Let's refactor
Filter before the shuffle instead, so far less data travels between node 1, node 2 and node 3:
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
Let's refactor
Also control how many partitions the shuffle produces (here: 6):
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _, 6)
  .map { case (date, events) => (date, events.size) }
Let's refactor
Finally, replace groupBy + map with combineByKey, which counts events per date during the shuffle instead of materializing the full groups:
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), e))
  .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)
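As an aside not on the original slides: the same per-date count can be written with reduceByKey, which is itself built on combineByKey and likewise combines values map-side before the shuffle:

val eventCounts = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), 1))    // emit (date, 1) instead of (date, wholeEvent)
  .reduceByKey(_ + _)               // only partial counts are shuffled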
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256) // note, this will cause a shuffle
  .map(e => (extractDate(e), e))
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64) // this will NOT shuffle
  .map(e => (extractDate(e), e))
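A quick way to check what repartition and coalesce actually did (getNumPartitions is the standard API; the concrete numbers below are just examples):

val raw = sc.textFile("hdfs://journal/*")
println(raw.getNumPartitions)        // e.g. 4, one per HDFS block / input split

val more = raw.repartition(256)      // full shuffle, data is redistributed
println(more.getNumPartitions)       // 256

val fewer = raw.coalesce(2)          // no shuffle, existing partitions are merged locally
println(fewer.getNumPartitions)      // 2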
Enough theory
Few optimization tricks
1. Serialization issues (e.g. use KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
a. groupBy->map vs combineByKey
b. map vs mapValues (partitioner)
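Items 1-3 are plain configuration. A sketch of what they might look like on a SparkConf (the property keys are standard Spark settings; MyEvent is a hypothetical domain class, and which codec wins depends on your data, so measure):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("journal-stats")
  // 1. Kryo is usually faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent]))     // hypothetical class
  // 2. re-launch suspiciously slow tasks on a busy, shared cluster
  .set("spark.speculation", "true")
  // 3. codec used for shuffle data and spills; experiment with lz4, lzf, snappy, ...
  .set("spark.io.compression.codec", "lz4")

val sc = new SparkContext(conf)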
Spark performance - shuffle optimization
map -> groupBy -> join
Optimization: the shuffle before the join is avoided if the data is already partitioned.
map -> groupBy -> map -> join
(map may change the keys, so the partitioner is lost and the join shuffles again)
map -> groupBy -> mapValues -> join
(mapValues keeps the keys and the partitioner, so the join can reuse the existing partitioning)
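The point of "map vs mapValues": mapValues keeps the parent's partitioner because it cannot change the keys, while map drops it, so a later join against data partitioned the same way can skip a shuffle. A small sketch (partition counts are arbitrary):

import org.apache.spark.HashPartitioner

val byDate = sc.textFile("hdfs://journal/*")
  .map(e => (extractDate(e), e))
  .partitionBy(new HashPartitioner(8))

val sizes1 = byDate.groupByKey().map { case (d, es) => (d, es.size) }
println(sizes1.partitioner)    // None: map may have changed the keys, partitioner dropped

val sizes2 = byDate.groupByKey().mapValues(_.size)
println(sizes2.partitioner)    // Some(HashPartitioner) with 8 partitions: partitioner kept

// joining sizes2 with another RDD using the same partitioner avoids an extra shuffle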
Few optimization tricks
1. Serialization issues (e.g. use KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
a. groupBy->map vs combineByKey
b. map vs mapValues (partitioner)
5. Understand the basics & internals to avoid common mistakes
Example: using an RDD inside another RDD. Invalid, since only the driver can
call operations on RDDs (not the workers):
rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd)
rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
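Spelled out, the first rewrite above looks roughly like this (the RDD types are hypothetical, just to show the shape of the fix):

import org.apache.spark.rdd.RDD

// WRONG: otherRdd is referenced inside a task running on a worker; RDDs only live on the driver
// rdd.map { case (k, v) => (k, v, otherRdd.lookup(k)) }

// RIGHT: express the lookup as a join, which Spark plans as a distributed operation
val rdd: RDD[(String, Int)] = ???          // hypothetical
val otherRdd: RDD[(String, String)] = ???  // hypothetical
val joined: RDD[(String, (Int, String))] = rdd.join(otherRdd)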
Resources & further reading
● Spark Summit Talks @ Youtube
SPARK SUMMIT EUROPE!!! October 27th to 29th
● "Learning Spark" book, published by O'Reilly
● Apache Spark Documentation
○ https://spark.apache.org/docs/latest/monitoring.html
○ https://spark.apache.org/docs/latest/tuning.html
● Mailing List aka solution-to-your-problem-is-probably-already-there
● Pretty soon my blog & github :)
Paweł Szulc
blog: http://rabbitonweb.com
twitter: @rabbitonweb
github: https://github.com/rabbitonweb

 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 

Recently uploaded (20)

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 

Apache Spark when things go wrong

  • 1. Apache Spark when things go wrong @rabbitonweb
  • 2. Apache Spark - when things go wrong val sc = new SparkContext(conf) sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println) If above seems odd, this talk is rather not for you. @rabbitonweb tweet compliments and complaints
  • 12. 4 things for take-away
  • 13. 4 things for take-away 1. Knowledge of how Apache Spark works internally
  • 14. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code)
  • 15. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs
  • 16. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs 4. Basic ability to monitor Spark when things are starting to act weird
  • 17. Two words about examples used
  • 18. Two words about examples used Super Cool App
  • 19. Two words about examples used Journal Super Cool App events
  • 20. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 21. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 22. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 23. Two words about examples used
  • 24. Two words about examples used sc.textFile(“hdfs://journal/*”)
  • 25. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _)
  • 26. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 27. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 28. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300)
  • 29. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 30. What is a RDD?
  • 31. What is a RDD? Resilient Distributed Dataset
  • 32. What is a RDD? Resilient Distributed Dataset
  • 33. ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 34. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 What is a RDD?
  • 35. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 36. What is a RDD?
  • 37. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work:
  • 38. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent
  • 39. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned
  • 40. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 41. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
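A quick way to see those three pieces from the spark-shell (a minimal sketch, assuming a running shell with sc in scope and the journal files from the examples):

    val journal = sc.textFile("hdfs://journal/*")
    val sizes = journal.map(_.length)
    sizes.dependencies       // 1. pointer to the parent – a OneToOneDependency on journal
    sizes.partitions.length  // 2. how the data is partitioned – inherited from the HadoopRDD
    sizes.count()            // 3. evaluation is lazy; compute() runs only when an action is called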
  • 42. What is a partition?
  • 43. What is a partition? A partition represents subset of data within your distributed collection.
  • 44. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ???
  • 45. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ??? How this subset is defined depends on type of the RDD
  • 46. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”)
  • 47. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned?
  • 48. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned? In HadoopRDD partition is exactly the same as file chunks in HDFS
  • 49. example: HadoopRDD 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 50. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 51. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 52. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 53. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 54. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 55. example: HadoopRDD class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging { ... override def getPartitions: Array[Partition] = { val jobConf = getJobConf() SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array }
  • 56. example: HadoopRDD class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging { ... override def getPartitions: Array[Partition] = { val jobConf = getJobConf() SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array }
  • 57. example: HadoopRDD class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging { ... override def getPartitions: Array[Partition] = { val jobConf = getJobConf() SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array }
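If the HDFS block layout yields too few splits, textFile also takes a minimum number of partitions as a hint (a sketch; the actual count still depends on how far the input can be split):

    val journal = sc.textFile("hdfs://journal/*", minPartitions = 64)
    journal.partitions.length   // at least 64 for a splittable input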
  • 58. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 59. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } How MapPartitionsRDD is partitioned?
  • 60. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } How MapPartitionsRDD is partitioned? MapPartitionsRDD inherits partition information from its parent RDD
  • 61. example: MapPartitionsRDD class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) { ... override def getPartitions: Array[Partition] = firstParent[T].partitions
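That inheritance is easy to confirm in the shell (a small sketch, assuming extractDate is defined as in the examples):

    import java.time.LocalDate
    val journal = sc.textFile("hdfs://journal/*")
    val fromMarch = journal.filter(e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1))
    journal.partitions.length == fromMarch.partitions.length   // true – filter introduces no shuffle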
  • 62. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 63. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 64. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 65. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 66. RDD parent sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 67. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 68. Directed acyclic graph HadoopRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 69. Directed acyclic graph HadoopRDD ShuffeledRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 70. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 71. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 72. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 73. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wider dependency
  • 74. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wider dependency
  • 75. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wider dependency
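The dependency type can be inspected directly on each RDD (a sketch, again assuming extractDate from the examples):

    val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)
    grouped.dependencies                                        // a ShuffleDependency – wide
    grouped.map { case (d, es) => (d, es.size) }.dependencies   // a OneToOneDependency – narrow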
  • 76. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 77. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 78. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 79. Directed acyclic graph HadoopRDD ShuffeledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 80. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 81. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 82. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 83. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two important concepts: 1. shuffle write 2. shuffle read
  • 84. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 85. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString
  • 86. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString res5: String = (4) MapPartitionsRDD[22] at filter at <console>:50 [] | MapPartitionsRDD[21] at map at <console>:49 [] | ShuffledRDD[20] at groupBy at <console>:48 [] +-(6) HadoopRDD[17] at textFile at <console>:47 []
  • 87. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 88. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 89. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { }
  • 90. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect()
  • 91. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action
  • 92. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action Actions are implemented using sc.runJob method
  • 93. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( ): Array[U]
  • 94. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], ): Array[U]
  • 95. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], ): Array[U]
  • 96. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], func: Iterator[T] => U, ): Array[U]
  • 97. Running Job aka materializing DAG /** * Return an array that contains all of the elements in this RDD. */ def collect(): Array[T] = { val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray) Array.concat(results: _*) }
  • 98. Running Job aka materializing DAG /** * Return an array that contains all of the elements in this RDD. */ def collect(): Array[T] = { val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray) Array.concat(results: _*) } /** * Return the number of elements in the RDD. */ def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
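count is a thin wrapper over the same mechanism: run a per-partition function, then combine the results on the driver. A hedged sketch of doing it by hand with the simplest runJob overload:

    val journal = sc.textFile("hdfs://journal/*")
    // one task per partition, each returning its local line count
    val perPartition: Array[Long] = sc.runJob(journal, (it: Iterator[String]) => it.size.toLong)
    val total = perPartition.sum   // same result as journal.count()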
  • 99. Multiple jobs for single action /** * Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit. */ def take(num: Int): Array[T] = { (….) val left = num - buf.size val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true) (….) res.foreach(buf ++= _.take(num - buf.size)) partsScanned += numPartsToTry (….) buf.toArray }
  • 101. Towards efficiency val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 102. Towards efficiency val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString (4) MapPartitionsRDD[22] at filter at <console>:50 [] | MapPartitionsRDD[21] at map at <console>:49 [] | ShuffledRDD[20] at groupBy at <console>:48 [] +-(6) HadoopRDD[17] at textFile at <console>:47 []
  • 103. Towards efficiency val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString (4) MapPartitionsRDD[22] at filter at <console>:50 [] | MapPartitionsRDD[21] at map at <console>:49 [] | ShuffledRDD[20] at groupBy at <console>:48 [] +-(6) HadoopRDD[17] at textFile at <console>:47 [] events.count
  • 104. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 105. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 106. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 107. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 108. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 109. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 110. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 111. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 112. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 113. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 114. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 115. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 116. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 117. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 119. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 120. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 121. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 122. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 123. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 124. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 125. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 126. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 127. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 128. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 129. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 130. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 131. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 132. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 133. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 134. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 135. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 136. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 137. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 138. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 139. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 140. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 141. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 142. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 143. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 144. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 145. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 146. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 147. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 148. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 149. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 150. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int,e: String) => counter + 1,(c1: Int, c2: Int) => c1 + c2) .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 151. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int,e: String) => counter + 1,(c1: Int, c2: Int) => c1 + c2)
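The map + combineByKey pair above is just a per-date count. An equivalent sketch with reduceByKey (assuming extractDate and java.time.LocalDate are in scope; note the filter takes an explicit lambda, since extractDate _ cannot be handed to LocalDate.parse directly):

    val counts = sc.textFile("hdfs://journal/*")
      .filter(e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1))
      .map(e => (extractDate(e), 1))
      .reduceByKey(_ + _)   // combines on the map side, so full event groups never cross the network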
  • 152. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .map( e => (extractDate(e), e))
  • 153. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .repartition(256) // note, this will cause a shuffle .map( e => (extractDate(e), e))
  • 154. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e))
  • 155. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { LocalDate.parse(extractDate _) isAfter LocalDate.of(2015,3,1) } .coalesce(64) // this will NOT shuffle .map( e => (extractDate(e), e))
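The difference is easy to see by comparing partition counts (a sketch; the exact numbers depend on the input layout):

    val journal = sc.textFile("hdfs://journal/*")
    journal.partitions.length                    // e.g. 1024 small splits
    journal.coalesce(64).partitions.length       // 64 – merges splits locally, no shuffle
    journal.repartition(256).partitions.length   // 256 – redistributes the data, triggers a shuffle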
  • 157.–186. (image-only slides, no transcript text)
  • 188. Few optimization tricks 1. Serialization issues (e.g KryoSerializer)
  • 189. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster
  • 190. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs
  • 191. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey
  • 192. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
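Tricks 1–3 are plain configuration. A hedged sketch of the relevant settings (the app name is illustrative and the best values depend on your Spark version and workload):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("journal-stats")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // 1. faster, more compact serialization
      .set("spark.speculation", "true")                                      // 2. re-launch suspiciously slow tasks
      .set("spark.io.compression.codec", "lz4")                              // 3. codec for shuffle and broadcast data
    val sc = new SparkContext(conf)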
  • 193. Spark performance - shuffle optimization map groupBy
  • 194. Spark performance - shuffle optimization map groupBy
  • 195. Spark performance - shuffle optimization map groupBy join
  • 196. Spark performance - shuffle optimization map groupBy join
  • 197. Spark performance - shuffle optimization map groupBy join Optimization: shuffle avoided if data is already partitioned
  • 198. Spark performance - shuffle optimization map groupBy map
  • 199. Spark performance - shuffle optimization map groupBy map
  • 200. Spark performance - shuffle optimization map groupBy map join
  • 201. Spark performance - shuffle optimization map groupBy map join
  • 202. Spark performance - shuffle optimization map groupBy mapValues
  • 203. Spark performance - shuffle optimization map groupBy mapValues
  • 204. Spark performance - shuffle optimization map groupBy mapValues join
  • 205. Spark performance - shuffle optimization map groupBy mapValues join
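The reason mapValues helps: it keeps the parent's partitioner, so a later join or groupBy on the same key needs no shuffle, while a plain map drops it because Spark cannot assume the keys were left untouched. A sketch (assuming extractDate from the examples):

    import org.apache.spark.HashPartitioner
    val byDate = sc.textFile("hdfs://journal/*")
      .map(e => (extractDate(e), e))
      .partitionBy(new HashPartitioner(8))
    byDate.mapValues(_.length).partitioner                    // Some(HashPartitioner) – preserved
    byDate.map { case (k, v) => (k, v.length) }.partitioner   // None – a later join will shuffle again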
  • 206. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
  • 207. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to omit common mistakes
  • 208. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to omit common mistakes Example: Usage of RDD within other RDD. Invalid since only the driver can call operations on RDDS (not other workers)
  • 209. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to omit common mistakes Example: Usage of RDD within other RDD. Invalid since only the driver can call operations on RDDS (not other workers) rdd.map((k,v) => otherRDD.get(k)) rdd.map(e => otherRdd.map {})
  • 210. Few optimization tricks 1. Serialization issues (e.g KryoSerializer) 2. Turn on speculations if on shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to omit common mistakes Example: Usage of RDD within other RDD. Invalid since only the driver can call operations on RDDS (not other workers) rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd) rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
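A minimal sketch of the join-based fix (the data here is illustrative, not from the slides):

    val userNames = sc.parallelize(Seq((10, "Ania Nowak"), (198, "Jan Kowalski")))
    val lastLogin = sc.parallelize(Seq((10, "10/05/2015 10:17:03")))
    // wrong: userNames.map { case (id, _) => lastLogin.lookup(id) } – an RDD used inside a task
    val enriched = userNames.join(lastLogin)   // RDD[(Int, (String, String))] – the lookup becomes a shuffle join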
  • 211. Resources & further read ● Spark Summit Talks @ Youtube
  • 212. Resources & further read ● Spark Summit Talks @ Youtube SPARK SUMMIT EUROPE!!! October 27th to 29th
  • 213. Resources & further read ● Spark Summit Talks @ Youtube
  • 214. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly
  • 215. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html
  • 216. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there
  • 217. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there ● Pretty soon my blog & github :)
  • 218.–219. (image-only slides, no transcript text)
  • 223. Paweł Szulc blog: http://rabbitonweb.com twitter: @rabbitonweb github: https://github.com/rabbitonweb