Advanced Spark Programming - Part 1
Persistence (caching)

1. RDDs are lazily evaluated
2. An RDD and its dependencies are recomputed on every action
3. We may wish to use the same RDD multiple times
4. To avoid re-computing, we can persist the RDD (a minimal sketch follows)
Persistence (caching)

1. If a node that has data persisted on it fails, Spark will recompute the lost partitions of the data when needed.
2. We can also replicate our data on multiple nodes if we want to be able to handle node failure without slowdown (see the sketch below).
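A small sketch of replicated caching: MEMORY_ONLY_2 keeps each cached partition in memory on two nodes, so losing one executor does not force a recomputation. The RDD name and path are only illustrative.

import org.apache.spark.storage.StorageLevel

var visits = sc.textFile("hdfs:///illustrative/path/visits") // hypothetical input
visits.persist(StorageLevel.MEMORY_ONLY_2) // two in-memory replicas of every partition
visits.count()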
Persistence (caching) - Example

var nums = sc.parallelize((1 to 100000), 50)

def mysum(itr: Iterator[Int]): Iterator[Int] = {
  Array(itr.sum).toIterator
}

var partitions = nums.mapPartitions(mysum)

def incrByOne(x: Int): Int = {
  x + 1
}

var partitions1 = partitions.map(incrByOne)
partitions1.collect()
// say, partitions1 is going to be used very frequently
Persistence (caching) - Example: persist()

partitions1.persist()
res21: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:29

partitions1.getStorageLevel
res27: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Persistence (caching) - Example: persist()

// To unpersist an RDD - before re-persisting we need to unpersist
partitions1.unpersist()

import org.apache.spark.storage.StorageLevel

// This persists our RDD to memory and disk
partitions1.persist(StorageLevel.MEMORY_AND_DISK)

partitions1.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(true, true, false, true, 1)

StorageLevel.MEMORY_AND_DISK
res7: org.apache.spark.storage.StorageLevel = StorageLevel(true, true, false, true, 1)
Persistence - Understanding Storage Level

rdd.persist(storageLevel)

// Constructor arguments, in order:
// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
var mysl = StorageLevel(true, true, false, true, 1)
partitions1.persist(mysl)
Persistence - StorageLevel Shortcuts

partitions1.persist(StorageLevel.MEMORY_ONLY)
Persistence (caching) - StorageLevel

MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

useDisk = false, useMemory = true, useOffHeap = false, deserialized = true, replication = 1

partitions1.persist(StorageLevel.MEMORY_ONLY)
Persistence (caching) - StorageLevel

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

useDisk = true, useMemory = true, useOffHeap = false, deserialized = true, replication = 1

partitions1.persist(StorageLevel.MEMORY_AND_DISK)
Persistence (caching) - StorageLevel

MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

useDisk = false, useMemory = true, useOffHeap = false, deserialized = false, replication = 1

partitions1.persist(StorageLevel.MEMORY_ONLY_SER)
Persistence (caching) - StorageLevel

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

useDisk = true, useMemory = true, useOffHeap = false, deserialized = false, replication = 1

partitions1.persist(StorageLevel.MEMORY_AND_DISK_SER)
Persistence (caching) - StorageLevel

DISK_ONLY: Store the RDD partitions only on disk.

useDisk = true, useMemory = false, useOffHeap = false, deserialized = false, replication = 1

partitions1.persist(StorageLevel.DISK_ONLY)
Persistence (caching) - StorageLevel

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes (same flags as the corresponding level, replication = 2).
Persistence (caching) - StorageLevel

OFF_HEAP (experimental): Store RDD in serialized format in off-heap memory (Tachyon).

useOffHeap = true, replication = 1
Which Storage Level to Choose?

● If the RDD fits comfortably in memory, use MEMORY_ONLY
● If not, try MEMORY_ONLY_SER
● Don't persist to disk unless the computation is really expensive
● For fast fault recovery, use a replicated level (a sketch of this guidance follows)
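A hedged sketch of the guidance above as code; the helper function and its parameters are illustrative, not a Spark API:

import org.apache.spark.storage.StorageLevel

// Hypothetical helper that maps the rules of thumb to real storage levels
def chooseLevel(fitsInMemory: Boolean, needFastRecovery: Boolean): StorageLevel =
  (fitsInMemory, needFastRecovery) match {
    case (true, false)  => StorageLevel.MEMORY_ONLY
    case (true, true)   => StorageLevel.MEMORY_ONLY_2
    case (false, false) => StorageLevel.MEMORY_ONLY_SER
    case (false, true)  => StorageLevel.MEMORY_ONLY_SER_2
  }

partitions1.persist(chooseLevel(fitsInMemory = false, needFastRecovery = false))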
Data Partitioning

● Lay out data to minimize transfer
● Not helpful if you need to scan the dataset only once
● Helpful when a dataset is reused multiple times in key-oriented operations
● Available for key-value RDDs
● Causes the system to group elements based on key
● Spark does not give explicit control over which worker node has which key
● It lets the program control/ensure which set of keys will appear together
  ○ based on a hash of the key, for example
  ○ or you could range-sort
Data Partitioning - Example - Subscriptions

Objective:
Count how many users visited a link that was not one of their subscribed topics.

Two datasets are joined on UserID:
● UserData (subscriptions): (UserID, UserSubscriptionTopicsInfo)
● Events: (UserID, LinkInfo)
Data Partitioning - Example

The code below will run fine as is, but it will be inefficient: it causes a lot of network transfer.
Periodically, we combine UserData with the events of the last five minutes (a much smaller dataset).
Data Partitioning - Example

val sc = new SparkContext(...)
val userData = sc
  .sequenceFile[UserID, UserInfo]("hdfs://...")
  // .partitionBy(new HashPartitioner(100))
  .persist()

// Function called periodically to process a logfile of events in the past 5 minutes;
// we assume that this is a SequenceFile containing (UserID, LinkInfo) pairs.
def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) => !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}
Data Partitioning - Example
import org.apache.spark.HashPartitioner

val sc = new SparkContext(...)
val basedata = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
var userData = basedata.partitionBy(new HashPartitioner(8)).persist()
Data Partitioning - Partitioners

1. Reorganize the keys of a PairRDD
2. Can be applied using partitionBy
3. Examples
   a. HashPartitioner
   b. RangePartitioner
Data Partitioning - HashPartitioner

● Implements hash-based partitioning using Java's Object.hashCode
● Does not work if the key is an array (a quick sketch of why follows)
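A small sketch of why array keys are a problem, in plain Scala (no cluster needed; purely illustrative):

var a = Array(1, 2, 3)
var b = Array(1, 2, 3)
// Java arrays use identity-based equals/hashCode rather than content-based ones,
// so two arrays with the same contents compare unequal and (in practice) hash differently
println(a == b) // false
println(a.hashCode == b.hashCode) // false in practice
// Hence hash partitioning cannot reliably send the "same" array key to the same partition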
Data Partitioning - RangePartitioner

● Partitions sortable records by range into roughly equal ranges
● The ranges are determined by sampling
Data Re-partitioning - Example
import org.apache.spark.HashPartitioner
import org.apache.spark.RangePartitioner
var rdd = sc.parallelize(1 to 10).map((_,1))
rdd.partitionBy(new HashPartitioner(2)).glom().collect()
rdd.partitionBy(new RangePartitioner(2,rdd)).glom().collect()
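What to expect from the two calls above (hedged: ordering within partitions and the exact range split can vary):

// HashPartitioner(2) partitions by key.hashCode % 2, so even keys land in
// partition 0 and odd keys in partition 1, e.g.
// Array(Array((2,1), (4,1), (6,1), (8,1), (10,1)), Array((1,1), (3,1), (5,1), (7,1), (9,1)))
// RangePartitioner(2, rdd) samples the keys and builds contiguous ranges of
// roughly equal size, e.g. keys 1-5 in partition 0 and keys 6-10 in partition 1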
Data Partitioning - Operations That Benefit from Partitioning

● cogroup()
● groupWith(), groupByKey(), reduceByKey()
● join(), leftOuterJoin(), rightOuterJoin()
● combineByKey(), and lookup()

A minimal sketch of the benefit follows.
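Assuming a spark-shell session with illustrative data: once a pair RDD has been partitioned and persisted, reduceByKey reuses the existing partitioner and aggregates each key locally instead of shuffling the RDD again.

import org.apache.spark.HashPartitioner

var pairs = sc.parallelize(Array(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
  .partitionBy(new HashPartitioner(4))
  .persist()

// All values of a key already live in one partition, so this aggregation
// does not trigger another shuffle of pairs
var totals = pairs.reduceByKey(_ + _)
totals.collect()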
Data Partitioning - Creating custom partitioner

Before (3 partitions):
[ [(sandeep,1)], [(giri,1), (abhishek,1)], [(sravani,1), (jude,1)] ]

After (2 partitions, split on the first letter of the key):
[ [(giri,1), (abhishek,1), (jude,1)], [(sravani,1), (sandeep,1)] ]
Data Partitioning - Creating custom partitioner

import org.apache.spark.Partitioner

class TwoPartsPartitioner(override val numPartitions: Int) extends Partitioner {
  // Keys starting with a letter after 'J' go to partition 1, the rest to partition 0
  def getPartition(key: Any): Int = key match {
    case s: String => if (s(0).toUpper > 'J') 1 else 0
  }
  override def equals(other: Any): Boolean = other.isInstanceOf[TwoPartsPartitioner]
  override def hashCode: Int = 0
}

https://gist.github.com/girisandeep/f90e456da6f2381f9c86e8e6bc4e8260
Data Partitioning - Creating custom partitioner (usage)
var x = sc.parallelize(Array(("sandeep",1),("giri",1),("abhishek",1),("sravani",1),("jude",1)), 3)
x.glom().collect()
//Array(Array((sandeep,1)), Array((giri,1), (abhishek,1)), Array((sravani,1), (jude,1)))
//[ [(sandeep,1)], [(giri,1), (abhishek,1)], [(sravani,1), (jude,1)] ]
var y = x.partitionBy(new TwoPartsPartitioner(2))
y.glom().collect()
//Array(Array((giri,1), (abhishek,1), (jude,1)), Array((sandeep,1), (sravani,1)))
//[ [(giri,1), (abhishek,1), (jude,1)], [(sandeep,1), (sravani,1)] ]