Anatomy of RDD
A deep dive into the RDD data structure
https://github.com/phatak-dev/anatomy-of-rdd
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● What is RDD?
● Immutable and Distributed
● Partitions
● Laziness
● Caching
● Extending Spark API
What is RDD?
Resilient Distributed Dataset
- A large collection of data with the following properties
- Immutable
- Distributed
- Lazily evaluated
- Type inferred
- Cacheable
Immutable and Distributed
Partitions
● Logical division of data
● Derived from Hadoop Map/Reduce
● All input, intermediate and output data is represented as partitions
● Partitions are the basic unit of parallelism
● RDD data is just a collection of partitions
Partition from Input Data
Data in HDFS:  Chunk 1 | Chunk 2 | Chunk 3
                  ↓ Input Format ↓
RDD:           Partition 1 | Partition 2 | Partition 3
Partition example
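The full example lives in the repo linked above; a minimal sketch of inspecting partitions, assuming a running SparkContext sc (the input path is hypothetical):

// Inspect how an input file is split into partitions
val rdd = sc.textFile("hdfs://localhost:9000/data/input.txt") // hypothetical path
println(rdd.partitions.length) // typically one partition per HDFS chunk
// Each partition carries its index within the RDD
rdd.partitions.foreach(partition => println(partition.index))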
Partition and Immutability
● All partitions are immutable
● Every transformation generates new partition
● Partition immutability driven by underneath
storage like HDFS
● Partition immutability allows for fault
recovery
Partitions and Distribution
● Partitions derived from HDFS are distributed
by default
● Partitions also location aware
● Location awareness of partitions allow for
data locality
● For computed data, using caching we can
distribute in memory also
Accessing partitions
● We can access partition together rather
single row at a time
● mapParititons API of RDD allows us that
● Accessing partition at a time allows us to do
some partionwise operation which cannot be
done by accessing single row.
Map partition example
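A minimal sketch of mapPartitions, computing one partial sum per partition instead of touching rows one at a time (the data and partition count are illustrative):

// 4 partitions, each holding a slice of 1 to 100
val numbers = sc.parallelize(1 to 100, 4)
// The function receives the whole partition as an iterator
val partitionSums = numbers.mapPartitions(iterator => Iterator(iterator.sum))
partitionSums.collect().foreach(println) // prints 4 partial sums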
Partitions for transformed data
● Partitioning will be different for key/value pairs generated by a shuffle operation
● Partitioning is driven by the partitioner specified
● By default, HashPartitioner is used
● You can also use your own partitioner
Hash Partitioning
Input RDD:   Partition 1 | Partition 2 | Partition 3
                ↓ hash(key) % no. of partitions ↓
Output RDD:  Partition 1 | Partition 2
Hash partition example
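A minimal sketch of hash partitioning a key/value RDD (the data is illustrative):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// hash(key) % number of partitions decides the target partition
val hashed = pairs.partitionBy(new HashPartitioner(2))
println(hashed.partitioner) // Some(org.apache.spark.HashPartitioner@...)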
Custom Partitioner
● Partition the data according to your data
structure
● Custom partitioning allows control over no of
partitions and the distribution of data across
when grouping or reducing is done
Custom partition example
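A minimal sketch of a custom partitioner; the routing rule here is invented for illustration:

import org.apache.spark.Partitioner

// Send keys starting with "a" to partition 0, everything else to partition 1
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

// pairs is the key/value RDD from the previous sketch
val customPartitioned = pairs.partitionBy(new FirstLetterPartitioner)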
Lookup operation
● Partitioning allows faster lookups
● The lookup operation returns the values for a given key
● Using the partitioner, lookup determines which partition to look in
● Then it only needs to scan that partition
● If no partitioner is specified, it falls back to a filter
Lookup example
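A minimal sketch of lookup, reusing pairs and HashPartitioner from the hash partitioning sketch:

// With a partitioner set, lookup only evaluates the partition the key hashes to
val partitioned = pairs.partitionBy(new HashPartitioner(2)).cache()
println(partitioned.lookup("a")) // Seq(1, 3)
// Without a partitioner, lookup falls back to filtering every partition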
Laziness
Parent (Dependency)
● Each RDD has access to its parent RDD
● Nil is the parent of the first RDD
● Before computing its value, an RDD always computes its parent
● This chain of computation allows for laziness
Subclassing
● Each Spark operator creates an instance of a specific subclass of RDD
● The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
● The subclass allows the RDD to remember the operation performed in the transformation
RDD transformations
val dataRDD = sc.textFile(args(1))
val splitRDD = dataRDD.flatMap(value => value.split(" "))

Lineage (each RDD points to its parent):
splitRDD: FlatMappedRDD → dataRDD: MappedRDD → HadoopRDD → Nil
Laziness example
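A minimal sketch showing that transformations only build the lineage; toDebugString prints the parent chain without running anything (the input path is hypothetical):

val data = sc.textFile("hdfs://localhost:9000/data/input.txt") // nothing computed yet
val words = data.flatMap(value => value.split(" ")) // still nothing computed
println(words.toDebugString) // prints the RDD chain down to HadoopRDD
words.count() // only this action triggers computation of the parents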
Compute
● compute is the function that evaluates each partition of an RDD
● compute is an abstract method of RDD
● Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method
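For reference, the abstract signature in RDD and, roughly, how a subclass like MappedRDD overrides it (paraphrased from the Spark 1.x sources, not copied verbatim):

// In RDD[T] (abstract):
// def compute(split: Partition, context: TaskContext): Iterator[T]

// Roughly MappedRDD: evaluate the parent's iterator for this
// partition, then apply the function f that map() captured
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  firstParent[T].iterator(split, context).map(f)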
RDD actions
val dataRDD = sc.textFile(args(1))
val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
flatMapRDD.collect()

collect() calls runJob, which evaluates the chain partition by partition:
runJob → compute (FlatMappedRDD) → compute (MappedRDD) → compute (HadoopRDD) → Nil
runJob API
● runJob is the API used to implement actions
● runJob takes each partition and lets you evaluate it
● All Spark actions internally use the runJob API
Run job example
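A minimal sketch of implementing a tiny action with runJob (which lives on SparkContext); here we count rows per partition:

val rdd = sc.parallelize(1 to 100, 4)
// runJob applies the function to every partition and returns one result per partition
val countsPerPartition: Array[Int] =
  sc.runJob(rdd, (iterator: Iterator[Int]) => iterator.size)
countsPerPartition.foreach(println) // 4 counts, one per partition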
Caching
● cache internally uses the persist API
● persist sets a specific storage level for a given RDD
● The Spark context tracks persistent RDDs
● On first evaluation, partitions are put into memory by the block manager
Block manager
● Handles all in-memory data in Spark
● Responsible for
○ Cached data (BlockRDD)
○ Shuffle data
○ Broadcast data
● Each partition is stored in a block with id (RDD.id, partition_index)
How caching works
● The partition iterator checks the storage level
● If a storage level is set, it calls cacheManager.getOrCompute(partition)
● Since the iterator runs for every RDD evaluation, caching is transparent to the user
Caching example
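A minimal sketch of caching; the second action is served from the block manager instead of being recomputed (the input path is hypothetical):

val logs = sc.textFile("hdfs://localhost:9000/data/logs.txt") // hypothetical path
val errors = logs.filter(line => line.contains("ERROR")).cache()
errors.count() // first action: computes partitions and puts them in memory
errors.count() // second action: partitions come from the block manager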
Extending Spark API
Why?
● Domain-specific operators
○ Allow developers to express domain-specific calculations in a cleaner way
○ Improve code readability
○ Easy to maintain
● Domain-specific RDDs
○ A better way of expressing domain data
○ Control over partitioning and distribution
DSL Example
● salesRecordRDD: RDD[SalesRecord]
● To compute the total sales
○ In plain Spark
■ salesRecord.map(_.itemValue).sum
○ In our DSL
■ salesRecord.totalSales
● Our DSL hides the internal representation and improves readability
How to Extend
● Custom operators on RDDs
○ Domain-specific operators on specific RDDs
○ Use the Scala implicit mechanism
○ Feel and work like built-in operators
● Custom RDDs
○ Extend the RDD API to create our own RDD
○ Combined with custom operators, this is very powerful
Implicits in Scala
● A way of extending types on the fly
● Implicits are also used to pass parameters to functions implicitly, read from the enclosing scope
● In our example, we just use the type extension facility
● All implicits are compile-time checked
Implicits example
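A minimal sketch of type extension with implicits; RichInt and square are invented for illustration:

object IntExtensions {
  // Adds a square method to every Int on the fly
  implicit class RichInt(val value: Int) extends AnyVal {
    def square: Int = value * value
  }
}

import IntExtensions._
println(3.square) // 9, resolved and checked at compile time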
Adding operators to RDDs
● We use the Scala implicit facility to add custom operators to our RDDs
● These operators only show up on our RDDs
● All implicit conversions are handled by Scala, not by Spark
● Spark internally uses similar tricks for PairRDDs
Custom operator Example
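A minimal sketch of adding the totalSales operator from the DSL example via an implicit conversion; the SalesRecord fields are assumed for illustration:

import org.apache.spark.rdd.RDD

case class SalesRecord(id: String, itemValue: Double)

// Wrapper class carrying the domain-specific operator
class SalesRecordFunctions(rdd: RDD[SalesRecord]) {
  def totalSales: Double = rdd.map(_.itemValue).sum()
}

// The implicit conversion makes totalSales appear only on RDD[SalesRecord]
object SalesRecordFunctions {
  implicit def toSalesRecordFunctions(rdd: RDD[SalesRecord]): SalesRecordFunctions =
    new SalesRecordFunctions(rdd)
}

With the implicit in scope, salesRecordRDD.totalSales compiles; Scala inserts the conversion, and Spark never sees it. This mirrors how Spark itself exposes PairRDD operators.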
Extending RDD
● Extending the RDD API allows us to create our own custom RDD structure
● Custom RDDs allow control over computation
● You can change partitioning, locality and evaluation depending on your requirements
Discount RDD example
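A minimal sketch along the lines of the example slide's name; the discount logic and class names are illustrative, reusing SalesRecord from the previous sketch:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class DiscountRDD(prev: RDD[SalesRecord], discountPercentage: Double)
  extends RDD[SalesRecord](prev) {

  // Reuse the parent's partitioning unchanged
  override protected def getPartitions: Array[Partition] =
    firstParent[SalesRecord].partitions

  // Evaluate the parent partition lazily, applying the discount per record
  override def compute(split: Partition, context: TaskContext): Iterator[SalesRecord] =
    firstParent[SalesRecord].iterator(split, context).map { record =>
      record.copy(itemValue = record.itemValue * (1 - discountPercentage))
    }
}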