IBM Spark Meetup - RDD & Spark Basics
- 1. © 2015 IBM Corporation
RDD Deep Dive
• RDD Basics
• How to create
• RDD Operations
• Lineage
• Partitions
• Shuffle
• Types of RDDs
• Extending RDD
• Caching in RDD
- 2. © 2015 IBM Corporation
RDD Basics
• RDD (Resilient Distributed Dataset)
• Distributed collection of objects
• Resilient - Ability to re-compute missing partitions
(node failure)
• Distributed – Split across multiple partitions
• Dataset - Can contain any type, Python/Java/Scala
Object or User defined Object
• Fundamental unit of data in Spark
- 3. © 2015 IBM Corporation
RDD Basics – How to create
Two ways
Loading external datasets
Spark supports a wide range of sources
Accesses HDFS data through Hadoop's InputFormat & OutputFormat
Supports custom Input/Output formats
val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
textFile("/my/directory/*") or textFile("/my/directory/*.gz") reads a whole directory or a glob of files
SparkContext.wholeTextFiles returns (filename, content) pairs
Parallelizing a collection in the driver program
val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
- 4. © 2015 IBM Corporation
RDD Operations
Two types of operations
Transformation
Action
Transformations are lazy, nothing actually happens until an action is
called.
Action triggers the computation
Action returns values to driver or writes data to external storage.
- 5. © 2015 IBM Corporation
Lazy Evaluation
Transformations on an RDD are not performed immediately
Spark internally records metadata to track the operations
Loading data into an RDD is also lazily evaluated
Lazy evaluation reduces the number of passes over the data by
grouping operations
MapReduce – the burden is on the developer to merge operations
into a complex map.
If an RDD is not persisted, its complete lineage is re-computed
every time an action uses it.
- 6. © 2015 IBM Corporation
RDD In Action
sc.textFile("hdfs://file.txt")
.flatMap(line=>line.split(" "))
.map(word => (word,1))
.reduceByKey(_+_)
.collect()
[Figure: word-count data flow]
Input block 1: I scream you scream lets all scream for icecream!
Input block 2: I wish I were what I was when I wished I were what I am.
After flatMap (block 1): I, scream, you, scream, lets, all, scream, for, icecream
After map: (I,1) (scream,1) (you,1) (scream,1) (lets,1) (all,1) (scream,1) (for,1) (icecream,1)
After reduceByKey: (icecream,1) (scream,3) (you,1) (lets,1) (I,1) (all,1) (for,1)
- 8. © 2015 IBM Corporation
RDD Partition
Partition Definition
Fragments of RDD
Fragmentation allows Spark to execute in Parallel.
Partitions are distributed across the cluster (Spark workers)
Partitioning
Impacts parallelism
Impacts performance
- 9. © 2015 IBM Corporation
Importance of partition Tuning
Too few partitions
Less concurrency, unused cores.
More susceptible to data skew
Increased memory pressure for groupBy, reduceByKey,
sortByKey, etc.
Too many partitions
Framework overhead (scheduling latency exceeds the
time needed for the actual task)
Excessive CPU context switching
Need “reasonable number” of partitions
Commonly between 100 and 10,000 partitions
Lower bound: At least ~2x number of cores in cluster
Upper bound: Ensure tasks take at least 100ms
- 10. © 2015 IBM Corporation
How Spark Partitions data
Input data partition
Shuffle transformations
Custom Partitioner
- 11. © 2015 IBM Corporation
Partition - Input Data
Spark uses the same classes as Hadoop to perform Input/Output
sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat
Knobs that determine the number of partitions
dfs.block.size – default 128MB (Hadoop 2.0)
numPartitions – can be used to increase the number of partitions;
0 is treated as 1 when computing goalSize
mapreduce.input.fileinputformat.split.minsize – default 1KB
Partition size = Max(minsize, Min(goalSize, blockSize))
goalSize = totalInputSize / numPartitions
Examples (blockSize, numPartitions, minsize) for 640MB total input size
32MB, 0, 1KB – defaults
Max(1KB, Min(640MB, 32MB)) = 32MB, i.e. 20 partitions
32MB, 30, 1KB – want more partitions
Max(1KB, Min(21.3MB, 32MB)) = 21.3MB, i.e. 30 partitions
32MB, 5, 1KB – asking for fewer partitions does not make them bigger
Max(1KB, Min(128MB, 32MB)) = 32MB, i.e. still 20 partitions
32MB, 0, 64MB – bigger partitions via minsize
Max(64MB, Min(640MB, 32MB)) = 64MB, i.e. 10 partitions
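The formula above can be sketched in plain Scala to check the four examples. This is a simplified model for a single large file, not Spark's or Hadoop's actual code: the object and method names are made up for illustration, and the 1.1 slack factor mirrors the split slop Hadoop's FileInputFormat applies to the last split.

```scala
object SplitSizeSketch {
  val KB: Long = 1024L
  val MB: Long = 1024L * KB

  // Partition size = max(minSize, min(goalSize, blockSize)),
  // where goalSize = totalSize / numPartitions (0 is treated as 1).
  def splitSize(blockSize: Long, numPartitions: Int, minSize: Long, totalSize: Long): Long = {
    val goalSize = totalSize / math.max(numPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Number of splits for one file: keep cutting full-size splits while more
  // than 1.1 splits' worth of data remains, then emit the remainder as the last split.
  def partitionCount(blockSize: Long, numPartitions: Int, minSize: Long, totalSize: Long): Long = {
    val size = splitSize(blockSize, numPartitions, minSize, totalSize)
    var remaining = totalSize
    var count = 0L
    while (remaining.toDouble / size > 1.1) {
      remaining -= size
      count += 1
    }
    if (remaining > 0) count + 1 else count
  }
}
```

For instance, `SplitSizeSketch.partitionCount(32 * MB, 30, KB, 640 * MB)` models the "want more partitions" case, and passing `64 * MB` as `minSize` models the last case.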
- 12. © 2015 IBM Corporation
Partition - Shuffle transformations
All shuffle transformations take a parameter
for the desired number of partitions
Default behavior - Spark uses HashPartitioner.
If spark.default.parallelism is set, that value is the number of
partitions
If spark.default.parallelism is not set, the largest upstream
RDD's number of partitions is used
Reduces the chance of running out of memory
Shuffle transformations
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce (with shuffle = true)
9. repartition
10. repartitionAndSortWithinPartitions
- 13. © 2015 IBM Corporation
Partition - Repartitioning
RDD provides two operators
repartition(numPartitions)
Can increase or decrease the number of partitions
Always shuffles internally
Expensive because of the shuffle
For decreasing partitions prefer coalesce
coalesce(numPartitions, shuffle = [true/false])
Decreases the number of partitions
Uses narrow dependencies
Avoids a shuffle by default (shuffle = false)
For a drastic reduction (e.g. to 1 partition), pass shuffle = true so
the upstream tasks keep their parallelism
- 14. © 2015 IBM Corporation
Custom Partitioner
Partition the data according to use case & data structure
Custom partitioning gives control over the number of partitions and
the distribution of data
Extend the Partitioner class and implement getPartition(key) &
numPartitions
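A minimal sketch of such a partitioner, assuming spark-core is on the classpath. The class name RegionPartitioner and the "region:id" key layout are hypothetical, chosen only to show the two members that Partitioner requires.

```scala
import org.apache.spark.Partitioner

// Hypothetical example: keys look like "<region>:<id>"; all records of one
// region are routed to the same partition.
class RegionPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  override def getPartition(key: Any): Int = {
    val region = key.toString.takeWhile(_ != ':')
    // floorMod keeps the result in [0, numPartitions) even for negative hash codes
    Math.floorMod(region.hashCode, numPartitions)
  }

  // Equality lets Spark recognise RDDs partitioned the same way.
  override def equals(other: Any): Boolean = other match {
    case p: RegionPartitioner => p.numPartitions == numPartitions
    case _                    => false
  }

  override def hashCode: Int = numPartitions
}
```

On a pair RDD it would be applied with `pairs.partitionBy(new RegionPartitioner(8))`, where `pairs` is assumed to be keyed by strings of that form.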
- 16. © 2015 IBM Corporation
Shuffle - GroupByKey Vs ReduceByKey
val wordCountsWithGroup = rdd
.map(word => (word, 1)) // rdd holds words; pair each word with a count of 1
.groupByKey()
.map(t => (t._1, t._2.sum))
.collect()
- 17. © 2015 IBM Corporation
Shuffle - GroupByKey Vs ReduceByKey
val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
.reduceByKey(_ + _)
.collect()
- 18. © 2015 IBM Corporation
The Shuffle
Redistribution of data among partitions between stages.
Most performance, reliability and scalability issues in Spark occur
within the shuffle.
Like MapReduce, the Spark shuffle uses a pull model.
Has evolved continuously and is still an area of research in Spark
- 19. © 2015 IBM Corporation
Shuffle Overview
• Spark runs a job stage by stage.
• Stages are built by the DAGScheduler according to the RDDs'
ShuffleDependency
• e.g. ShuffledRDD / CoGroupedRDD have a ShuffleDependency
• Many operators create a ShuffledRDD / CoGroupedRDD
under the hood.
• repartition / combineByKey / groupBy / reduceByKey / cogroup
• Many other operators call into the operators above
• e.g. the various join operators call cogroup.
• Each ShuffleDependency maps to one stage in the Spark job and
leads to a shuffle.
- 20. © 2015 IBM Corporation
You have seen this
[Figure: job DAG over RDDs A to G built with map, union, groupBy and join, divided into Stage 1, Stage 2 and Stage 3]
- 21. © 2015 IBM Corporation
Shuffle is Expensive
• During a shuffle, data no longer stays in memory only; it gets
written to disk.
• In Spark, the shuffle process might involve
• Data partitioning: which might involve very expensive data
sorting etc.
• Data ser/deser: to enable data to be transferred over the
network or across processes.
• Data compression: to reduce IO bandwidth etc.
• Disk IO: probably multiple times for a single data block
• e.g. shuffle spill, merge and combine
- 22. © 2015 IBM Corporation
Shuffle History
Shuffle module in Spark has evolved over time.
Spark(0.6-0.7) – Same code path as RDD’s persist method.
MEMORY_ONLY , DISK_ONLY options available.
Spark (0.8-0.9)
- Separate code for shuffle, ShuffleBlockManager &
BlockObjectWriter for shuffle only.
- Shuffle optimization - Consolidate Shuffle Write.
Spark 1.0 – Introduced pluggable shuffle framework
Spark 1.1 – Sort based Shuffle Implementation
Spark 1.2 - Netty transfer Implementation. Sort based shuffle is
default now.
Spark 1.2+ - External shuffle service etc.
- 23. © 2015 IBM Corporation
Understanding Shuffle
Input Aggregation
Types of Shuffle
Hash based
Basic Hash Shuffle
Consolidate Hash Shuffle
Sort Based Shuffle
- 24. © 2015 IBM Corporation
Input Aggregation
Like MapReduce, Spark performs aggregation (a combiner) on the map side.
Aggregation is done in ShuffleMapTask using
AppendOnlyMap (in-memory hash table combiner)
Keys are never removed, only values get updated
ExternalAppendOnlyMap (in-memory and on-disk hash table combiner)
An append-only hash map that spills data to disk when memory is insufficient
Shuffle file in-memory buffer – shuffle writes go to an in-memory buffer before
being written to a shuffle file.
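To see why map-side aggregation matters (and why reduceByKey usually shuffles less than groupByKey), the idea can be modelled in plain Scala without Spark. This is only an illustration with made-up names: each inner Seq stands for one map partition, and an immutable Map plays the role of the in-memory hash table combiner.

```scala
object MapSideCombineSketch {
  type Record = (String, Int)

  // Combiner: merge values per key; a key is added once and its value updated afterwards.
  def combine(records: Seq[Record]): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  // Without a combiner (groupByKey style) every record crosses the shuffle.
  def shuffledWithoutCombine(mapPartitions: Seq[Seq[Record]]): Int =
    mapPartitions.map(_.size).sum

  // With a combiner (reduceByKey style) each map partition ships one record per key.
  def shuffledWithCombine(mapPartitions: Seq[Seq[Record]]): Int =
    mapPartitions.map(p => combine(p).size).sum

  // Reduce side: merge the partial results coming from all map partitions.
  def reduceSide(mapPartitions: Seq[Seq[Record]]): Map[String, Int] =
    combine(mapPartitions.flatMap(p => combine(p).toSeq))
}
```

The final counts are the same either way; only the number of records moved between stages differs.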
- 25. © 2015 IBM Corporation
Shuffle Types – Basic Hash Shuffle
Hash-based shuffle (selected via spark.shuffle.manager) hash-partitions the data
for the reducers
Each map task writes each bucket to its own file.
#Map Tasks = M
#Reduce Tasks = R
#Shuffle Files = M*R, #In-Memory Buffers = M*R
- 26. © 2015 IBM Corporation
Shuffle Types – Basic Hash Shuffle
Problem
Let's use 100KB as the buffer size
We have 10,000 reducers
10 mapper tasks per executor
In-memory buffer size = 100KB * 10,000 * 10
Buffer need is 10GB per executor
This amount of buffer memory is not acceptable, so this
implementation can't support 10,000 reducers.
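The arithmetic above as a small plain-Scala sketch. The names are illustrative, and 100KB is taken as 100,000 bytes so that the result matches the 10GB figure.

```scala
object HashShuffleBufferSketch {
  // Buffer memory per executor = buffer size * reducers * map tasks per executor
  def bufferBytesPerExecutor(bufferBytes: Long, reducers: Long, mapTasksPerExecutor: Long): Long =
    bufferBytes * reducers * mapTasksPerExecutor

  // The numbers from the example: 100KB buffers, 10,000 reducers, 10 map tasks.
  val exampleBytes: Long = bufferBytesPerExecutor(100L * 1000, 10000L, 10L)
  val exampleGB: Double = exampleBytes / 1e9
}
```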
- 27. © 2015 IBM Corporation
Shuffle Types – Consolidate Hash Shuffle
Solution to decrease the in-memory buffer size and the number of files.
Within an executor, map tasks write each bucket to a segment of a shared file.
#Shuffle files per executor = #Reducers (R)
#In-memory buffers per executor = #Reducers (R)
- 28. © 2015 IBM Corporation
Shuffle Types – Sort Based Shuffle
Consolidated hash shuffle still needs one file for each reducer.
- Total C*R intermediate files, C = # of executors running map
tasks
Still too many files (e.g. ~10k reducers)
Needs significant memory for compression & serialization
buffers.
"Too many open files" issue.
Sort-based shuffle is similar to the map-side shuffle of
MapReduce
Introduced in Spark 1.1, now the default shuffle
- 29. © 2015 IBM Corporation
Shuffle Types – Sort Based Shuffle
Map output records from each task are kept in memory as long as they fit.
Once memory is full, data gets sorted by partition and spilled to a single file.
Each map task generates one data file and one index file
An external sorter does the sorting work
If a map-side combiner is required, data is sorted by key and partition,
otherwise only by partition
When #reducers <= 200 (spark.shuffle.sort.bypassMergeThreshold) and no map-side combine is needed,
there is no sorting: the hash approach writes one file per reducer and then merges them into a single file
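The file counts of the three shuffle implementations can be compared with a few lines of plain Scala. This is a sketch using the formulas from these slides (M map tasks, R reducers, C executors running map tasks); the function names and sample values are illustrative only.

```scala
object ShuffleFileCountSketch {
  // Basic hash shuffle: every map task writes one file per reducer.
  def basicHash(mapTasks: Long, reducers: Long): Long = mapTasks * reducers

  // Consolidated hash shuffle: one file per reducer per executor running map tasks.
  def consolidatedHash(executors: Long, reducers: Long): Long = executors * reducers

  // Sort-based shuffle: one data file plus one index file per map task.
  def sortBased(mapTasks: Long): Long = 2 * mapTasks
}
```

With, say, 1,000 map tasks, 10,000 reducers and 10 executors, the counts drop from 10,000,000 to 100,000 to 2,000 files.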
- 30. © 2015 IBM Corporation
Shuffle Reader
On the reader side, both sort and hash shuffle use the hash shuffle reader
On the reducer side, a set of threads fetches remote map output blocks
Once a block arrives, its records are deserialized and passed into a
result queue.
Records are passed to ExternalAppendOnlyMap; for ordering
operations like sortByKey, records are passed to ExternalSorter.
[Figure: each reduce task fetches its buckets from the map outputs and feeds them through an aggregator]
- 31. © 2015 IBM Corporation
Types of RDDs - RDD Interface
Base for all RDDs (RDD.scala), consists of
A Set of partitions (“splits” in Hadoop)
A List of dependencies on parent RDDs
A Function to compute the partition from its
parents
Optional preferred locations for each partition
A Partitioner that defines the partitioning strategy
(hash/range)
Basic operations like map, filter, persist etc
[Figure: partitions, dependencies and compute capture the lineage; preferredLocations and partitioner enable optimized execution; map, filter, persist etc. are the operations]
- 32. © 2015 IBM Corporation
Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(partition) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
- 33. © 2015 IBM Corporation
Example: MapPartitionsRDD
partitions = same as parent RDD
dependencies = "one-to-one" on parent RDD
compute(partition) = apply map on parent
preferredLocations(part) = none (ask parent)
partitioner = none
- 34. © 2015 IBM Corporation
Example: CoGroupedRDD
partitions = one per reduce task
dependencies = could be narrow or wide dependency
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
- 35. © 2015 IBM Corporation
Extending RDDs
Extend RDDs
To add domain-specific transformations/actions
Allows developers to express domain-specific calculations in a
cleaner way
Improves code readability
Easier to maintain
Domain-specific RDD
Better way to express domain-specific data
Better control over partitioning and distribution
Way to add a new input data source
- 36. © 2015 IBM Corporation
How to Extend
Add custom operators to RDD
Use Scala implicits
Feels and works like a built-in operator
You can add an operator to a specific RDD type or to all RDDs
Custom RDD
Extend the RDD API to create our own RDD
Implement the compute & getPartitions abstract methods
- 37. © 2015 IBM Corporation
Implicit Class
Creates an extension method for an existing type
Introduced in Scala 2.10
Implicits are checked at compile time. An implicit class gets resolved
into a class definition with an implicit conversion
We will use an implicit class to add a new method to RDD
- 38. © 2015 IBM Corporation
Adding new Operator to RDD
We will use the Scala implicit feature to add a new operator to an
existing RDD
This operator will show up only on our RDD type
Implicit conversions are handled by Scala
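A minimal sketch of such an operator, assuming spark-core is on the classpath. The Sale type, the SalesOperators object and the totalSales operator are hypothetical names chosen for illustration; the operator is visible only on RDD[Sale].

```scala
import org.apache.spark.rdd.RDD

// Hypothetical domain type, used only for illustration.
case class Sale(item: String, amount: Double)

object SalesOperators {
  // The compiler wraps an RDD[Sale] in SalesRDDOps whenever totalSales is
  // called on it and this implicit class is in scope.
  implicit class SalesRDDOps(rdd: RDD[Sale]) {
    // New action: sum of all sale amounts.
    def totalSales: Double = rdd.map(_.amount).sum()
  }
}
```

After `import SalesOperators._`, a call such as `salesRDD.totalSales` reads like a built-in operator, where `salesRDD` is assumed to be an RDD[Sale].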
- 39. © 2015 IBM Corporation
Custom RDD Implementation
Extending RDD allows you to create your own custom RDD
structure
A custom RDD allows control over computation, partitioning &
locality information
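A minimal sketch of a custom RDD, assuming spark-core is on the classpath. IntRangeRDD and RangePartition are hypothetical names: the RDD has no parents and produces the integers in [lo, hi), split into numSlices partitions, by implementing getPartitions and compute.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: the slice [start, end) of the overall range.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical source RDD; Nil means it has no parent RDDs (like HadoopRDD).
class IntRangeRDD(sc: SparkContext, lo: Int, hi: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {
  require(numSlices > 0, "numSlices must be positive")

  // Called on the driver: describes how the data is split.
  override protected def getPartitions: Array[Partition] = {
    val total = hi - lo
    Array.tabulate[Partition](numSlices) { i =>
      val start = lo + (i.toLong * total / numSlices).toInt
      val end = lo + ((i + 1).toLong * total / numSlices).toInt
      new RangePartition(i, start, end)
    }
  }

  // Called on the executors: produces the records of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}
```

Locality could be added in the same way by overriding getPreferredLocations; it is left out here to keep the sketch short.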
- 40. © 2015 IBM Corporation
Caching in RDD
Spark allows caching/persisting an entire dataset in memory
Persisting an RDD in cache
The first time it is computed, it is kept in memory
The cached partitions are reused in the next set of operations
Fault-tolerant: lost partitions are recomputed in case of failure
Caching is a key tool for interactive and iterative algorithms
Persist supports different storage levels
Storage level - in memory, on disk or both, Tachyon
Serialized vs deserialized
- 41. © 2015 IBM Corporation
Caching In RDD
SparkContext tracks persistent RDDs
Block Manager puts partitions in memory when first evaluated
Caching is lazily evaluated: no caching without an action.
Shuffle also keeps its data in cache after shuffle operations.
We still need to cache shuffled RDDs
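A small sketch of the lazy caching behaviour described above, assuming spark-core is on the classpath and an existing SparkContext is passed in. The object name and the sample data are illustrative only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def run(sc: SparkContext): (Long, Long) = {
    val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)

    // persist only marks the RDD; nothing is computed or cached yet.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // First action: partitions are computed and stored by the block manager.
    val total = squares.reduce(_ + _)

    // Second action: can reuse the cached partitions instead of recomputing the map.
    val evens = squares.filter(_ % 2 == 0).count()

    squares.unpersist()
    (total, evens)
  }
}
```

Other storage levels, such as `StorageLevel.MEMORY_ONLY_SER` or `StorageLevel.DISK_ONLY`, can be passed to persist in the same way.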