2. What is Spark?
• In-Memory Map/Reduce Engine
• Spark was developed in 2009 by the Berkeley AMPLab
• Converted to an Apache project in 2013
• Scala based
• Scala, Java, and Python APIs
3. Most Active Big Data Project within Apache
Data from Spark Summit 2014
5. Spark vs. Hadoop
• Hadoop Map/Reduce limitations
  • High latency
  • No in-memory caching
  • Map/Reduce code is complicated to write
• Spark
  • In-memory processing
  • Simple, expressive API
  • Can run standalone, even on Windows
  • Up to 100x faster in memory and 10x faster on disk
7. Spark Word Count Example
val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
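For intuition about what the pipeline computes, here is a hedged sketch of the same logic on a local Scala collection, with no Spark required: `flatMap` and `map` are the standard collection methods, and `groupBy` plus a sum plays the role of `reduceByKey` (the object name and sample lines are illustrative only):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")
    val counts = lines
      .flatMap(line => line.split(" "))  // split each line into words
      .map(word => (word, 1))            // pair each word with a count of 1
      .groupBy(_._1)                     // local stand-in for reduceByKey's grouping
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s per word
    println(counts("to")) // "to" appears twice across the two lines
  }
}
```

In real Spark the grouping and summing happen in a distributed shuffle, but the per-word result is the same.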
8. RDD – Resilient Distributed Dataset
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to write to disk if too large for memory
• Parallelized Collections
  • Typically you want 2-4 slices per CPU in your cluster
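The key split above is that transformations are lazy and actions force evaluation. A hedged local analogy using a plain Scala view (illustrative only, not the Spark API; the counter and object name are made up for the demo):

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // A view is lazy, like an RDD transformation: mapping does no work yet.
    val squares = (1 to 4).view.map { x => evaluated += 1; x * x }
    println(evaluated)      // nothing evaluated so far
    val total = squares.sum // like an action: forces the computation
    println(evaluated)      // all four elements were now evaluated
    println(total)          // 1 + 4 + 9 + 16
  }
}
```

In Spark the same pattern means a chain of transformations costs nothing until an action such as `reduce` or `count` triggers the job.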
11. Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
12. Persistence storage levels
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store RDD in serialized format in Tachyon
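A minimal sketch of choosing a level, assuming `spark` is the SparkContext handle from the word-count slide and reusing its `"file.name"` placeholder; this needs a running Spark deployment, so it is illustrative rather than standalone:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named `spark`, as in the word-count example.
val words = spark.textFile("file.name").flatMap(line => line.split(" "))

// A storage level can only be assigned once per RDD:
words.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit in memory to disk
// words.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

words.count() // first action computes the RDD and persists its partitions
words.count() // later actions reuse the persisted partitions
```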
13. Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library (MLlib) built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)
14. Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project