This presentation was delivered by Adarsh Pannu at IBM's Insight Conference in Nov 2015. For a recording, visit: https://www.youtube.com/watch?v=Tbm7HIlmwJQ
The presentation provides an overview of Apache Spark, a general-purpose big data processing engine built around speed, ease of use and sophisticated analytics. It enumerates the benefits of incorporating Spark in the enterprise, including how it allows developers to write fully-featured distributed applications ranging from traditional data processing pipelines to complex machine learning. The presentation uses the Airline "On Time" data set to explore various components of the Spark stack.
9. Spark in Action
Spark is essentially a collection of APIs. The best way to appreciate Spark's value is via examples, so let's use a real-world dataset.

On-Time Arrival Performance – a record of every US airline flight since the 1980s:
Year, Month, DayofMonth
UniqueCarrier, FlightNum
DepTime, ArrTime, ActualElapsedTime
ArrDelay, DepDelay
Origin, Dest, Distance
Cancelled, ...

"Where, When, How Long? ..."
10. Spark in Action (contd.)
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...

[Callouts highlight these fields: UniqueCarrier, Origin, Dest, Year/Month/DayofMonth, FlightNum, ActualElapsedTime, Distance, DepTime]
11. Spark in Action (contd.)
Which airports had the most flight cancellations?
_________________________________________
Let’s compute this using Spark “Core”
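Before the answer, here is a sketch of that Spark Core pipeline. To keep it self-contained, it uses plain Scala collections over a few made-up sample rows in the CSV layout above (Origin is column 16, Cancelled is column 21); on a cluster, `lines` would come from `sc.textFile(...)` and the `groupMapReduce` step would be `map(o => (o, 1)).reduceByKey(_ + _)` on an RDD.

```scala
// Made-up sample records in the airline CSV layout shown earlier.
val lines = Seq(
  "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,ORD,TPA,810,4,8,1,A,0",
  "2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,ORD,TPA,810,5,10,1,B,0",
  "2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,ATL,BWI,515,3,17,1,A,0",
  "2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,LGA,BWI,515,3,7,0,,0"
)

val topCancelled = lines
  .map(_.split(","))                                    // parse each CSV record
  .filter(fields => fields(21) == "1")                  // keep only cancelled flights
  .groupMapReduce(fields => fields(16))(_ => 1)(_ + _)  // count per Origin
  .toSeq
  .sortBy { case (_, count) => -count }                 // most cancellations first
  .take(5)
```

The shape of the computation (filter, count by key, sort, take) is exactly what the RDD version does; only the data types change.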
14. Spark in Action (contd.)
Which airports had the most flight cancellations?
Answer: In 2008, it was:
ORD
ATL
DFW
LGA
EWR
We’re not surprised, are we? ☺
15. Spark builds Data Pipelines (DAGs)
Directed (arrows), Acyclic (no loops), Graph
• Nodes in the pipeline are "RDDs"
• Leaf nodes represent base datasets
• Intermediate nodes represent computations
16. Resilient Distributed Datasets (RDD)
Example – flight records partitioned across three machines:

CO780, IAH, MCI
CO683, IAH, MSY
CO1707, TPA, IAH
...
UA620, SJC, ORD
UA675, ORD, SMF
UA676, ORD, LGA
...
DL282, ATL, CVG
DL2128, CHS, ATL
DL2592, PBI, LGA
DL417, FLL, ATL
...
• Immutable collection of objects, distributed across machines
• Can be operated on in parallel
• Can hold any kind of data: Hadoop datasets, parallelized Scala collections, RDBMS or NoSQL, ...
• Can be cached, and can recover from failures
17. Spark Execution
• Data pipelines are broken up into "stages"
• Each stage is processed in parallel (across partitions)
• At stage boundaries, data is "shuffled" or returned to the client
• Of course, all of this is done under the covers for you!
18. Wait ... this is just like MapReduce? What’s the difference?
19. Spark vs MapReduce
MapReduce: Map → Reduce (via HDFS) → Map → Reduce. Simple pipelines, heavyweight JVMs, communication through HDFS, no explicit memory exploitation.

Spark: Map → Filter → Reduce → Join → Sort, pipelined through memory and local disk. Complex pipelined DAGs, threads (vs. JVMs), memory and disk exploitation, caching, fast shuffle, and more...
20. RDD Operations
• Transformations
  – Create a new RDD (dataset) from an existing one
  – Lazily evaluated
  – Examples: map, flatMap, filter, reduceByKey, groupByKey, mapPartitions, join, cogroup, coalesce, union, distinct, intersection, sortByKey, sample, repartition, ...
• Actions
  – Run a computation (i.e. a job), optionally returning data to the client
  – Examples: count, collect, cache, first, take, saveAsTextFile, ...

Spark has rich functional, relational & iterative APIs.
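The transformation/action split can be sketched with plain Scala's `LazyList`, which shares the lazy-evaluation behavior of an RDD lineage: `map` and `filter` only record a recipe, and nothing runs until a result is forced (the "action").

```scala
// Count how many times the map function actually runs.
var evaluated = 0
val base = LazyList.from(1).take(10)
val doubled = base.map { x => evaluated += 1; x * 2 }.filter(_ % 3 == 0)

// Nothing has been computed yet -- `doubled` is just a plan, like a transformed RDD.
val evaluatedBeforeAction = evaluated

// Forcing the result is the "action" (like collect()): only now does work happen.
val result = doubled.toList
```

After `toList`, `evaluated` is 10 (every element was mapped so the filter could run), while before it was still 0.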
21. RDD Persistence and Caching
• Spark can persist RDDs in several ways:
  – In-memory (a.k.a. caching)
  – On disk
  – Both
• This allows Spark to avoid re-computing portions of a DAG
  – Beneficial for repeating workloads such as machine learning
• But you have to tell Spark which RDDs to persist (and how)

[Diagram: a DAG (data → cancelled flights → airports); after count materializes the cached RDD, take(5) skips the stages leading up to it]
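The recomputation that persistence avoids can be sketched in plain Scala. Here the made-up `expensive()` stands in for recomputing an RDD's lineage, and `counter` tracks how many times it runs; in Spark you would call `rdd.cache()` before the first action.

```scala
var counter = 0
def expensive(): Seq[Int] = { counter += 1; (1 to 100).map(_ * 2) }

// Unpersisted: each action recomputes the whole lineage
// (like calling rdd.count and then rdd.take(5) on an unpersisted RDD).
val total = expensive().length
val firstFive = expensive().take(5)
val uncachedRuns = counter

// "Cached": materialize once, then both actions reuse the result.
counter = 0
lazy val cached = expensive()
val totalCached = cached.length
val firstFiveCached = cached.take(5)
val cachedRuns = counter
```

Two actions cost two full computations without caching, but only one with it.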
22. Common misconceptions around Spark
“Spark requires you to load all your data into memory.”
“Spark is a replacement for Hadoop.”
“Spark is only for machine learning.”
“Spark solves global hunger and world peace.”
23. Sorting 100 TB of data
Hadoop in 2013: 2100 machines, 50,400 cores – 72 mins
Spark in 2014 (current record): 206 virtual machines, 6952 cores – 23 mins
Also sorted 1 PB in 234 mins on similar hardware (it scales)
Source: sortbenchmark.org
25. Spark in Action (contd.)
Which airports had the most flight cancellations?
____________________________________________
Can we compute this using SQL?
Yes!
Can we do that without leaving my Spark environment?
Yes!
26. Spark SQL Example
// Specify table schema
val schema = StructType(...)

// Define a data frame using the schema
val rowRDD = flights.map(e => Row.fromSeq(e))
val flightsDF = sqlContext.createDataFrame(rowRDD, schema)

// Give the data frame a name
flightsDF.registerTempTable("flights")

// Use SQL!
val results = sqlContext.sql("""SELECT Origin, count(*) Cnt FROM flights
                                WHERE Cancelled = '1'
                                GROUP BY Origin ORDER BY Cnt DESC""")

RDD <-> DataFrame interoperability
Ma.. Look! I can use SQL seamlessly with RDDs!
27. Spark SQL and DataFrames
• Spark SQL
  – SQL engine with an optimizer
  – Not ANSI compliant
  – Operates on DataFrames (RDDs with schema)
  – Primary goal: write (even) less code
  – Secondary goal: provide a way around the performance penalty of non-JVM languages (e.g. Python or R)
• DataFrame API
  – Programmatic API to build SQL-ish constructs
  – Shares machinery with the SQL engine

These allow you to create RDD DAGs using a higher-level language / API.
28. Spark in Action (contd.)
Which flights are likely to be delayed? And by how much?
____________________________________________
Can we build a predictive model for this?
Yes!
Again, can we do that without leaving my Spark environment?
Yes!
29. Spark MLLib Example
val trainRDD = sqlContext.sql("""SELECT ArrDelay, DepDelay, ArrTime,
    DepTime, Distance FROM flights""").rdd.map {
  case Row(label: String, features @ _*) => // Build training set
    LabeledPoint(toDouble(label),
      Vectors.dense(features.map(toDouble(_)).toArray))
}

// Train a decision tree (regression) model to predict flight delays
val model = DecisionTree.train(trainRDD, Regression, Variance, maxDepth = 5)

// Evaluate model on training set
val valuesAndPreds = trainRDD.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Tight SQL/DataFrame/RDD -> ML integration
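The `(label, prediction)` pairs produced above are typically scored with mean squared error. A plain-Scala sketch with made-up numbers; on an RDD this would be `valuesAndPreds.map { case (l, p) => math.pow(l - p, 2) }.mean()`.

```scala
// Hypothetical (label, prediction) pairs, e.g. actual vs. predicted arrival delay in minutes.
val valuesAndPreds = Seq((10.0, 12.0), (-5.0, -4.0), (30.0, 27.0))

// Mean squared error: average of the squared residuals.
val mse = valuesAndPreds.map { case (label, pred) =>
  math.pow(label - pred, 2)
}.sum / valuesAndPreds.size
```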
30. Machine Learning on Spark
• MLlib
  – Goal: make machine learning scalable
  – Growing collection of algorithms: regression, classification, clustering, recommendation, frequent itemsets, linear algebra and statistical primitives, ...
• ML Pipelines
  – Goal: move beyond a list of algorithms; make ML practical
  – Support ML workflows, e.g. load data -> extract features -> train model -> evaluate
• IBM's SystemML
  – Declarative machine learning using an R-like language
  – Radically simplifies algorithm development
  – Recently open-sourced, being integrated with Spark
31. Spark in Action (contd.)
What is the shortest path to travel from Maui to Ithaca?
____________________________________________
Can Spark help with this graph query too?
Yes!
Can we do that without leaving my Spark environment?
Yes!
32. Spark in Action (contd.)
What is the shortest path to travel from Maui to Ithaca?
____________________________________________
Using Spark, we can turn the flight data into a schedule "graph":
  – Vertices = Cities, Edges = Flights
[Graph: airports SJC, DFW, ORD, LAX, ITH, EWR, OGG connected by flight edges]
33. Spark GraphX
• GraphX is an API to
  – Represent property graphs: Graph[Vertex, Edge]
  – Manipulate graphs to yield subgraphs or other data
  – View data as both graphs and collections
  – Write custom iterative graph algorithms using the Pregel API

Tight RDD <-> Graph API interoperability
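GraphX itself needs a Spark cluster, so here is the shortest-path idea sketched as a plain-Scala breadth-first search over a toy flight graph using the airports on the slide (the routes are illustrative, not real schedules; in GraphX you would reach for the ShortestPaths library or the Pregel API).

```scala
// Illustrative edges: departure airport -> arrival airport.
val flights = Seq(
  "OGG" -> "LAX", "OGG" -> "SJC", "LAX" -> "DFW", "SJC" -> "ORD",
  "DFW" -> "ORD", "ORD" -> "EWR", "EWR" -> "ITH", "ORD" -> "ITH"
)
val adj = flights.groupMap(_._1)(_._2) // adjacency list per airport

def shortestPath(src: String, dst: String): Option[List[String]] = {
  // Breadth-first search: the first time we reach dst, the path has the fewest hops.
  @annotation.tailrec
  def loop(frontier: Vector[List[String]], seen: Set[String]): Option[List[String]] =
    frontier match {
      case (path @ (city :: _)) +: rest =>
        if (city == dst) Some(path.reverse)
        else {
          val next = adj.getOrElse(city, Seq.empty).filterNot(seen)
          loop(rest ++ next.map(_ :: path), seen ++ next)
        }
      case _ => None
    }
  loop(Vector(List(src)), Set(src))
}

val route = shortestPath("OGG", "ITH") // Maui (OGG) to Ithaca (ITH)
```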
34. Spark in Action (contd.)
At any given moment, there are hundreds of flights in the sky, all
generating streaming data. Sites such as flightaware.com make
this data available to consumers.
____________________________________________
Can we use Spark to analyze this data in motion and possibly
correlate it with historical trends?
Yes!
Can we do that all inside my Spark environment?
Yes!
35. Spark Streaming
• Scalable and fault-tolerant
• Micro-batching model
  – Input data split into batches based on time intervals
  – Batches presented as RDDs
  – RDD, DataFrame/SQL and GraphX APIs available to streams

Input sources: Kafka, Flume, Kinesis, Twitter, ...
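The micro-batching model can be sketched in plain Scala: events carry a timestamp, and the engine groups them into fixed-width time batches (in Spark Streaming, each batch becomes an RDD). The events and the 1-second interval below are made up for illustration.

```scala
// A hypothetical position report from a flight in the air.
case class Event(timeMs: Long, flight: String)

val batchMs = 1000L // batch interval: 1 second
val events = Seq(
  Event(100, "UA620"), Event(900, "DL282"),
  Event(1200, "CO780"), Event(1950, "UA675"),
  Event(2400, "DL417")
)

// Assign each event to the batch containing its timestamp.
val batches: Map[Long, Seq[String]] =
  events.groupMap(e => e.timeMs / batchMs)(_.flight)
```

Each key is a batch index; everything that arrived in that interval is processed together.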
38. Where does Spark fit in a Hadoop World?
Spark is an alternate (and faster) compute engine in your Hadoop stack: it sits alongside MapReduce, on top of cluster management (YARN) and storage (HDFS).
• Consider Spark for new workloads
• Carefully evaluate porting existing workloads
• If it ain't broken, don't fix it!
39. Spark @ IBM
Enhance it! Distribute it! Exploit it!
• Spark Technology Center @ SF
• Shipping with BigInsights
• Spark as a Service
• Inside our products (Next Gen Analytics, +++)
40. IBM has made a significant investment in Spark
IBM Spark Technology Center, San Francisco
• Growing pool of contributors
• 300+ inventors
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem
http://spark.tc
41. IBM Analytics Platform – Built on Spark. Hybrid. Trusted.

[Diagram: Spark as the "Analytics Operating System" of the IBM Analytics Platform, connecting data sources (RDBMS, object stores, NoSQL engines, document stores) and data management layers (Hadoop & NoSQL, content management, data warehousing, information integration & governance) to analytics capabilities (discovery & exploration, predictive, prescriptive and content analytics, business intelligence, machine learning), on premises and on cloud. Data at rest & in motion, inside & outside the firewall, structured & unstructured. Business priorities: predict the future for the business; delight customers by understanding them better; derive business value from unstructured content. Data-centric priorities: logical data warehouse, data reservoir, fluid data layer.]
42. Want to learn more about Spark?
______________________________________________
Reference slides follow.
44. Notices and Disclaimers (cont'd)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
46. We Value Your Feedback!
Don’t forget to submit your Insight session and speaker
feedback! Your feedback is very important to us – we use it
to continually improve the conference.
Access your surveys at insight2015survey.com to quickly
submit your surveys from your smartphone, laptop or
conference kiosk.
48. How deep do you want to go?

Intro – What is Spark? How does it relate to Hadoop? When would you use it? (1-2 hours)
Basic – Understand basic technology and write simple programs. (1-2 days)
Intermediate – Start writing complex Spark programs even as you understand operational aspects. (5-15 days, to weeks and months)
Expert – Become a Spark Black Belt! Know Spark inside out. (Months to years)
49. Intro Spark
Go through these additional presentations to understand the value of Spark. These speakers also attempt to differentiate Spark from Hadoop, and enumerate its comparative strengths. (Not much code here.)

# Turning Data into Value, Ion Stoica, Spark Summit 2013. Video & slides, 25 mins.
https://spark-summit.org/2013/talk/turning-data-into-value
# Spark: What's in it for your business?, Adarsh Pannu. (This presentation itself ☺)
# How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei Zaharia. Video & slides, 12 mins.
http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057
# Spark Fundamentals I (Lesson 1 only), Big Data University. <20 mins.
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
50. Basic Spark
# Pick up some Scala through this article co-authored
by Scala’s creator, Martin Odersky. Link
http://www.artima.com/scalazine/articles/steps.html
Estimated time: 2 hours
51. Basic Spark (contd.)
# Do these two courses. They cover Spark basics and include a
certification. You can use the supplied Docker images for all other
labs.
7 hours
52. Basic Spark (contd.)
# Go to spark.apache.org and study the Overview and the
Spark Programming Guide. Many online courses borrow
liberally from this material. Information on this site is
updated with every new Spark release.
Estimated 7-8 hours.
53. Intermediate Spark
# Stay at spark.apache.org. Go through the component specific
Programming Guides as well as the sections on Deploying and More.
Browse the Spark API as needed.
Estimated time 3-5 days and more.
54. Intermediate Spark (contd.)
• Learn about the operational aspects of Spark:
# Advanced Apache Spark (DevOps), 6 hours – EXCELLENT!
  Video: https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
# Tuning and Debugging Spark, 48 mins
  Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
• Gain a high-level understanding of Spark architecture:
# Introduction to AmpLab Spark Internals, Matei Zaharia, 1 hr 15 mins
Video https://www.youtube.com/watch?v=49Hr5xZyTEA
# A Deeper Understanding of Spark Internals, Aaron Davidson, 44 mins
  Video: https://www.youtube.com/watch?v=dmL0N3qfSc8
  PDF: https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
55. Intermediate Spark (contd.)
• Experiment, experiment, experiment ...
# Setup your personal 3-4 node cluster
# Download some “open” data. E.g. “airline” data on
stat-computing.org/dataexpo/2009/
# Write some code, make it run, see how it performs, tune it, trouble-shoot it
# Experiment with different deployment modes (Standalone + YARN)
# Play with different configuration knobs, check out dashboards, etc.
# Explore all subcomponents (especially Core, SQL, MLLib)
56. Advanced Spark: Original Papers
Read the original academic papers:
# Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia, et al.
# Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia, et al.
# GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin, et al.
# Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al.
57. Advanced Spark: Enhance your Scala skills
# Use this book by Odersky as your primary Scala text. It is excellent, but it isn't meant to give you a quick start – it's deep stuff.
# Excellent MOOC by Odersky. Some of the material is meant for CS majors. Highly recommended for STC developers. (35+ hours)
58. Advanced Spark: Browse Conference Proceedings
Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don't forget to check out the customer stories. That is how we learn about enablement opportunities and challenges, and in some cases, we can see through the Spark hype ☺

100+ hours of FREE videos and associated PDFs are available on spark-summit.org. You don't even have to pay the conference fee! Go back in time and "attend" these conferences!
59. Advanced Spark: Browse YouTube Videos
YouTube is full of training videos, some good, others not so much. These are the only channels you need to watch, though. There is a lot of repetition in the material, and some of the videos are from the conferences mentioned earlier.
60. Advanced Spark: Check out these books
The first provides a good overview of Spark, though much of the material is also available through other sources previously mentioned. The second covers concrete statistical analysis / machine learning use cases, along with the Spark APIs and MLlib. Highly recommended for data scientists.
61. Advanced Spark: Yes ... read the code
Even if you don't intend to contribute to Spark, there are a ton of valuable comments in the code that provide insights into Spark's design, and these will help you write better Spark applications. Don't be shy! Go to github.com/apache/spark and check it out.