© 2015 IBM Corporation
DSK-3572
Apache Spark:
What's in It for Your Business?
Adarsh Pannu
Senior Technical Staff Member
IBM Analytics Platform
Outline
•  Spark Overview
•  Spark Stack
•  Spark in Action
•  Spark vs MapReduce
•  Spark @ IBM
1
Started in 2009 at UC
Berkeley’s AMP Lab
_______________________
6 years
7+ components
129+ third-party packages
500+ contributors
400,000+ lines of code
3
1 platform
Apache Spark
____________________________________________
The Analytics Operating System
1.  A general-purpose data processing (compute) engine
2.  A technology that interoperates well with Apache Hadoop
3.  A big data ecosystem
4
Why Spark?
5
Expressiveness + Speed
•  Speed: 2-10x faster on disk, 100x faster in-memory (vs. Hadoop MapReduce)
•  Expressiveness: more capabilities, more productivity. More function, less code
Spark Stack
6
Spark Core
SQL | Streaming | MLLib | GraphX
DataFrames | ML Pipelines
Spark in the real world
7
Spark in Action
Spark is essentially a collection of APIs.
The best way to appreciate Spark's value is
through examples. Let's do so using a
real-world dataset:
On-Time Arrival Performance: a record
of every US airline flight since the 1980s.
Year, Month, DayofMonth
UniqueCarrier,FlightNum
DepTime, ArrTime, ActualElapsedTime
ArrDelay, DepDelay
Origin, Dest, Distance
Cancelled, ...
“Where, When, How Long? ...”
8
Spark in Action (contd.)
9
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,
UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,
ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,
CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,
SecurityDelay,LateAircraftDelay
2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,...
Highlighted fields: Year, Month, DayofMonth, UniqueCarrier, FlightNum,
DepTime, ActualElapsedTime, Origin, Dest, Distance
Spark in Action (contd.)
Which airports had the most flight cancellations?
_________________________________________
Let’s compute this using Spark “Core”
10
11
2004,2,5...EWR...1 2004,2,5...HNL...0 2004,2,5...ORD...1 2004,2,5...ORD...1
(EWR, 1) (HNL,0) (ORD, 1) (ORD, 1)
(EWR, 1) (ORD, 1) (ORD, 1)
(EWR, 1) (ORD, 2)
(ORD, 2) (EWR, 1)
[ (ORD, 2), (EWR, 1) ]
sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
  map(row => (row.split(",")(16), row.split(",")(21).toInt)).  // (Origin, Cancelled)
  filter(row => row._2 == 1).                                  // keep cancelled flights only
  reduceByKey((a, b) => a + b).                                // count per airport
  sortBy(_._2, ascending = false).
  collect
12
Not bad, eh? Do try this at
home ... just not on
MapReduce!
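The Scala pipeline above maps cleanly onto plain collections. Here is a plain-Python sketch of the same logic (no Spark required), using a few made-up rows; the column positions (16 for Origin, 21 for Cancelled) match the dataset header shown earlier.

```python
# Plain-Python sketch of the cancellation-count pipeline, on made-up rows.
from collections import defaultdict

rows = [
    ",".join(["x"] * 16 + ["EWR"] + ["x"] * 4 + ["1"]),
    ",".join(["x"] * 16 + ["HNL"] + ["x"] * 4 + ["0"]),
    ",".join(["x"] * 16 + ["ORD"] + ["x"] * 4 + ["1"]),
    ",".join(["x"] * 16 + ["ORD"] + ["x"] * 4 + ["1"]),
]

# map: (Origin, Cancelled)
pairs = [(r.split(",")[16], int(r.split(",")[21])) for r in rows]
# filter: keep cancelled flights only
cancelled = [p for p in pairs if p[1] == 1]
# reduceByKey: sum counts per airport
counts = defaultdict(int)
for origin, n in cancelled:
    counts[origin] += n
# sortBy count, descending
result = sorted(counts.items(), key=lambda kv: -kv[1])
print(result)  # [('ORD', 2), ('EWR', 1)]
```

The result matches the step-by-step trace on the previous slide: ORD with 2 cancellations, EWR with 1.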
Spark in Action (contd.)
Which airports had the most flight cancellations?
13
Answer: In 2008, it was:
ORD
ATL
DFW
LGA
EWR
We’re not surprised, are we? ☺
Spark builds Data Pipelines (DAGs)
14
Directed (arrows)
Acyclic (no loops)
Graph
•  Nodes in the pipeline are "RDDs"
•  Leaf nodes represent base datasets
•  Intermediate nodes represent computations
Resilient Distributed Datasets (RDD)
15
CO780, IAH, MCI
CO683, IAH, MSY
CO1707, TPA, IAH
...
UA620, SJC, ORD
UA675, ORD, SMF
UA676, ORD, LGA
...
DL282, ATL, CVG
DL2128, CHS, ATL
DL2592, PBI, LGA
DL417, FLL, ATL
...
Immutable collection of objects
•  Distributed across machines
•  Can be operated on in parallel
•  Can hold any kind of data:
   Hadoop datasets, parallelized Scala collections, RDBMS or NoSQL, ...
•  Can be cached; can recover from failures
Spark Execution
16
•  Data pipelines are broken up into "stages"
•  Each stage is processed in parallel (across partitions)
•  At stage boundaries, data is "shuffled" or returned to the client
•  Of course, all of this is done under the covers for you!
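The stage-splitting idea can be sketched in a few lines. This is a toy illustration, not Spark's actual scheduler: narrow operations chain together within a stage, and each wide (shuffle-requiring) operation here simply starts a new stage. The operation names and the wide/narrow classification are illustrative.

```python
# Toy sketch: split a pipeline into stages at shuffle (wide) boundaries.
WIDE_OPS = {"reduceByKey", "sortBy", "groupByKey", "join"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # close the stage at the shuffle boundary
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

pipeline = ["textFile", "map", "filter", "reduceByKey", "sortBy", "collect"]
print(split_into_stages(pipeline))
# [['textFile', 'map', 'filter'], ['reduceByKey'], ['sortBy', 'collect']]
```

The cancellation-count pipeline from earlier would therefore run as three stages, with data shuffled between them.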
Wait ... this is just like MapReduce? What’s the difference?
17
Spark vs MapReduce
MapReduce: Map -> Reduce, communicating through HDFS between jobs
•  Simple pipelines, heavyweight JVMs, communication through HDFS,
   no explicit memory exploitation
Spark: Map -> Filter -> Reduce -> Join -> Sort, with local disk at shuffle boundaries
•  Complex pipelined DAGs, threads (vs. JVMs), memory and disk
   exploitation, caching, fast shuffle, and more...
RDD Operations
•  Transformations
   -  Create a new RDD (dataset) from an existing one
   -  Lazily evaluated
   map, flatMap, filter, reduceByKey, groupByKey, mapPartitions,
   join, cogroup, coalesce, union, distinct, intersection,
   sortByKey, sample, repartition, ...
•  Actions
   -  Run a computation (i.e. a job), optionally returning data to the client
   count, collect, cache, first, take, saveAsTextFile, ...
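The transformation/action split is easy to demonstrate with a tiny sketch. This is purely illustrative (not Spark's API): transformations only record a plan and return a new dataset object; the action is what actually evaluates the plan.

```python
# Illustrative mini-"RDD": transformations are lazy, the action runs the job.
class MiniRDD:
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    # Transformations: append a step to the plan, return a new MiniRDD.
    def map(self, f):
        return MiniRDD(self._data, self._plan + (("map", f),))

    def filter(self, f):
        return MiniRDD(self._data, self._plan + (("filter", f),))

    # Action: evaluate the recorded plan end-to-end.
    def collect(self):
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has been computed yet; collect() triggers the job:
print(rdd.collect())  # [20, 30, 40]
```

Laziness is what lets Spark see the whole pipeline before running it, enabling the stage-based execution described earlier.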
Spark has Rich Functional, Relational
& Iterative APIs
RDD Persistence and Caching
•  Spark can persist RDDs in several ways
   -  In-memory (a.k.a. caching)
   -  On disk
   -  Both
•  This allows Spark to avoid re-computing portions of a DAG
   -  Beneficial for repeating workloads such as machine learning
•  But you have to tell Spark which RDDs to persist (and how)
[Diagram: a DAG of RDDs (data -> cancelled flights -> airports); with the
"flights" RDD cached, the count and take(5) actions skip recomputing the
upstream steps]
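Why caching pays off can be shown with a counter. This sketch is illustrative (the function and data are made up): without persisting the intermediate result, every "action" re-runs the expensive step, just as an uncached RDD is recomputed per action.

```python
# Sketch: an expensive step is recomputed per action unless its result is kept.
compute_calls = 0

def expensive_parse(rows):
    global compute_calls
    compute_calls += 1
    return [r.upper() for r in rows]

rows = ["ewr", "ord", "ord"]

# Without caching: two "actions" trigger two computations.
n = len(expensive_parse(rows))
first = expensive_parse(rows)[0]
assert compute_calls == 2

# With "caching" (persist the intermediate result, like rdd.cache()):
compute_calls = 0
cached = expensive_parse(rows)  # computed once
n, first = len(cached), cached[0]
assert compute_calls == 1
print("recomputation avoided")
```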
Common misconceptions around Spark
“Spark requires you to load all your data into memory.”
“Spark is a replacement for Hadoop.”
“Spark is only for machine learning.”
“Spark solves global hunger and world peace.”
21
Sorting 100 TB of data
22
Hadoop in 2013
2100 machines, 50400 cores
72 mins
Spark in 2014 (Current Record)
206 virtual machines, 6952 cores
23 mins
Also sorted 1 PB in 234 mins
on similar hardware (it scales)
Source: sortbenchmark.org
OK! Let’s get back to the Spark application.
23
Spark in Action (contd.)
Which airports had the most flight cancellations?
____________________________________________
Can we compute this using SQL?
Yes!
Can we do that without leaving my Spark environment?
Yes!
24
Spark SQL Example
// Specify the table schema
val schema = StructType(...)

// Define a DataFrame using the schema
val rowRDD = flights.map(e => Row.fromSeq(e))
val flightsDF = sqlContext.createDataFrame(rowRDD, schema)

// Give the DataFrame a name
flightsDF.registerTempTable("flights")

// Use SQL!
val results = sqlContext.sql("""SELECT Origin, count(*) Cnt FROM flights
                                WHERE Cancelled = '1'
                                GROUP BY Origin ORDER BY Cnt DESC""")
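The query itself is ordinary SQL, so it can be sanity-checked against any SQL engine. Here is a sketch using Python's built-in sqlite3 on a few made-up rows (table and column names mirror the slide):

```python
# Check the slide's SQL against sqlite3, with made-up flight rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (Origin TEXT, Cancelled TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("EWR", "1"), ("HNL", "0"), ("ORD", "1"), ("ORD", "1")],
)
results = conn.execute(
    """SELECT Origin, count(*) Cnt FROM flights
       WHERE Cancelled = '1'
       GROUP BY Origin ORDER BY Cnt DESC"""
).fetchall()
print(results)  # [('ORD', 2), ('EWR', 1)]
```

Spark SQL runs the same statement, but distributed across the cluster and over the full dataset.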
25
RDD <-> DataFrame interoperability
"Ma, look! I can use SQL seamlessly with RDDs."
Spark SQL and DataFrames
•  Spark SQL
   -  SQL engine with an optimizer
   -  Not ANSI compliant
   -  Operates on DataFrames (RDDs with schema)
   -  Primary goal: write (even) less code
   -  Secondary goal: provide a way around the performance penalty
      when using non-JVM languages (e.g. Python or R)
•  DataFrame API
   -  Programmatic API to build SQL-ish constructs
   -  Shares machinery with the SQL engine
26
These allow you to create RDD DAGs using a higher-level language / API
Spark in Action (contd.)
Which flights are likely to be delayed? And by how much?
____________________________________________
Can we build a predictive model for this?
Yes!
Again, can we do that without leaving my Spark environment?
Yes!
27
Spark MLLib Example
val trainRDD = sqlContext.sql("""SELECT ArrDelay, DepDelay, ArrTime,
    DepTime, Distance FROM flights""").rdd.map {
  case Row(label: String, features @ _*) =>  // Build the training set
    LabeledPoint(toDouble(label),
      Vectors.dense(features.map(toDouble(_)).toArray))
}

// Train a decision tree (regression) model to predict flight delays
val model = DecisionTree.train(trainRDD, Regression, Variance, maxDepth = 5)

// Evaluate the model on the training set
val valuesAndPreds = trainRDD.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
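The slide stops at collecting (label, prediction) pairs; a typical next step is to reduce them to a single error metric such as RMSE. A plain-Python sketch over made-up pairs (the numbers are illustrative, not model output):

```python
# RMSE over (label, prediction) pairs, as the evaluation step would compute.
import math

values_and_preds = [(10.0, 12.0), (0.0, -1.0), (30.0, 27.0)]  # (label, prediction)

mse = sum((label - pred) ** 2 for label, pred in values_and_preds) / len(values_and_preds)
rmse = math.sqrt(mse)
print(round(rmse, 3))  # 2.16
```

In Spark, the same sum and count would be produced by `map` and `reduce` actions over the `valuesAndPreds` RDD.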
28
Tight SQL / DataFrame / RDD -> ML integration
Machine Learning on Spark
•  MLLib
   -  Goal: make machine learning scalable
   -  Growing collection of algorithms: regression, classification,
      clustering, recommendation, frequent itemsets, linear algebra
      and statistical primitives, ...
•  ML Pipelines
   -  Goal: move beyond a list of algorithms; make ML practical
   -  Support ML workflows
      E.g.: Load data -> Extract features -> Train model -> Evaluate
•  IBM's SystemML
   -  Declarative machine learning using an R-like language
   -  Radically simplifies algorithm development
   -  Recently open-sourced, being integrated with Spark
29
Spark in Action (contd.)
What is the shortest path to travel from Maui to Ithaca?
____________________________________________
Can Spark help with this graph query too?
Yes!
Can we do that without leaving my Spark environment?
Yes!
30
[Diagram: flight graph over SJC, DFW, ORD, LAX, ITH, EWR, OGG]
Spark in Action (contd.)
What is the shortest path to travel from Maui to Ithaca?
____________________________________________
Using Spark, we can turn the flight data into a schedule "graph"
-  Vertices = cities, edges = flights
31
Spark GraphX
•  GraphX is an API to
   -  Represent property graphs: Graph[Vertex, Edge]
   -  Manipulate graphs to yield subgraphs or other data
   -  View data as both graphs and collections
   -  Write custom iterative graph algorithms using the Pregel API
32
Tight RDD <-> Graph API
interoperability
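The shortest-path question (fewest hops from Maui to Ithaca) can be sketched with a plain-Python breadth-first search over a small hand-made route graph; the routes below are illustrative, and GraphX's Pregel API could express the same search in distributed form.

```python
# BFS shortest path (fewest hops) over a made-up flight graph.
from collections import deque

flights = {  # city -> directly reachable cities (illustrative routes)
    "OGG": ["LAX", "SJC"],
    "LAX": ["ORD", "DFW"],
    "SJC": ["ORD"],
    "ORD": ["ITH", "EWR"],
    "DFW": ["EWR"],
    "EWR": ["ITH"],
}

def shortest_path(graph, src, dst):
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route

print(shortest_path(flights, "OGG", "ITH"))  # ['OGG', 'LAX', 'ORD', 'ITH']
```

With real data, edge weights (flight times, delays) would turn this into a weighted shortest-path problem, which GraphX also supports.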
Spark in Action (contd.)
At any given moment, there are hundreds of flights in the sky, all
generating streaming data. Sites such as flightaware.com make
this data available to consumers.
____________________________________________
Can we use Spark to analyze this data in motion and possibly
correlate it with historical trends?
Yes!
Can we do that all inside my Spark environment?
Yes!
33
Spark Streaming
•  Scalable and fault-tolerant
•  Micro-batching model
   -  Input data is split into batches based on time intervals
   -  Batches are presented as RDDs
   -  RDD, DataFrame/SQL and GraphX APIs are available on streams
Sources: Kafka, Flume, Kinesis, Twitter, ...
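The micro-batching model itself is simple to sketch: timestamped events are grouped into fixed time intervals, and each batch is then processed as one dataset (in Spark Streaming, as an RDD). The events and the 10-second interval below are made up for illustration.

```python
# Sketch of micro-batching: bucket timestamped events into fixed intervals.
from collections import defaultdict

events = [(1, "a"), (4, "b"), (11, "c"), (13, "d"), (25, "e")]  # (t_seconds, payload)
INTERVAL = 10

batches = defaultdict(list)
for t, payload in events:
    batches[t // INTERVAL].append(payload)  # assign each event to its interval

for batch_id in sorted(batches):
    print(batch_id, batches[batch_id])
```

Because each batch is just an RDD, the full RDD/SQL/GraphX toolchain applies unchanged to streaming data.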
Interoperability within the Spark Stack:
A sample scenario
35
[Diagram: data at rest and data in motion both feeding a shared model on Spark Core]
36
1 platform
Apache Spark
____________________________________________
The Analytics Operating System
Where does Spark fit in a Hadoop World?
37
Cluster Management (YARN)
Compute: MapReduce | Spark
Storage (HDFS)
•  Consider Spark for new workloads
•  Carefully evaluate porting existing workloads
•  If it ain't broken, don't fix it!
Spark is an alternate (and faster) compute engine in your Hadoop stack
Spark @ IBM
Enhance it! Spark Technology Center @ SF
Distribute it! Shipping with BigInsights; Spark as a Service
Exploit it! Inside our products (Next Gen Analytics, +++)
IBM has made a significant investment in Spark
39
IBM Spark Technology Center
San Francisco
•  Growing pool of contributors
•  300+ inventors
•  Contributed SystemML
•  Founding member of AMPLab
•  Partnerships in the ecosystem
http://spark.tc
40
IBM ANALYTICS PLATFORM
Built on Spark. Hybrid. Trusted.
[Diagram: data sources (RDBMS, object stores, NoSQL engines, document
stores) feed data management (Hadoop & NoSQL, content management, data
warehousing) under Information Integration & Governance; on top sit
Discovery & Exploration, Prescriptive, Predictive and Content Analytics,
and Business Intelligence, with Spark as the Analytics Operating System
and machine learning on premises and on cloud. Scope: data at rest and
in motion, inside and outside the firewall, structured and unstructured.
Business priorities: predict the future for the business; delight
customers by understanding them better; derive business value from
unstructured content. Data-centric priorities: logical data warehouse,
data reservoir, fluid data layer.]
Want to learn more about Spark?
______________________________________________
Reference slides follow.
41
42
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or
services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.
43
Notices and Disclaimers (cont'd)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
•  IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
© 2015 IBM Corporation
Thank You
We Value Your Feedback!
Don’t forget to submit your Insight session and speaker
feedback! Your feedback is very important to us – we use it
to continually improve the conference.
Access your surveys at insight2015survey.com to quickly
submit your surveys from your smartphone, laptop or
conference kiosk.
45
Reference slides
46
How deep do you want to go?
•  Intro: What is Spark? How does it relate to Hadoop? When would you
   use it? (1-2 hours)
•  Basic: Understand basic technology and write simple programs
   (1-2 days)
•  Intermediate: Start writing complex Spark programs even as you
   understand operational aspects (5-15 days, to weeks and months)
•  Expert: Become a Spark Black Belt! Know Spark inside out.
   (Months to years)
Intro Spark
Go through these additional presentations to understand the value of Spark. These
speakers also attempt to differentiate Spark from Hadoop, and enumerate its comparative
strengths. (Not much code here)
#  Turning Data into Value, Ion Stoica, Spark Summit 2013 Video & Slides 25 mins
https://spark-summit.org/2013/talk/turning-data-into-value
#  Spark: What's in It for Your Business?, Adarsh Pannu
(This presentation itself ☺)
#  How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei
Zaharia, Video & Slides 12 mins
http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057
#  Spark Fundamentals I (Lesson 1 only), Big Data University <20 mins
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
Basic Spark
# Pick up some Scala through this article co-authored
by Scala’s creator, Martin Odersky. Link
http://www.artima.com/scalazine/articles/steps.html
Estimated time: 2 hours
Basic Spark (contd.)
#  Do these two courses. They cover Spark basics and include a
certification. You can use the supplied Docker images for all other
labs.
7 hours
Basic Spark (contd.)
#  Go to spark.apache.org and study the Overview and the
Spark Programming Guide. Many online courses borrow
liberally from this material. Information on this site is
updated with every new Spark release.
Estimated 7-8 hours.
Intermediate Spark
#  Stay at spark.apache.org. Go through the component specific
Programming Guides as well as the sections on Deploying and More.
Browse the Spark API as needed.
Estimated time 3-5 days and more.
Intermediate Spark (contd.)
•  Learn about the operational aspects of Spark:
#  Advanced Apache Spark (DevOps), 6 hours. EXCELLENT!
Video https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
Slides https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
#  Tuning and Debugging Spark Slides 48 mins
Video https://www.youtube.com/watch?v=kkOG_aJ9KjQ
•  Gain a high-level understanding of Spark architecture:
#  Introduction to AmpLab Spark Internals, Matei Zaharia, 1 hr 15 mins
Video https://www.youtube.com/watch?v=49Hr5xZyTEA
#  A Deeper Understanding of Spark Internals, Aaron Davidson, 44 mins
Video https://www.youtube.com/watch?v=dmL0N3qfSc8
PDF https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-
Understanding-of-Spark-Internals-Aaron-Davidson.pdf
Intermediate Spark (contd.)
•  Experiment, experiment, experiment ...
#  Setup your personal 3-4 node cluster
#  Download some “open” data. E.g. “airline” data on
stat-computing.org/dataexpo/2009/
#  Write some code, make it run, see how it performs, tune it, trouble-shoot it
#  Experiment with different deployment modes (Standalone + YARN)
#  Play with different configuration knobs, check out dashboards, etc.
#  Explore all subcomponents (especially Core, SQL, MLLib)
Advanced Spark: Original Papers
Read the original academic papers
#  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
   In-Memory Cluster Computing, Matei Zaharia, et al.
#  Discretized Streams: An Efficient and Fault-Tolerant Model for
   Stream Processing on Large Clusters, Matei Zaharia, et al.
#  GraphX: A Resilient Distributed Graph System on Spark,
   Reynold S. Xin, et al.
#  Spark SQL: Relational Data Processing in Spark,
   Michael Armbrust, et al.
Advanced Spark: Enhance your Scala skills
This book by
Odersky is excellent
but it isn’t meant to
give you a quick
start. It’s deep stuff.
#  Use this as your
primary Scala text
#  Excellent MooC by Odersky. Some of
the material is meant for CS majors.
Highly recommended for STC
developers.
35+ hours
Advanced Spark: Browse Conference Proceedings
Spark Summits cover technology and use cases. Technology is also covered in
various other places so you could consider skipping those tracks. Don’t forget to
check out the customer stories. That is how we learn about enablement
opportunities and challenges, and in some cases, we can see through the Spark
hype ☺
100+ hours of FREE videos and associated PDFs available on
spark-summit.org. You don't even have to pay the conference fee! Go back
in time and "attend" these conferences!
Advanced Spark: Browse YouTube Videos
YouTube is full of training videos, some good, others not so
much. These are the only channels you need to watch, though.
There is a lot of repetition in the material, and some of the
videos are from the conferences mentioned earlier.
Advanced Spark: Check out these books
Provides a good overview of Spark;
much of the material is also
available through other sources
previously mentioned.
Covers concrete statistical analysis /
machine learning use cases. Covers
Spark APIs and MLLib. Highly
recommended for data scientists.
Advanced Spark: Yes ... read the code
Even if you don't intend to contribute to Spark, there are a ton of valuable
comments in the code that provide insights into Spark's design, and these will
help you write better Spark applications. Don't be shy! Go to
github.com/apache/spark and check it out.

More Related Content

What's hot

Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
Akhil Das
 

What's hot (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 

Viewers also liked

Operational Intelligence Using Hadoop
Operational Intelligence Using HadoopOperational Intelligence Using Hadoop
Operational Intelligence Using Hadoop
DataWorks Summit
 

Viewers also liked (20)

Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Apache Spark: Coming up to speed
Apache Spark: Coming up to speedApache Spark: Coming up to speed
Apache Spark: Coming up to speed
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Exploring language classification with spark and the spark notebook
Exploring language classification with spark and the spark notebookExploring language classification with spark and the spark notebook
Exploring language classification with spark and the spark notebook
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Apache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream AnalyticsApache Spark and Oracle Stream Analytics
Apache Spark and Oracle Stream Analytics
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark Streaming - The simple way
Spark Streaming - The simple waySpark Streaming - The simple way
Spark Streaming - The simple way
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Operational Intelligence Using Hadoop
Operational Intelligence Using HadoopOperational Intelligence Using Hadoop
Operational Intelligence Using Hadoop
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 

Similar to Apache Spark: The Analytics Operating System

Similar to Apache Spark: The Analytics Operating System (20)

Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...


Apache Spark: The Analytics Operating System

  • 1. © 2015 IBM Corporation DSK-3572 Apache Spark: What's in It for Your Business? Adarsh Pannu Senior Technical Staff Member IBM Analytics Platform
  • 2. Outline •  Spark Overview •  Spark Stack •  Spark in Action •  Spark vs MapReduce •  Spark @ IBM 1
  • 3. Started in 2009 at UC Berkeley’s AMP Lab: 6 years, 7+ components, 129+ third-party packages, 500+ contributors, 400,000+ lines of code.
  • 5. 1.  A general-purpose data processing (compute) engine 2.  A technology that interoperates well with Apache Hadoop 3.  A big data ecosystem 4
  • 6. Why Spark? Expressiveness and speed: 2-10x faster on disk, 100x faster in-memory (vs. Hadoop MapReduce); capabilities and productivity: more function, less code.
  • 7. Spark Stack: Spark Core; SQL; Streaming; MLLib; GraphX; ML Pipelines; DataFrames.
  • 8. Spark in the real world 7
  • 9. Spark in Action. Spark is essentially a collection of APIs, and the best way to appreciate its value is via examples. Let’s do so using a real-world dataset: On-Time Arrival Performance, a record of every US airline flight since the 1980s. Fields include Year, Month, DayofMonth; UniqueCarrier, FlightNum; DepTime, ArrTime, ActualElapsedTime; ArrDelay, DepDelay; Origin, Dest, Distance; Cancelled, ... “Where, When, How Long? ...”
  • 10. Spark in Action (contd.) The CSV schema: Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay. A sample row: 2004,2,5,4,1521,1530,1837,1838,CO,65,N67158,376,368,326,-1,-9,EWR,LAX,2454,... Fields of interest: UniqueCarrier, Origin, Dest, Year, Month, DayofMonth, FlightNum, ActualElapsedTime, Distance, DepTime.
  • 11. Spark in Action (contd.) Which airports had the most flight cancellations? Let’s compute this using Spark “Core”.
  • 12. Tracing the pipeline on sample rows: 2004,2,5...EWR...1 / 2004,2,5...HNL...0 / 2004,2,5...ORD...1 / 2004,2,5...ORD...1 → map → (EWR, 1) (HNL, 0) (ORD, 1) (ORD, 1) → filter → (EWR, 1) (ORD, 1) (ORD, 1) → reduceByKey → (ORD, 2) (EWR, 1) → sortBy → [ (ORD, 2), (EWR, 1) ]
    sc.textFile("hdfs://localhost:9000/user/Adarsh/routes.csv").
      map(row => (row.split(",")(16), row.split(",")(21).toInt)). // (Origin, Cancelled)
      filter(row => row._2 == 1).       // keep cancelled flights only
      reduceByKey((a, b) => a + b).     // count cancellations per origin
      sortBy(_._2, ascending = false).  // most cancellations first
      collect
  • 13. Not bad, eh? Do try this at home ... just not on MapReduce!
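The same map → filter → reduceByKey → sortBy logic can be tried without a cluster. The sketch below runs on plain Scala collections (not Spark itself); the (origin, cancelled) pairs are invented stand-ins for columns 16 and 21 of the CSV:

```scala
// The RDD pipeline's shape, on plain Scala collections: no cluster needed.
// The (origin, cancelled) pairs are made-up stand-ins for the parsed CSV rows.
val flights = List(("EWR", 1), ("HNL", 0), ("ORD", 1), ("ORD", 1))

val cancellations = flights
  .filter { case (_, cancelled) => cancelled == 1 }            // cancelled flights only
  .groupBy { case (origin, _) => origin }                      // like reduceByKey: group...
  .map { case (origin, rows) => (origin, rows.map(_._2).sum) } // ...then sum per key
  .toList
  .sortBy { case (_, count) => -count }                        // most cancellations first
```

The Spark version distributes exactly this computation across partitions; the local version makes the data flow easy to inspect.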
  • 14. Spark in Action (contd.) Which airports had the most flight cancellations? 13 Answer: In 2008, it was: ORD ATL DFW LGA EWR We’re not surprised, are we? ☺
  • 15. Spark builds Data Pipelines (DAGs) 14 Directed (arrows) Acyclic (no loops) Graph •  Nodes in the pipeline are “RDDs” •  Leaf nodes represent base datasets. •  Intermediate nodes represent computations
  • 16. Resilient Distributed Datasets (RDDs): an immutable collection of objects, distributed across machines (e.g. flight records split into partitions such as CO780, IAH, MCI / UA620, SJC, ORD / DL282, ATL, CVG, ...), that can be operated on in parallel. An RDD can hold any kind of data: Hadoop datasets, parallelized Scala collections, RDBMS or NoSQL sources, ... RDDs can be cached and can recover from failures.
  • 17. Spark Execution: data pipelines are broken up into “stages”; each stage is processed in parallel (across partitions); at stage boundaries, data is “shuffled” or returned to the client. Of course, all of this is done under the covers for you!
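The shuffle at a stage boundary can be pictured with a toy model (this is an illustration in plain Scala, not Spark's actual implementation): records are re-bucketed by a hash of their key, so every occurrence of a given key lands in exactly one partition, ready for per-partition aggregation in the next stage.

```scala
// A toy "shuffle": repartition records by key hash so each key's records
// end up together, then reduce each key independently. Data is invented.
val records = List(("ORD", 1), ("EWR", 1), ("ORD", 1), ("HNL", 0), ("EWR", 1))
val numPartitions = 2

// stage boundary: bucket by hash of the key
val partitions: Map[Int, List[(String, Int)]] =
  records.groupBy { case (key, _) => math.abs(key.hashCode) % numPartitions }

// next stage: each partition can now be reduced on its own, e.g. sum per key
val reduced: Map[String, Int] =
  partitions.values.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```
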
  • 18. Wait ... this is just like MapReduce? What’s the difference? 17
  • 19. Spark vs MapReduce. MapReduce: simple Map → Reduce pipelines over HDFS, heavyweight JVMs, communication through HDFS, no explicit memory exploitation. Spark: complex pipelined DAGs (map, filter, reduce, join, sort, ...), threads (vs. JVMs), memory and disk exploitation, caching, fast shuffle, and more...
  • 20. RDD Operations. Transformations create a new RDD (dataset) from an existing one and are lazily evaluated: map, flatMap, filter, reduceByKey, groupByKey, mapPartitions, join, cogroup, coalesce, union, distinct, intersection, sortByKey, sample, repartition, ... Actions run a computation (i.e., a job), optionally returning data to the client: count, collect, cache, first, take, saveAsTextFile, ... Spark has rich functional, relational & iterative APIs.
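Lazy evaluation, the key property of transformations, can be felt in plain Scala using the language's own lazy views (an analogy, not Spark's machinery): nothing in the pipeline runs until a terminal step, an "action" in Spark terms, forces it.

```scala
import scala.collection.mutable.ListBuffer

// Laziness illustrated with Scala views: the map/filter steps below do no
// work until a terminal step forces evaluation, just like RDD transformations.
val evaluated = ListBuffer.empty[Int]
val pipeline = (1 to 10).view
  .map { x => evaluated += x; x * 2 } // "transformation": recorded, not run
  .filter(_ > 10)                     // also lazy

val nothingRanYet = evaluated.isEmpty // true: no element has been touched
val result = pipeline.toList          // "action": runs the whole pipeline
```
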
  • 21. RDD Persistence and Caching. Spark can persist RDDs in several ways: in-memory (a.k.a. caching), on disk, or both. This allows Spark to avoid re-computing portions of a DAG, which is beneficial for repeating workloads such as machine learning. But you have to tell Spark which RDDs to persist, and how. (Diagram: in the data → cancelled flights → airports pipeline, a second action re-uses the cached RDD and the earlier stages are skipped.)
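A crude single-machine analogy for why persistence matters (again plain Scala, not Spark): an unpersisted result, like a `def`, is recomputed by every action that uses it; a persisted one, like a `lazy val`, is computed once and reused.

```scala
import scala.collection.mutable.ListBuffer

// Analogy only: `def` ~ an uncached RDD (recomputed per action),
// `lazy val` ~ a cached RDD (computed once, then reused).
val runs = ListBuffer.empty[String]

def uncached: Seq[Int]     = { runs += "uncached"; (1 to 5).map(_ * 2) }
lazy val cached: Seq[Int]  = { runs += "cached";   (1 to 5).map(_ * 2) }

val a = uncached.sum // recomputes
val b = uncached.max // recomputes again
val c = cached.sum   // computes once
val d = cached.max   // reuses the "cached" result
```
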
  • 22. Common misconceptions around Spark “Spark requires you to load all your data into memory.” “Spark is a replacement for Hadoop.” “Spark is only for machine learning.” “Spark solves global hunger and world peace.” 21
  • 23. Sorting 100 TB of data 22 Hadoop in 2013 2100 machines, 50400 cores 72 mins Spark in 2014 (Current Record) 206 virtual machines, 6952 cores 23 mins Also sorted 1 PB in 234 mins on similar hardware (it scales) Source: sortbenchmark.org
  • 24. OK! Let’s get back to the Spark application. 23
  • 25. Spark in Action (contd.) Which airports had the most flight cancellations? Can we compute this using SQL? Yes! Can we do that without leaving my Spark environment? Yes!
  • 26. Spark SQL Example — RDD <-> DataFrame interoperability (Ma.. Look! I can use SQL seamlessly with RDDs):
    // Specify table schema
    val schema = StructType(...)
    // Define a data frame using the schema
    val rowRDD = flights.map(e => Row.fromSeq(e))
    val flightsDF = sqlContext.createDataFrame(rowRDD, schema)
    // Give the data frame a name
    flightsDF.registerTempTable("flights")
    // Use SQL!
    val results = sqlContext.sql("""SELECT Origin, count(*) Cnt FROM flights
      WHERE Cancelled = "1" GROUP BY Origin ORDER BY Cnt DESC""")
  • 27. Spark SQL and DataFrames. Spark SQL: a SQL engine with an optimizer; not ANSI compliant; operates on DataFrames (RDDs with schema); primary goal: write (even) less code; secondary goal: provide a way around the performance penalty of non-JVM languages (e.g. Python or R). DataFrame API: a programmatic API to build SQL-ish constructs; shares machinery with the SQL engine. Both allow you to create RDD DAGs using a higher-level language / API.
  • 28. Spark in Action (contd.) Which flights are likely to be delayed? And by how much? Can we build a predictive model for this? Yes! Again, can we do that without leaving my Spark environment? Yes!
  • 29. Spark MLLib Example — tight SQL/DataFrame/RDD -> ML integration:
    val trainRDD = sqlContext.sql("""SELECT ArrDelay, DepDelay, ArrTime,
      DepTime, Distance from flights""").rdd.map {
      case Row(label: String, features @ _*) =>
        // Build training set
        LabeledPoint(toDouble(label), Vectors.dense(features.map(toDouble(_)).toArray))
    }
    // Train a decision tree (regression) model to predict flight delays
    val model = DecisionTree.train(trainRDD, Regression, Variance, maxDepth = 5)
    // Evaluate model on training set
    val valuesAndPreds = trainRDD.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
  • 30. Machine Learning on Spark. MLLib — goal: make machine learning scalable; a growing collection of algorithms: regression, classification, clustering, recommendation, frequent itemsets, linear algebra and statistical primitives, ... ML Pipelines — goal: move beyond a list of algorithms and make ML practical; support ML workflows, e.g. load data -> extract features -> train model -> evaluate. IBM’s SystemML — declarative machine learning using an R-like language; radically simplifies algorithm development; recently open-sourced, being integrated with Spark.
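To see the flavor of the regression task in miniature, here is ordinary least squares fit by hand on toy delay points (plain Scala, not MLLib; the points and the `predictDelay` helper are invented for illustration):

```scala
// Fit arrDelay ≈ slope * depDelay + intercept by ordinary least squares.
// Toy data only: (depDelay, arrDelay) pairs are invented.
val points = List((0.0, 1.0), (10.0, 12.0), (20.0, 21.0), (30.0, 32.0))
val n = points.size
val meanX = points.map(_._1).sum / n
val meanY = points.map(_._2).sum / n
val slope = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
            points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
val intercept = meanY - slope * meanX

def predictDelay(depDelay: Double): Double = slope * depDelay + intercept
```

MLLib's regression trees do something far richer, but the workflow is the same: turn rows into (label, features), fit, then predict.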
  • 31. Spark in Action (contd.) What is the shortest path to travel from Maui to Ithaca? ____________________________________________ Can Spark help with this graph query too? Yes! Can we do that without leaving my Spark environment? Yes! 30
  • 32. Spark in Action (contd.) What is the shortest path to travel from Maui to Ithaca? Using Spark, we can turn the flight data into a schedule “graph”: vertices = cities, edges = flights. (Diagram: OGG, SJC, LAX, DFW, ORD, EWR, ITH.)
  • 33. Spark GraphX — tight RDD <-> Graph API interoperability. GraphX is an API to: represent property graphs, Graph[Vertex, Edge]; manipulate graphs to yield subgraphs or other data; view data as both graphs and collections; write custom iterative graph algorithms using the Pregel API.
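What the shortest-path query computes can be sketched as a breadth-first search over a tiny hand-made route graph (plain Scala, not the GraphX API; the routes below are invented for illustration):

```scala
import scala.collection.mutable

// Toy route graph: city -> directly reachable cities (invented data).
val routes = Map(
  "OGG" -> List("LAX", "SJC"),
  "LAX" -> List("DFW", "ORD"),
  "SJC" -> List("ORD"),
  "DFW" -> List("EWR"),
  "ORD" -> List("ITH", "EWR"),
  "EWR" -> List("ITH")
)

// BFS finds a fewest-hops itinerary; GraphX would do the equivalent at scale.
def shortestPath(from: String, to: String): List[String] = {
  val queue = mutable.Queue(List(from)) // paths to extend, shortest first
  val seen  = mutable.Set(from)
  while (queue.nonEmpty) {
    val path = queue.dequeue()
    if (path.last == to) return path
    for (next <- routes.getOrElse(path.last, Nil) if seen.add(next))
      queue.enqueue(path :+ next)
  }
  Nil // destination unreachable
}
```
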
  • 34. Spark in Action (contd.) At any given moment, there are hundreds of flights in the sky, all generating streaming data. Sites such as flightaware.com make this data available to consumers. ____________________________________________ Can we use Spark to analyze this data in motion and possibly correlate it with historical trends? Yes! Can we do that all inside my Spark environment? Yes! 33
  • 35. Spark Streaming: scalable and fault-tolerant. Micro-batching model: input data is split into batches based on time intervals; batches are presented as RDDs; the RDD, DataFrame/SQL and GraphX APIs are available to streams. Sources include Kafka, Flume, Kinesis, Twitter, ...
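Micro-batching in miniature: timestamped events sliced into fixed 10-second buckets, the way Spark Streaming presents a stream as a sequence of small RDDs (plain Scala sketch; event times and payloads are invented):

```scala
// Slice a stream of (seconds, payload) events into 10-second micro-batches.
// Each batch plays the role of one small RDD in Spark Streaming.
val events = List((1L, "a"), (4L, "b"), (12L, "c"), (15L, "d"), (27L, "e"))
val batchSeconds = 10L

val batches: Map[Long, List[String]] =
  events.groupBy { case (t, _) => t / batchSeconds } // batch index = interval
        .map { case (batch, es) => batch -> es.map(_._2) }
```
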
  • 36. Interoperability within the Spark Stack: a sample scenario. (Diagram: Core and Model components combining data at rest and data in motion.)
  • 38. Where does Spark fit in a Hadoop world? Spark is an alternate (and faster) compute engine in your Hadoop stack: it sits alongside MapReduce in the compute layer, under cluster management (YARN) and above storage (HDFS). Consider Spark for new workloads; carefully evaluate porting existing workloads; if it ain’t broke, don’t fix it!
  • 39. Spark @ IBM Enhance it! Distribute it! Exploit it! Spark Technology Center @ SF Shipping with BigInsights Spark as a Service Inside our products (Next Gen Analytics, +++)
  • 40. IBM has made a significant investment in Spark 39 IBM Spark Technology Center San Francisco Growing pool of contributors 300+ inventors Contributed SystemML Founding member of AMPLab Partnerships in the ecosystem http://spark.tc
  • 41. IBM Analytics Platform: built on Spark; hybrid; trusted. Spark as the Analytics Operating System, with machine learning, on premises and on cloud. Data sources: RDBMS, object stores, NoSQL engines, document stores. Capabilities: Discovery & Exploration, Prescriptive Analytics, Predictive Analytics, Content Analytics, Business Intelligence, Data Management, Hadoop & NoSQL, Content Management, Data Warehousing, Information Integration & Governance. Data at rest & in motion, inside & outside the firewall, structured & unstructured. Business priorities: predict the future for the business; delight customers by understanding them better; derive business value from unstructured content. Data-centric priorities: logical data warehouse, data reservoir, fluid data layer.
  • 42. Want to learn more about Spark? ______________________________________________ Reference slides follow. 41
  • 43. 42 Notices and Disclaimers Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. 
All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
  • 44. 43 Notices and Disclaimers (con’t) Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. •  IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. 
A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
  • 45. © 2015 IBM Corporation Thank You
  • 46. We Value Your Feedback! Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference. Access your surveys at insight2015survey.com to quickly submit your surveys from your smartphone, laptop or conference kiosk. 45
  • 48. How deep do you want to go? Intro: What is Spark? How does it relate to Hadoop? When would you use it? (1-2 hours). Basic: understand the basic technology and write simple programs (1-2 days). Intermediate: start writing complex Spark programs even as you understand operational aspects (5-15 days, to weeks and months). Expert: become a Spark Black Belt! Know Spark inside out (months to years).
  • 49. Intro Spark. Go through these additional presentations to understand the value of Spark. These speakers also attempt to differentiate Spark from Hadoop, and enumerate its comparative strengths. (Not much code here.) # Turning Data into Value, Ion Stoica, Spark Summit 2013, Video & Slides, 25 mins https://spark-summit.org/2013/talk/turning-data-into-value # Spark: What’s in it for your business? Adarsh Pannu (this presentation itself ☺) # How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei Zaharia, Video & Slides, 12 mins http://conferences.oreilly.com/strata/strata2014/public/schedule/detail/33057 # Spark Fundamentals I (Lesson 1 only), Big Data University, <20 mins https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
  • 50. Basic Spark # Pick up some Scala through this article co-authored by Scala’s creator, Martin Odersky. Link http://www.artima.com/scalazine/articles/steps.html Estimated time: 2 hours
  • 51. Basic Spark (contd.) #  Do these two courses. They cover Spark basics and include a certification. You can use the supplied Docker images for all other labs. 7 hours
  • 52. Basic Spark (contd.) #  Go to spark.apache.org and study the Overview and the Spark Programming Guide. Many online courses borrow liberally from this material. Information on this site is updated with every new Spark release. Estimated 7-8 hours.
  • 53. Intermediate Spark #  Stay at spark.apache.org. Go through the component specific Programming Guides as well as the sections on Deploying and More. Browse the Spark API as needed. Estimated time 3-5 days and more.
  • 54. Intermediate Spark (contd.) Learn about the operational aspects of Spark: # Advanced Apache Spark (DevOps), 6 hours — EXCELLENT! Video https://www.youtube.com/watch?v=7ooZ4S7Ay6Y Slides https://www.youtube.com/watch?v=7ooZ4S7Ay6Y # Tuning and Debugging Spark, 48 mins, Video https://www.youtube.com/watch?v=kkOG_aJ9KjQ Gain a high-level understanding of Spark architecture: # Introduction to AmpLab Spark Internals, Matei Zaharia, 1 hr 15 mins, Video https://www.youtube.com/watch?v=49Hr5xZyTEA # A Deeper Understanding of Spark Internals, Aaron Davidson, 44 mins, Video https://www.youtube.com/watch?v=dmL0N3qfSc8 PDF https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
  • 55. Intermediate Spark (contd.) •  Experiment, experiment, experiment ... #  Setup your personal 3-4 node cluster #  Download some “open” data. E.g. “airline” data on stat-computing.org/dataexpo/2009/ #  Write some code, make it run, see how it performs, tune it, trouble-shoot it #  Experiment with different deployment modes (Standalone + YARN) #  Play with different configuration knobs, check out dashboards, etc. #  Explore all subcomponents (especially Core, SQL, MLLib)
  • 56. Advanced Spark: Original Papers. Read the original academic papers: # Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia et al. # Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia et al. # GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin et al. # Spark SQL: Relational Data Processing in Spark, Michael Armbrust et al.
  • 57. Advanced Spark: Enhance your Scala skills This book by Odersky is excellent but it isn’t meant to give you a quick start. It’s deep stuff. #  Use this as your primary Scala text #  Excellent MooC by Odersky. Some of the material is meant for CS majors. Highly recommended for STC developers. 35+ hours
  • 58. Advanced Spark: Browse Conference Proceedings. Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don’t forget to check out the customer stories: that is how we learn about enablement opportunities and challenges, and in some cases, we can see through the Spark hype ☺ 100+ hours of FREE videos and associated PDFs are available on spark-summit.org. You don’t even have to pay the conference fee! Go back in time and “attend” these conferences!
  • 59. Advanced Spark: Browse YouTube Videos YouTube is full of training videos, some good, other not so much. These are the only channels you need to watch though. There is a lot of repetition in the material, and some of the videos are from the conferences mentioned earlier.
  • 60. Advanced Spark: Check out these books Provides a good overview of Spark much of material is also available through other sources previously mentioned. Covers concrete statistical analysis / machine learning use cases. Covers Spark APIs and MLLib. Highly recommended for data scientists.
  • 61. Advanced Spark: Yes ... read the code. Even if you don’t intend to contribute to Spark, there are a ton of valuable comments in the code that provide insights into Spark’s design, and these will help you write better Spark applications. Don’t be shy! Go to github.com/apache/spark and check it out.