Yannick Pouliot Consulting © 2015 all rights reserved
Yannick Pouliot, PhD
ytpouliot@google.com
8/12/2015
Spark: New Vistas of Computing Power
Spark: An Open-Source Framework for Cluster-Based Distributed Computing
Distributed Computing à la Spark
Here, “distributed computing” means…
• Multi-node cluster: master, slaves
• Paradigm is:
o Minimize networking by allocating chunks of data to slaves
o Slaves receive code to run on their subset of the data (see the sketch after this list)
• Lots of redundancy
• No communication between nodes
o only with the master
• Operating on commodity hardware
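In Spark terms, the “code shipped to slaves” is simply the closure you pass to a transformation. A minimal generic sketch (not from the deck), to be run in the Spark shell where sc is predefined:

val numbers = sc.parallelize(1 to 1000000)   // data is split into partitions spread across the slaves
val squares = numbers.map(n => n * n)        // this closure is serialized and shipped to each slave
println(squares.count())                     // each slave computes only over its own partitions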
Distributed Computing Is All The Rage
Everywhere … Except In Academia
That Should Disturb You
Some Ideal Applications
here, using Spark’s MLlib library
The highly distributed nature of Spark means it is ideal for…
• Generating lots of…
o Trees in a random forest (see the MLlib sketch after this list)
o Permutations for computing distributions
o Samplings (bootstrapping, sub-sampling)
• Cleaning up textual data, e.g.,
o NLP on EHR records
o Mapping variant spellings of drug names to UMLS CUIs
• Normalizing datasets
• Computing basic statistics
• Hypothesis testing
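As one illustration of the random-forest bullet, here is a hedged sketch using MLlib’s RandomForest in Spark 1.x; the data path is hypothetical and the parameter values are illustrative only:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // hypothetical LIBSVM-format file
val model = RandomForest.trainClassifier(
  data,                                    // RDD[LabeledPoint]
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 500,                          // tree construction parallelizes across the cluster
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)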
Spark = Speed
• Focus of Spark is to make data analytics fast
o fast to run code
o fast to write code
• To run programs faster, Spark provides primitives for in-memory cluster computing
o a job can load data into memory and query it repeatedly (see the sketch after this list)
o much quicker than with disk-based systems like Hadoop MapReduce
• To make programming faster, Spark integrates into the Scala programming language
o Enables manipulation of distributed datasets as if they were local collections
o Can use Spark interactively to query big data from the Scala interpreter
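To make the in-memory point concrete, a minimal sketch (the file path is hypothetical): the dataset is loaded once, cached, and queried repeatedly; only the first action reads from disk.

val events = sc.textFile("hdfs:///data/events.log")      // hypothetical path
events.cache()                                           // keep partitions in memory after first computation

val errors = events.filter(_.contains("ERROR")).count()  // first action: reads from disk, then caches
val warns  = events.filter(_.contains("WARN")).count()   // later actions: served from memory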
Marketing-Level Architecture Stack
[Figure: Spark component stack, running over HDFS]
Architecture: Closer To Reality
Dynamics
Data Sources Integration
Ecosystem
Spark’s Machine Learning Library: MLlib
MLlib: Current Functionality
Spark Programming Model
Spark follows the REPL model: read–eval–print loop
o Similar to the R and Python shells
o Ideal for exploratory data analysis
Writing a Spark program typically consists of:
1. Reading some input data into local memory
2. Invoking transformations or actions that operate on a subset of the data in local memory
3. Running those transformations/actions in a distributed fashion across the network (memory or disk)
4. Deciding what actions to undertake next
Best of all, it can all be done within the shell, just like R (or Python); a sketch follows.
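A hedged sketch of that loop in the Scala shell (the path and field position are hypothetical):

val raw  = sc.textFile("hdfs:///data/patients.csv")   // 1. read some input data
val ages = raw.map(_.split(",")(2).toInt)             // 2. define transformations on the data
val avg  = ages.mean()                                // 3. the action runs distributed across the cluster
println(avg)                                          // 4. inspect the result, decide what to do next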
RDD: The Secret Sauce
RDD = Distributed Data Frame
• RDD = “Resilient Distributed Dataset”
o Similar to an R data frame, but laid out across a cluster of machines as a collection of partitions
• Partition = subset of the data
o The Spark master node remembers all transformations applied to an RDD
• if a partition is lost (e.g., a slave machine goes down), it can easily be reconstructed on some other machine in the cluster (“lineage”)
o “Resilient” = an RDD can always be reconstructed because of lineage tracking (see the sketch after this list)
• RDDs are Spark’s fundamental abstraction for representing a collection of
objects that can be distributed across multiple machines in a cluster
o val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
• 4 = number of “partitions”
• Partitions are the fundamental unit of parallelism
o val rdd2 = sc.textFile("linkage")
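A quick way to see lineage from the shell (a sketch; the exact output format varies by Spark version):

val rdd   = sc.parallelize(Array(1, 2, 2, 4), 4)
val twice = rdd.map(_ * 2).filter(_ > 2)
println(twice.toDebugString)   // prints the chain of parent RDDs: the lineage Spark uses to rebuild lost partitions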
Partitions = Data Slices
• Partitions are the “slices” a dataset is cut into
• Spark will run one task for each partition
• Typically 2-4 partitions for each CPU in the cluster
o Normally, Spark tries to set the number of partitions automatically based on your cluster
o Can also be set manually as a second parameter to parallelize (e.g., sc.parallelize(data, 4)); see the sketch below
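A minimal sketch of setting and inspecting the partition count:

val data = 1 to 1000
val rdd  = sc.parallelize(data, 8)   // explicitly request 8 partitions
println(rdd.partitions.size)         // => 8; Spark runs one task per partition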
A Brief Word About Scala
• One of the two languages Spark is built on (the other being Java)
• Compiles to Java byte code
• REPL based, so good for exploratory data analysis
• Pure OO language
• Much more streamlined than Java
o Way fewer lines of code
• Lots of type inference
• Seamless calls to Java
o E.g., you might be using a Java method inside a Scala object… (see the sketch below)
• Broad’s GATK is written in Scala
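A small generic illustration (not from the deck) of type inference and seamless Java interop:

val words = List("spark", "scala", "java")        // no type declarations needed
val upper = words.map(_.toUpperCase)              // inferred as List[String]
val id    = java.util.UUID.randomUUID().toString  // calling a plain Java class from Scala
println(upper.mkString(", ") + " " + id)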
Example: Word Count In Spark Using Scala
scala> val hamlet = sc.textFile(“/Users/akuntamukkala/temp/gutenburg.txt”)
hamlet: org.apache.spark.rdd.RDD[String] =` MappedRDD[1] at textFile at <console>:12
scala> val topWordCount = hamlet.flatMap(str=>str.split(“ “)). filter(!_.isEmpty).map(word=>(word,1)).reduceByKey(_+_).map{case (word, count) => (count, word)}.sortByKey(false)
topWordCount: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at sortByKey at <console>:14
scala>
topWordCount.take(5).foreach(x=>println(x))
(1044,the)
(730,and)
(679,of)
(648,to)
(511,I)
RDD lineage
More On Word Count
A Quick Tour of Common Spark Functions
Common Transformations

filter(func)
Purpose: create a new RDD by selecting the data elements on which func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

map(func)
Purpose: return a new RDD by applying func to each data element
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func)
Purpose: similar to map, but func returns a Seq instead of a single value; for example, mapping a sentence into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

reduceByKey(func, [numTasks])
Purpose: aggregate the values of a key using a function; “numTasks” is an optional parameter specifying the number of reduce tasks
scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks])
Purpose: convert (K,V) to (K, Iterable<V>)
scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

distinct([numTasks])
Purpose: eliminate duplicates from an RDD
scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)
Common Actions

count()
Purpose: get the number of data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.count()
Result: Long = 3

collect()
Purpose: get all the data elements in an RDD as an array
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, c)

reduce(func)
Purpose: aggregate the data elements in an RDD using a function that takes two arguments and returns one
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.reduce(_ + _)
Result: Int = 10

take(n)
Purpose: fetch the first n data elements of the RDD, computed by the driver program
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)

foreach(func)
Purpose: execute a function for each data element in the RDD; usually used to update an accumulator (discussed later) or to interact with external systems
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result: 1*10=10 4*10=40 3*10=30 2*10=20

first()
Purpose: retrieve the first data element of the RDD; similar to take(1)
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.first()
Result: Int = 1

saveAsTextFile(path)
Purpose: write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS
scala> val hamlet = sc.textFile("/users/akuntamukkala/temp/gutenburg.txt")
scala> hamlet.filter(_.contains("Shakespeare")).saveAsTextFile("/users/akuntamukkala/temp/filtered")
Result: akuntamukkala@localhost~/temp/filtered$ ls
_SUCCESS part-00000 part-00001
Common Set Operations

union()
Purpose: new RDD containing all elements from the source RDD and the argument
scala> val rdd1 = sc.parallelize(List('A', 'B'))
scala> val rdd2 = sc.parallelize(List('B', 'C'))
scala> rdd1.union(rdd2).collect()
Result: Array[Char] = Array(A, B, B, C)

intersection()
Purpose: new RDD containing only the elements common to the source RDD and the argument
scala> rdd1.intersection(rdd2).collect()
Result: Array[Char] = Array(B)

cartesian()
Purpose: new RDD containing the cross product of all elements from the source RDD and the argument
scala> rdd1.cartesian(rdd2).collect()
Result: Array[(Char, Char)] = Array((A,B), (A,C), (B,B), (B,C))

subtract()
Purpose: new RDD created by removing the data elements in the source RDD that are also in the argument
scala> rdd1.subtract(rdd2).collect()
Result: Array[Char] = Array(A)

join(RDD, [numTasks])
Purpose: when invoked on (K,V) and (K,W), creates a new RDD of (K, (V,W))
scala> val personFruit = sc.parallelize(Seq(("Andy", "Apple"), ("Bob", "Banana"), ("Charlie", "Cherry"), ("Andy", "Apricot")))
scala> val personSE = sc.parallelize(Seq(("Andy", "Google"), ("Bob", "Bing"), ("Charlie", "Yahoo"), ("Bob", "AltaVista")))
scala> personFruit.join(personSE).collect()
Result: Array[(String, (String, String))] = Array((Andy,(Apple,Google)), (Andy,(Apricot,Google)), (Charlie,(Cherry,Yahoo)), (Bob,(Banana,Bing)), (Bob,(Banana,AltaVista)))

cogroup(RDD, [numTasks])
Purpose: convert (K,V) to (K, Iterable<V>), grouping both RDDs by key
scala> personFruit.cogroup(personSE).collect()
Result: Array[(String, (Iterable[String], Iterable[String]))] = Array((Andy,(ArrayBuffer(Apple, Apricot),ArrayBuffer(Google))), (Charlie,(ArrayBuffer(Cherry),ArrayBuffer(Yahoo))), (Bob,(ArrayBuffer(Banana),ArrayBuffer(Bing, AltaVista))))
Spark comes with an R binding!
Spark and R: A Marriage Made In Heaven
SparkR: A Package For Computing with Spark
Side Bar: Running SparkR on Amazon AWS
1. Launch a Spark cluster at Amazon:
./spark-ec2 \
  --key-pair=spark-df \
  --identity-file=/Users/code/Downloads/spark-df.pem \
  --region=eu-west-1 \
  -s 1 \
  --instance-type c3.2xlarge launch mysparkr
2. Launch SparkR (on the cluster’s master node):
chmod u+w /root/spark/
./spark/bin/sparkR
AWS is probably the best way for academics to access Spark, given the complexity of deploying the infrastructure themselves.
And Not Just R: Python and Spark
Python Code Example
Questions?

Editor's Notes

  1. Generic hardware, Apache open source
  2. UMLS = Unified Medical Language System
  3. HDFS is elective
  4. http://www.slideshare.net/SparkSummit/07-venkataraman-sun?next_slideshow=1
  5. Data Frames, MLlib, ML Pipelines
  6. http://spark.apache.org/docs/latest/mllib-guide.html
  7. RDDs facilitate two types of operations: transformations and actions. A transformation is an operation such as filter(), map(), or union() on an RDD that yields another RDD. An action is an operation such as count(), first(), take(n), or collect() that triggers a computation, returns a value back to the master, or writes to a stable storage system.
  8. More later on transformations and actions
  9. Note: no type definition
  10. http://www.scala-lang.org/
  11. Many transformations are lazy: they do NOT run until an action is requested
  12. http://blog.rstudio.org/2015/07/14/spark-1-4-for-rstudio/
  13. pyspark is even cooler when run within a notebook